Hi,

I'm a support engineer, interested in DataSourceV2.

Recently I had a hard time troubleshooting whether filter pushdown was
actually applied or not.
I noticed that DataFrame's explain() method shows pushed filters even for
JSON. Whether a pushed filter is honored depends entirely on the data source
side, I believe. However, I would like Spark to have some way to confirm
whether a specific pushdown is actually applied in the data source or not.

# Example
val df = spark.read.json("s3://sample_bucket/people.json")
df.printSchema()
df.filter($"age" > 20).explain()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

== Physical Plan ==
*Project [age#47L, name#48]
+- *Filter (isnotnull(age#47L) && (age#47L > 20))
   +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON,
Location: InMemoryFileIndex[s3://sample_bucket/people.json],
PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,20)],
ReadSchema: struct<age:bigint,name:string>

# Comments
As you can see, PushedFilters is shown even though the input data is JSON.
In fact, this pushdown is not used: the JSON source cannot evaluate the
filters, and Spark re-applies them in the Filter node above the scan.
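To make the idea concrete, here is a minimal sketch of the kind of contract I
have in mind. This is a hypothetical, self-contained illustration, not the
actual Spark API: the source returns the filters it could NOT handle
(residuals), so the caller can tell which predicates were really pushed down.
The JsonLikeSource and ParquetLikeSource names below are made up for the
example.

```scala
// Hypothetical sketch (NOT the real Spark interface): a pushdown contract
// where the source reports residual filters it cannot evaluate itself.
sealed trait Filter
case class IsNotNull(attr: String) extends Filter
case class GreaterThan(attr: String, value: Long) extends Filter

trait SupportsFilterPushdown {
  // Returns the filters this source cannot evaluate; Spark must
  // re-apply these after the scan.
  def pushFilters(filters: Seq[Filter]): Seq[Filter]
  // Returns the filters the source actually accepted.
  def pushedFilters: Seq[Filter]
}

// A JSON-like source that accepts no filters: everything comes back
// as a residual, so nothing is truly pushed down.
class JsonLikeSource extends SupportsFilterPushdown {
  def pushFilters(filters: Seq[Filter]): Seq[Filter] = filters
  def pushedFilters: Seq[Filter] = Nil
}

// A Parquet-like source that can evaluate simple comparisons itself.
class ParquetLikeSource extends SupportsFilterPushdown {
  private var accepted: Seq[Filter] = Nil
  def pushFilters(filters: Seq[Filter]): Seq[Filter] = {
    accepted = filters
    Nil // no residuals: all filters handled inside the source
  }
  def pushedFilters: Seq[Filter] = accepted
}

object PushdownDemo extends App {
  val filters = Seq(IsNotNull("age"), GreaterThan("age", 20L))
  val json    = new JsonLikeSource
  val parquet = new ParquetLikeSource
  // With explicit residuals, the planner (and explain()) could report
  // pushed vs. re-applied filters accurately per source.
  println(s"JSON residuals:    ${json.pushFilters(filters)}")
  println(s"Parquet residuals: ${parquet.pushFilters(filters)}")
}
```

With residuals made explicit like this, explain() could distinguish filters
the source really applied from ones Spark has to evaluate again.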

I'm wondering whether this has already been discussed.
If not, this is a good chance to add such a feature to DataSourceV2, since it
would require some API-level changes.


Warm regards,

Noritaka Sekiyama
