BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily ending up in Spark (which would cause unnecessary overhead). In the DataSource V2 API you need to implement SupportsPushDownFilters for that.
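For reference, a minimal sketch against the Spark 2.4 DataSourceV2 interfaces. Only the Spark types are real; the reader class name and the supported-filter subset are made up for illustration:

import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.{Filter, GreaterThan, IsNotNull}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

// Hypothetical JSON reader; only the Spark interfaces are real.
class JsonPushdownReader(schema: StructType)
    extends DataSourceReader with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  override def readSchema(): StructType = schema

  // Spark hands us the candidate filters. We keep the ones we can
  // evaluate while scanning and return the rest, which Spark then
  // re-applies on top of the scan.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition {
      case _: IsNotNull | _: GreaterThan => true // example subset
      case _ => false
    }
    pushed = supported
    unsupported
  }

  // This is what explain() reports as PushedFilters.
  override def pushedFilters(): Array[Filter] = pushed

  // Partition planning elided; a real reader would apply `pushed`
  // while producing rows.
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] =
    util.Collections.emptyList()
}

Whatever you return from pushFilters() Spark keeps as a post-scan Filter, so returning all filters unchanged is always safe; it just means nothing is pushed down.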
> On 08.12.2018 at 10:50, Noritaka Sekiyama <moomind...@gmail.com> wrote:
>
> Hi,
>
> I'm a support engineer, interested in DataSourceV2.
>
> Recently I had some pain troubleshooting whether a pushdown was actually
> applied or not.
> I noticed that DataFrame's explain() method shows the pushdown even for JSON.
> It totally depends on the DataSource side, I believe. However, I would like
> Spark to have some way to confirm whether a specific pushdown is actually
> applied in the DataSource or not.
>
> # Example
> val df = spark.read.json("s3://sample_bucket/people.json")
> df.printSchema()
> df.filter($"age" > 20).explain()
>
> root
>  |-- age: long (nullable = true)
>  |-- name: string (nullable = true)
>
> == Physical Plan ==
> *Project [age#47L, name#48]
> +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON,
>       Location: InMemoryFileIndex[s3://sample_bucket/people.json],
>       PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,20)],
>       ReadSchema: struct<age:bigint,name:string>
>
> # Comments
> As you can see, PushedFilters is shown even though the input data is JSON.
> Actually this pushdown is not used.
>
> I'm wondering if this has already been discussed or not.
> If not, this is a chance to add such a feature to DataSourceV2, because it
> would require some API-level changes.
>
> Warm regards,
>
> Noritaka Sekiyama