Hello, that's an interesting question, but after Jörn's reply I am a bit puzzled.
If there is no control over the pushdown status, how can Spark guarantee the correctness of the final query? Consider a filter pushed down to the data source: either Spark has to know whether it has been applied or not, or it has to re-apply the filter anyway (and pay the price for that). Is there any other option I am not considering?

Best regards,
Alessandro

On Sat, 8 Dec 2018 at 12:32, Jörn Franke <jornfra...@gmail.com> wrote:
> BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily
> ending up in Spark (because that would cause unnecessary overhead).
> In the DataSource V2 API you need to implement SupportsPushDownFilters.
>
> On 08.12.2018 at 10:50, Noritaka Sekiyama <moomind...@gmail.com> wrote:
> >
> > Hi,
> >
> > I'm a support engineer, interested in DataSourceV2.
> >
> > Recently I had some pain troubleshooting whether a pushdown was actually
> > applied or not. I noticed that DataFrame's explain() method shows a
> > pushdown even for JSON. It totally depends on the DataSource side, I
> > believe. However, I would like Spark to have some way to confirm whether
> > a specific pushdown is actually applied in the DataSource or not.
> >
> > # Example
> > val df = spark.read.json("s3://sample_bucket/people.json")
> > df.printSchema()
> > df.filter($"age" > 20).explain()
> >
> > root
> >  |-- age: long (nullable = true)
> >  |-- name: string (nullable = true)
> >
> > == Physical Plan ==
> > *Project [age#47L, name#48]
> > +- *Filter (isnotnull(age#47L) && (age#47L > 20))
> >    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON,
> >       Location: InMemoryFileIndex[s3://sample_bucket/people.json],
> >       PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,20)],
> >       ReadSchema: struct<age:bigint,name:string>
> >
> > # Comments
> > As you can see, PushedFilters is shown even though the input data is JSON.
> > Actually this pushdown is not used.
> >
> > I'm wondering if this has already been discussed. If not, this is a chance
> > to add such a feature in DataSourceV2, because it would require some
> > API-level changes.
> >
> > Warm regards,
> >
> > Noritaka Sekiyama
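
For what it's worth, the DataSourceV2 contract Jörn mentions answers exactly this correctness question: pushFilters() returns the filters the source could NOT handle, and Spark keeps a Filter node and re-evaluates those after the scan. Below is a minimal sketch against the Spark 2.4 interfaces; the reader class name and the choice of natively supported filters are hypothetical, for illustration only.

import java.util.{Collections, List => JList}

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.{Filter, GreaterThan, IsNotNull}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

// Hypothetical reader that can evaluate IsNotNull and GreaterThan natively.
class ExampleReader(tableSchema: StructType)
  extends DataSourceReader with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  // Called once during planning. The return value is the set of filters the
  // source could NOT handle; Spark keeps a Filter node and re-applies them.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition {
      case _: IsNotNull | _: GreaterThan => true
      case _                             => false
    }
    pushed = supported
    unsupported
  }

  // What explain() reports as PushedFilters (informational only).
  override def pushedFilters(): Array[Filter] = pushed

  override def readSchema(): StructType = tableSchema

  // Scan logic omitted; an empty partition list keeps the sketch compilable.
  override def planInputPartitions(): JList[InputPartition[InternalRow]] =
    Collections.emptyList()
}

So Spark never has to trust the source blindly: a filter disappears from the plan only when the source explicitly declines to hand it back from pushFilters().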
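
Noritaka's JSON plan shows the same safety net on the classic path: Spark keeps its own *Filter node above the FileScan and re-evaluates the predicate itself, so PushedFilters there is only a report of what was offered to the format (which, as he observes, JSON does not use). For V1 relations the hook is BaseRelation.unhandledFilters, whose default hands every filter back to Spark. A sketch, with a hypothetical relation class:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Hypothetical V1 relation. unhandledFilters defaults to returning every
// filter (i.e. "I handled nothing"), so Spark re-applies them all unless a
// relation overrides it to take responsibility for specific filters.
class ExampleRelation(val sqlContext: SQLContext, val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  // The default implementation from BaseRelation, spelled out for clarity:
  // everything handed back here gets a Filter node above the scan.
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters

  // The scan may use `filters` best-effort; correctness never depends on it.
  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] =
    sqlContext.sparkContext.emptyRDD[Row]
}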