Hello,
that's an interesting question, but after Jörn's reply I am a bit puzzled.

If there is no control over the pushdown status, how can Spark guarantee the
correctness of the final query?

Consider a filter pushed down to the data source: either Spark has to know
whether it has been applied or not, or it has to re-apply the filter anyway
(and pay the price for that).
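
For reference, this is my reading of how the V2 reader contract already lets
a source report what it handled (a minimal sketch against the Spark 2.4 API
as I understand it; ExampleReader and the choice of supported filters are
made up for illustration):

import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.{Filter, GreaterThan, IsNotNull}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

// Hypothetical reader, for illustration only.
class ExampleReader(schema: StructType)
    extends DataSourceReader with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  // Spark offers the candidate filters; the source keeps what it can
  // evaluate and returns the rest. Spark re-applies the returned
  // ("unhandled") filters on top of the scan, preserving correctness.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition {
      case _: IsNotNull | _: GreaterThan => true // pretend these are all we handle
      case _                             => false
    }
    pushed = supported
    unsupported
  }

  // This is what explain() reports as PushedFilters for the scan.
  override def pushedFilters(): Array[Filter] = pushed

  override def readSchema(): StructType = schema

  // Scan planning elided; a real source would return its partitions here.
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] =
    util.Collections.emptyList()
}

If I read the plan in Noritaka's mail correctly, the Filter node is still
present above the FileScan, so the predicate is re-applied there in any case;
the open question seems to be observability rather than correctness.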

Is there any other option I am not considering?

Best regards,
Alessandro

On Sat, 8 Dec 2018 at 12:32, Jörn Franke <jornfra...@gmail.com> wrote:

> BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily
> ending up in Spark (which would cause unnecessary overhead). In the
> DataSource V2 API you need to implement SupportsPushDownFilters.
>
> > Am 08.12.2018 um 10:50 schrieb Noritaka Sekiyama <moomind...@gmail.com>:
> >
> > Hi,
> >
> > I'm a support engineer, interested in DataSourceV2.
> >
> > Recently I had some pain troubleshooting whether pushdown was actually
> > applied or not.
> > I noticed that DataFrame's explain() method shows PushedFilters even for
> > JSON.
> > I believe this depends entirely on the DataSource side. However, I would
> > like Spark to have some way to confirm whether a specific pushdown was
> > actually applied in the DataSource or not.
> >
> > # Example
> > val df = spark.read.json("s3://sample_bucket/people.json")
> > df.printSchema()
> > df.filter($"age" > 20).explain()
> >
> > root
> >  |-- age: long (nullable = true)
> >  |-- name: string (nullable = true)
> >
> > == Physical Plan ==
> > *Project [age#47L, name#48]
> > +- *Filter (isnotnull(age#47L) && (age#47L > 20))
> >    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON,
> > Location: InMemoryFileIndex[s3://sample_bucket/people.json],
> > PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,20)],
> > ReadSchema: struct<age:bigint,name:string>
> >
> > # Comments
> > As you can see, PushedFilters are shown even though the input data is
> > JSON. In fact, these pushed filters are not actually used.
> >
> > I'm wondering whether this has already been discussed.
> > If not, this is a chance to add such a feature in DataSourceV2, since it
> > would require some API-level changes.
> >
> >
> > Warm regards,
> >
> > Noritaka Sekiyama
> >
>