Hi,

Thank you for responding to this thread. I'm really interested in this discussion.

My original idea might be the same as what Alessandro said: introducing a mechanism through which Spark can communicate with the DataSource and get metadata that shows whether pushdown is supported or not. I'm wondering whether that would really be so expensive.
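To make the idea a bit more concrete, below is a purely hypothetical sketch (the trait name ReportsEnforcedFilters and its method are made up; nothing like this exists in the current API). The reader would report back which of the pushed filters it actually guarantees to enforce, so Spark could decide whether re-applying them is necessary:

import org.apache.spark.sql.sources.Filter

// Hypothetical mix-in for a DataSourceV2 reader (illustration only, not an
// existing Spark interface): after pushdown negotiation, the reader reports
// which filters it guarantees to enforce during the scan.
trait ReportsEnforcedFilters {

  // Filters the data source will definitely apply itself; Spark could then
  // skip re-evaluating exactly these and keep only the rest in the post-scan
  // Filter node.
  def enforcedFilters(): Array[Filter]
}

With something like this, explain() could also distinguish filters that were merely accepted at planning time from filters that are actually enforced by the source.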
On Mon, Dec 10, 2018 at 20:12, Alessandro Solimando <alessandro.solima...@gmail.com> wrote:

> I think you are generally right, but there are so many different scenarios that it might not always be the best option. Consider for instance a "fast" network between a single data source and Spark, lots of data, and an "expensive" (low-selectivity) expression, as Wenchen suggested.
>
> In such a case it looks to me that you end up "re-scanning" the whole dataset just to make sure the filter has been applied, whereas having that info as metadata, or via a communication protocol with the data source (if supported), would be cheaper.
>
> If there is no support at all for such a mechanism, I think it could be worth exploring the idea a bit more. However, supporting such a mechanism would require some development effort for each data source (e.g., asking the data source for the physical plan applied at query time, the ability to parse it to extract the relevant info and act on it), as I am not aware of any general interface for exchanging such information.
>
> On Sun, 9 Dec 2018 at 15:34, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> It is not about lying or not, or trust or not. Some or all filters may not be supported by a data source, and some might only be applied under certain environmental conditions (e.g. enough memory).
>>
>> It is much more expensive to communicate between Spark and a data source which filters have been applied than to simply check it as Spark does, especially if you have several different data sources at the same time (joins etc.).
>>
>> On Dec 9, 2018, at 14:30, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>> Expressions/functions can be expensive, and I do think Spark should trust the data source and not re-apply pushed filters. If a data source lies, many things can go wrong...
>>
>> On Sun, Dec 9, 2018 at 8:17 PM Jörn Franke <jornfra...@gmail.com> wrote:
>>
>>> Well, even if it has to apply it again, if pushdown is activated it is much less costly for Spark to check whether the filter has been applied. Applying the filter is negligible; what pushdown really avoids, if the file format implements it, is the IO cost (for reading) as well as the cost of converting from the file format's internal datatype to Spark's. Those two things are very expensive, but not the filter check. In the end, there can also be data-source-internal reasons not to apply a filter (there can be many, depending on your scenario, the format, etc.). Instead of "discussing" between Spark and the data source, it is much less costly for Spark to check that the filters are consistently applied.
>>>
>>> On Dec 9, 2018, at 12:39, Alessandro Solimando <alessandro.solima...@gmail.com> wrote:
>>>
>>> Hello,
>>> that's an interesting question, but after Frank's reply I am a bit puzzled.
>>>
>>> If there is no control over the pushdown status, how can Spark guarantee the correctness of the final query?
>>>
>>> Consider a filter pushed down to the data source: either Spark has to know whether it has been applied or not, or it has to re-apply the filter anyway (and pay the price for that).
>>>
>>> Is there any other option I am not considering?
>>>
>>> Best regards,
>>> Alessandro
>>>
>>> On Sat, Dec 8, 2018 at 12:32, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>>> BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily ending up in Spark (because it would cause unnecessary overhead). In the DataSource V2 API you need to implement SupportsPushDownFilters for this.
>>>>
>>>> > On Dec 8, 2018, at 10:50, Noritaka Sekiyama <moomind...@gmail.com> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > I'm a support engineer, interested in DataSourceV2.
>>>> >
>>>> > Recently I had some pain troubleshooting whether pushdown was actually applied or not.
>>>> > I noticed that DataFrame's explain() method shows pushdown even for JSON.
>>>> > I believe it totally depends on the DataSource side. However, I would like Spark to have some way to confirm whether a specific pushdown is actually applied in the DataSource or not.
>>>> >
>>>> > # Example
>>>> > val df = spark.read.json("s3://sample_bucket/people.json")
>>>> > df.printSchema()
>>>> > df.filter($"age" > 20).explain()
>>>> >
>>>> > root
>>>> >  |-- age: long (nullable = true)
>>>> >  |-- name: string (nullable = true)
>>>> >
>>>> > == Physical Plan ==
>>>> > *Project [age#47L, name#48]
>>>> > +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>>>> >    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON, Location: InMemoryFileIndex[s3://sample_bucket/people.json], PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,20)], ReadSchema: struct<age:bigint,name:string>
>>>> >
>>>> > # Comments
>>>> > As you can see, PushedFilters are shown even though the input data is JSON.
>>>> > In reality this pushdown is not used.
>>>> >
>>>> > I'm wondering whether this has already been discussed.
>>>> > If not, this is a chance to add such a feature to DataSourceV2, because it would require some API-level changes.
>>>> >
>>>> >
>>>> > Warm regards,
>>>> >
>>>> > Noritaka Sekiyama
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
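P.S. For reference, the existing DataSourceV2 contract Jörn mentioned (SupportsPushDownFilters in the Spark 2.4 API) already lets a reader hand back the filters it cannot handle, which Spark then evaluates after the scan. Below is a rough sketch of a reader implementing it; the isSupported policy and the stubbed readSchema/planInputPartitions are placeholders for illustration only:

import java.util

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.sources.{Filter, GreaterThan, IsNotNull}
import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

// Sketch of a DataSourceV2 reader (Spark 2.4 API) that negotiates filter pushdown.
class ExampleReader(schema: StructType) extends DataSourceReader with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  // Placeholder policy: pretend this source can only evaluate IsNotNull and GreaterThan.
  private def isSupported(f: Filter): Boolean = f match {
    case _: IsNotNull | _: GreaterThan => true
    case _                             => false
  }

  // Called by Spark during planning; whatever we return here is evaluated by Spark after the scan.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition(isSupported)
    pushed = supported
    unsupported
  }

  // Reported back to Spark and surfaced as PushedFilters in explain().
  override def pushedFilters(): Array[Filter] = pushed

  override def readSchema(): StructType = schema

  // Stubbed out; a real reader would plan its input partitions here.
  override def planInputPartitions(): util.List[InputPartition[InternalRow]] =
    new util.ArrayList[InputPartition[InternalRow]]()
}

The gap, as far as I can see, is that pushedFilters() only reflects what the reader accepted at planning time; there is no way to confirm afterwards whether those filters were actually enforced during the scan, which is exactly what I would like to have.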