BTW, even for JSON a pushdown can make sense, to avoid data unnecessarily 
ending up in Spark (which would cause needless overhead). 
In the Data Source V2 API you need to implement SupportsPushDownFilters.
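The key part of that contract is the residual-filter handshake: pushFilters receives the candidate filters and returns the ones the source cannot evaluate, so Spark keeps a Filter node only for those. A minimal sketch below, using hand-rolled stub types in place of Spark's real Filter classes and SupportsPushDownFilters interface so it runs standalone (the actual API differs in detail; MyJsonReader and its "only GreaterThan is supported" rule are hypothetical):

```scala
// Stub stand-ins for Spark's Filter hierarchy and the
// SupportsPushDownFilters interface (the real ones live under
// org.apache.spark.sql.sources); only for illustration.
sealed trait Filter
case class IsNotNull(attribute: String) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

trait SupportsPushDownFilters {
  // Called during planning; returns the filters the source CANNOT
  // handle, so Spark re-applies those after the scan.
  def pushFilters(filters: Array[Filter]): Array[Filter]
  // Filters the source claims to evaluate itself
  // (what explain() reports as PushedFilters).
  def pushedFilters(): Array[Filter]
}

// Hypothetical reader that only evaluates GreaterThan itself.
class MyJsonReader extends SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, residual) = filters.partition {
      case _: GreaterThan => true
      case _              => false
    }
    pushed = supported
    residual // Spark keeps a Filter node for these
  }

  override def pushedFilters(): Array[Filter] = pushed
}

val reader = new MyJsonReader
val residual =
  reader.pushFilters(Array(IsNotNull("age"), GreaterThan("age", 20)))
// residual holds IsNotNull("age"); pushedFilters() holds the GreaterThan.
```

Because the source itself decides what it accepts, a source that ignores the pushed filters at read time still reports them as pushed — which is exactly the confusion described below.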

> Am 08.12.2018 um 10:50 schrieb Noritaka Sekiyama <moomind...@gmail.com>:
> 
> Hi,
> 
> I'm a support engineer, interested in DataSourceV2.
> 
> Recently I had some pain troubleshooting whether pushdown was actually 
> applied or not.
> I noticed that DataFrame's explain() method shows pushdown even for JSON.
> It totally depends on the DataSource side, I believe. However, I would like 
> Spark to have some way to confirm whether a specific pushdown is actually 
> applied in the DataSource or not.
> 
> # Example
> val df = spark.read.json("s3://sample_bucket/people.json")
> df.printSchema()
> df.filter($"age" > 20).explain()
> 
> root
>  |-- age: long (nullable = true)
>  |-- name: string (nullable = true)
> 
> == Physical Plan ==
> *Project [age#47L, name#48]
> +- *Filter (isnotnull(age#47L) && (age#47L > 20))
>    +- *FileScan json [age#47L,name#48] Batched: false, Format: JSON, 
> Location: InMemoryFileIndex[s3://sample_bucket/people.json], 
> PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,20)], 
> ReadSchema: struct<age:bigint,name:string>
> 
> # Comments
> As you can see, PushedFilters is shown even though the input data is JSON.
> In reality, this pushdown is not used.
>    
> I'm wondering if this has already been discussed.
> If not, this is a chance to add such a feature in DataSourceV2, because it 
> would require some API-level changes.
> 
> 
> Warm regards,
> 
> Noritaka Sekiyama
> 

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
