[ 
https://issues.apache.org/jira/browse/SPARK-38041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38041:
------------------------------------

    Assignee: Apache Spark

> DataFilter pushed down with PartitionFilter
> -------------------------------------------
>
>                 Key: SPARK-38041
>                 URL: https://issues.apache.org/jira/browse/SPARK-38041
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Jackey Lee
>            Assignee: Apache Spark
>            Priority: Major
>
> At present, the Filter is divided into DataFilter and PartitionFilter when it 
> is pushed down, but when the Filter removes the PartitionFilter, it means 
> that all Partitions will scan all DataFilter conditions, which may cause full 
> data scan.
> Here is a example.
> before
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 
> 1)) AND (c#42 < 3)))
> +- *(1) ColumnarToRow
>    +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L 
> < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
> Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
> [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
> PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedGroupBy: 
> [], ReadSchema: struct<a:bigint,b:bigint>, PushedFilters: 
> [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], PushedAggregation: [], 
> PushedGroupBy: [] RuntimeFilters: []
> {code}
> after
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#40L < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 
> 1)) AND (c#42 < 3)))
> +- *(1) ColumnarToRow
>    +- BatchScan[a#40L, b#41L, c#42, d#43] ParquetScan DataFilters: [(((a#40L 
> < 10) AND (c#42 = 0)) OR (((a#40L >= 10) AND (c#42 >= 1)) AND (c#42 < 3)))], 
> Format: parquet, Location: InMemoryFileIndex(1 paths)[path, PartitionFilters: 
> [((c#42 = 0) OR ((c#42 >= 1) AND (c#42 < 3)))], PushedAggregation: [], 
> PushedFilters: 
> [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),Le...,
>  PushedGroupBy: [], ReadSchema: struct<a:bigint,b:bigint>, PushedFilters: 
> [Or(And(LessThan(a,10),EqualTo(c,0)),And(And(GreaterThanOrEqual(a,10),GreaterThanOrEqual(c,1)),LessThan(c,3)))],
>  PushedAggregation: [], PushedGroupBy: [] RuntimeFilters: [] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to