[jira] [Updated] (SPARK-38041) DataFilter pushed down with PartitionFilter

Hyukjin Kwon (Jira) Thu, 27 Jan 2022 18:14:04 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-38041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon updated SPARK-38041:
---------------------------------
    Description: 
At present, the Filter is divided into DataFilter and PartitionFilter when it 
is pushed down, but when the Filter removes the PartitionFilter, it means that 
all Partitions will scan all DataFilter conditions, which may cause full data 
scan.

Here is a example.

before
{code:java}
== Physical Plan ==
*(1) Filter (((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND 
(c#2 < 3)))
+- *(1) ColumnarToRow
   +- FileScan parquet datasources.test_push_down[a#0,b#1,c#2] Batched: true, 
DataFilters: [((a#0 < 10) OR (a#0 >= 10))], Format: Parquet, Location: 
InMemoryFileIndex(0 paths)[], PartitionFilters: [((c#2 = 0) OR ((c#2 >= 1) AND 
(c#2 < 3)))], PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], 
ReadSchema: struct<a:int,b:int> {code}
after
{code:java}
== Physical Plan ==
*(1) Filter (((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND 
(c#2 < 3)))
+- *(1) ColumnarToRow
   +- FileScan parquet datasources.test_push_down[a#0,b#1,c#2] Batched: true, 
DataFilters: [(((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND 
(c#2 < 3)))], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], 
PartitionFilters: [((c#2 = 0) OR ((c#2 >= 1) AND (c#2 < 3)))], PushedFilters: 
[Or(LessThan(a,10),GreaterThanOrEqual(a,10))], ReadSchema: struct<a:int,b:int>  
{code}

  was:
At present, the Filter is divided into DataFilter and PartitionFilter when it 
is pushed down, but when the Filter removes the PartitionFilter, it means that 
all Partitions will scan all DataFilter conditions, which may cause full data 
scan.

Here is a example.

before
{code:java}
== Physical Plan ==
*(1) Filter (((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND 
(c#2 < 3)))
+- *(1) ColumnarToRow
   +- FileScan parquet datasources.test_push_down[a#0,b#1,c#2] Batched: true, 
DataFilters: [((a#0 < 10) OR (a#0 >= 10))], Format: Parquet, Location: 
InMemoryFileIndex(0 paths)[], PartitionFilters: [((c#2 = 0) OR ((c#2 >= 1) AND 
(c#2 < 3)))], PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], 
ReadSchema: struct<a:int,b:int> {code}
after
{code:java}
== Physical Plan == *(1) Filter (((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) 
AND (c#2 >= 1)) AND (c#2 < 3))) +- *(1) ColumnarToRow    +- FileScan parquet 
datasources.test_push_down[a#0,b#1,c#2] Batched: true, DataFilters: [(((a#0 < 
10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND (c#2 < 3)))], Format: 
Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [((c#2 = 0) 
OR ((c#2 >= 1) AND (c#2 < 3)))], PushedFilters: 
[Or(LessThan(a,10),GreaterThanOrEqual(a,10))], ReadSchema: struct<a:int,b:int>  
{code}


> DataFilter pushed down with PartitionFilter
> -------------------------------------------
>
>                 Key: SPARK-38041
>                 URL: https://issues.apache.org/jira/browse/SPARK-38041
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Jackey Lee
>            Priority: Major
>
> At present, the Filter is divided into DataFilter and PartitionFilter when it 
> is pushed down, but when the Filter removes the PartitionFilter, it means 
> that all Partitions will scan all DataFilter conditions, which may cause full 
> data scan.
> Here is a example.
> before
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND 
> (c#2 < 3)))
> +- *(1) ColumnarToRow
>    +- FileScan parquet datasources.test_push_down[a#0,b#1,c#2] Batched: true, 
> DataFilters: [((a#0 < 10) OR (a#0 >= 10))], Format: Parquet, Location: 
> InMemoryFileIndex(0 paths)[], PartitionFilters: [((c#2 = 0) OR ((c#2 >= 1) 
> AND (c#2 < 3)))], PushedFilters: 
> [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], ReadSchema: 
> struct<a:int,b:int> {code}
> after
> {code:java}
> == Physical Plan ==
> *(1) Filter (((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND 
> (c#2 < 3)))
> +- *(1) ColumnarToRow
>    +- FileScan parquet datasources.test_push_down[a#0,b#1,c#2] Batched: true, 
> DataFilters: [(((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) 
> AND (c#2 < 3)))], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], 
> PartitionFilters: [((c#2 = 0) OR ((c#2 >= 1) AND (c#2 < 3)))], PushedFilters: 
> [Or(LessThan(a,10),GreaterThanOrEqual(a,10))], ReadSchema: 
> struct<a:int,b:int>  {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-38041) DataFilter pushed down with PartitionFilter

Reply via email to