Jackey Lee created SPARK-38041:
----------------------------------
Summary: DataFilter pushed down with PartitionFilter
Key: SPARK-38041
URL: https://issues.apache.org/jira/browse/SPARK-38041
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.3.0
Reporter: Jackey Lee
At present, the Filter is divided into DataFilter and PartitionFilter when it
is pushed down, but when the Filter removes the PartitionFilter, it means that
all Partitions will scan all DataFilter conditions, which may cause full data
scan.
Here is a example.
before
{code:java}
== Physical Plan ==
*(1) Filter (((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND
(c#2 < 3)))
+- *(1) ColumnarToRow
+- FileScan parquet datasources.test_push_down[a#0,b#1,c#2] Batched: true,
DataFilters: [((a#0 < 10) OR (a#0 >= 10))], Format: Parquet, Location:
InMemoryFileIndex(0 paths)[], PartitionFilters: [((c#2 = 0) OR ((c#2 >= 1) AND
(c#2 < 3)))], PushedFilters: [Or(LessThan(a,10),GreaterThanOrEqual(a,10))],
ReadSchema: struct<a:int,b:int> {code}
after
{code:java}
== Physical Plan == *(1) Filter (((a#0 < 10) AND (c#2 = 0)) OR (((a#0 >= 10)
AND (c#2 >= 1)) AND (c#2 < 3))) +- *(1) ColumnarToRow +- FileScan parquet
datasources.test_push_down[a#0,b#1,c#2] Batched: true, DataFilters: [(((a#0 <
10) AND (c#2 = 0)) OR (((a#0 >= 10) AND (c#2 >= 1)) AND (c#2 < 3)))], Format:
Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [((c#2 = 0)
OR ((c#2 >= 1) AND (c#2 < 3)))], PushedFilters:
[Or(LessThan(a,10),GreaterThanOrEqual(a,10))], ReadSchema: struct<a:int,b:int>
{code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]