[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814150#comment-17814150 ]
Jie Han edited comment on SPARK-45387 at 2/5/24 3:49 AM:
---------------------------------------------------------
I can't reproduce it in Spark 3.5.0.
I tried creating a partitioned CSV table on HDFS as follows:
{code:java}
create external table noaa (column0 string, column1 int, column2 string,
  column3 int, column4 string, column5 string, column6 string, column7 string)
PARTITIONED BY (year string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/tmp/noaa';

alter table noaa add partition (year = '2019') LOCATION '/tmp/noaa/year=2019';
alter table noaa add partition (year = '2020') LOCATION '/tmp/noaa/year=2020';
{code}
and the Spark plan is:
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false

== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63]]

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]

== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62, year#63], HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63], Pruned Partitions: [(year=2019)]], [isnotnull(year#63), (cast(year#63 as int) = 2019)]
{code}
It seems the filter has been pushed down: the optimized plan shows Pruned Partitions: [(year=2019)].
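For what it's worth, here is a minimal sketch of how this can be checked programmatically from the spark-shell (assuming the same `noaa` table as above; we simply look for the "Pruned Partitions" entry in the optimized plan's string form):
{code:java}
// Minimal sketch, spark-shell on Spark 3.5.0, assuming the `noaa` table above.
// When the partition filter reaches the metastore, the HiveTableRelation in
// the optimized plan carries a "Pruned Partitions" entry.
val df = spark.sql("select * from noaa where year = 2019 limit 10")
val optimized = df.queryExecution.optimizedPlan.toString
assert(optimized.contains("Pruned Partitions: [(year=2019)]"),
  "partition filter was not pruned at the metastore")
{code}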
> Partition key filter cannot be pushed down when using cast
> ----------------------------------------------------------
>
> Key: SPARK-45387
> URL: https://issues.apache.org/jira/browse/SPARK-45387
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0
> Reporter: TianyiMa
> Priority: Critical
> Attachments: PruneFileSourcePartitions.diff
>
>
> Suppose we have a partitioned table `table_pt` with a partition column `dt`
> of StringType, and the table metadata is managed by the Hive Metastore. If
> we filter partitions by dt = '123', the filter can be pushed down to the
> data source; but if the filter condition is a number, e.g. dt = 123, it
> cannot be pushed down, so Spark pulls all of the table's partition metadata
> to the client. This performs poorly when the table has thousands of
> partitions and increases the risk of a Hive Metastore OOM.
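To make the reported difference concrete, here is a hypothetical reproduction sketch on an affected version (the names `table_pt` and `dt` come from the description above; the exact plan text will vary):
{code:java}
// Hypothetical reproduction on an affected version (e.g. 3.4.0), using the
// reporter's table `table_pt` with a string partition column `dt`.
spark.sql("select * from table_pt where dt = '123'").explain(true)
// string literal: the predicate stays a plain comparison on the partition
// column and can be sent to the metastore as a partition filter
spark.sql("select * from table_pt where dt = 123").explain(true)
// integer literal: the analyzer rewrites the predicate to
// cast(dt as int) = 123 (the same shape as the plan above), which the
// report says cannot be converted into a metastore partition filter
{code}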