[ https://issues.apache.org/jira/browse/SPARK-45387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17814150#comment-17814150 ]
Jie Han edited comment on SPARK-45387 at 2/5/24 3:49 AM:
---------------------------------------------------------
I can't reproduce it in Spark 3.5.0.
I tried creating a partitioned CSV table on HDFS as follows:
{code:java}
create external table noaa (column0 string, column1 int, column2 string,
  column3 int, column4 string, column5 string, column6 string, column7 string)
PARTITIONED BY (year string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/tmp/noaa';

alter table noaa add partition (year = '2019') LOCATION '/tmp/noaa/year=2019';
alter table noaa add partition (year = '2020') LOCATION '/tmp/noaa/year=2020';
{code}
and the Spark plan is:
{code:java}
scala> spark.sql("select * from noaa where year=2019 limit 10").explain(true)
== Parsed Logical Plan ==
'GlobalLimit 10
+- 'LocalLimit 10
   +- 'Project [*]
      +- 'Filter ('year = 2019)
         +- 'UnresolvedRelation [noaa], [], false

== Analyzed Logical Plan ==
column0: string, column1: string, column2: string, column3: string, column4: string, column5: string, column6: string, column7: string, year: string
GlobalLimit 10
+- LocalLimit 10
   +- Project [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62, year#63]
      +- Filter (cast(year#63 as int) = 2019)
         +- SubqueryAlias spark_catalog.default.noaa
            +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63]]

== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
   +- Filter (isnotnull(year#63) AND (cast(year#63 as int) = 2019))
      +- HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63], Pruned Partitions: [(year=2019)]]

== Physical Plan ==
CollectLimit 10
+- Scan hive spark_catalog.default.noaa [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62, year#63], HiveTableRelation [`spark_catalog`.`default`.`noaa`, org.apache.hadoop.hive.serde2.OpenCSVSerde, Data Cols: [column0#55, column1#56, column2#57, column3#58, column4#59, column5#60, column6#61, column7#62], Partition Cols: [year#63], Pruned Partitions: [(year=2019)]], [isnotnull(year#63), (cast(year#63 as int) = 2019)]
{code}
It seems the filter has been pushed down: the optimized plan shows Pruned Partitions: [(year=2019)].
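For what it's worth, here is a minimal sketch of how this can be checked programmatically from the spark-shell (assuming the same `noaa` table as above; we simply look for the "Pruned Partitions" entry in the optimized plan's string form):
{code:java}
// Minimal sketch, spark-shell on Spark 3.5.0, assuming the `noaa` table above.
// When the partition filter reaches the metastore, the HiveTableRelation in
// the optimized plan carries a "Pruned Partitions" entry.
val df = spark.sql("select * from noaa where year = 2019 limit 10")
val optimized = df.queryExecution.optimizedPlan.toString
assert(optimized.contains("Pruned Partitions: [(year=2019)]"),
  "partition filter was not pruned at the metastore")
{code}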
> Partition key filter cannot be pushed down when using cast
> ----------------------------------------------------------
>
> Key: SPARK-45387
> URL: https://issues.apache.org/jira/browse/SPARK-45387
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.1, 3.1.2, 3.3.0, 3.4.0
> Reporter: TianyiMa
> Priority: Critical
> Attachments: PruneFileSourcePartitions.diff
>
>
> Suppose we have a partitioned table `table_pt` with a partition column `dt`
> of StringType, and the table metadata is managed by the Hive Metastore. If
> we filter partitions by dt = '123', the filter can be pushed down to the
> data source; but if the filter condition is a number, e.g. dt = 123, it
> cannot be pushed down, so Spark pulls all of the table's partition metadata
> to the client. This performs poorly when the table has thousands of
> partitions and increases the risk of a Hive Metastore OOM.
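To make the reported difference concrete, here is a hypothetical reproduction sketch on an affected version (the names `table_pt` and `dt` come from the description above; the exact plan text will vary):
{code:java}
// Hypothetical reproduction on an affected version (e.g. 3.4.0), using the
// reporter's table `table_pt` with a string partition column `dt`.
spark.sql("select * from table_pt where dt = '123'").explain(true)
// string literal: the predicate stays a plain comparison on the partition
// column and can be sent to the metastore as a partition filter
spark.sql("select * from table_pt where dt = 123").explain(true)
// integer literal: the analyzer rewrites the predicate to
// cast(dt as int) = 123 (the same shape as the plan above), which the
// report says cannot be converted into a metastore partition filter
{code}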