[
https://issues.apache.org/jira/browse/HUDI-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931557#comment-17931557
]
Davis Zhang edited comment on HUDI-9088 at 2/28/25 6:36 PM:
------------------------------------------------------------
For a vanilla SELECT query, partition pruning is visible as "PartitionFilters:
[isnotnull(city#144), (city#144 = san_francisco)]", so the underlying
partition pruning works:
{code:java}
spark-sql (default)> explain extended select * from hudi_table where city =
'san_francisco';
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('city = san_francisco)
+- 'UnresolvedRelation [hudi_table], [], false
== Analyzed Logical Plan ==
_hoodie_commit_time: string, _hoodie_commit_seqno: string, _hoodie_record_key:
string, _hoodie_partition_path: string, _hoodie_file_name: string, ts: bigint,
uuid: string, rider: string, driver: string, fare: double, city: string
Project [_hoodie_commit_time#134, _hoodie_commit_seqno#135,
_hoodie_record_key#136, _hoodie_partition_path#137, _hoodie_file_name#138,
ts#139L, uuid#140, rider#141, driver#142, fare#143, city#144]
+- Filter (city#144 = san_francisco)
+- SubqueryAlias spark_catalog.default.hudi_table
+- Relation
spark_catalog.default.hudi_table[_hoodie_commit_time#134,_hoodie_commit_seqno#135,_hoodie_record_key#136,_hoodie_partition_path#137,_hoodie_file_name#138,ts#139L,uuid#140,rider#141,driver#142,fare#143,city#144]
parquet
== Optimized Logical Plan ==
Filter (isnotnull(city#144) AND (city#144 = san_francisco))
+- Relation
spark_catalog.default.hudi_table[_hoodie_commit_time#134,_hoodie_commit_seqno#135,_hoodie_record_key#136,_hoodie_partition_path#137,_hoodie_file_name#138,ts#139L,uuid#140,rider#141,driver#142,fare#143,city#144]
parquet
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet
spark_catalog.default.hudi_table[_hoodie_commit_time#134,_hoodie_commit_seqno#135,_hoodie_record_key#136,_hoodie_partition_path#137,_hoodie_file_name#138,ts#139L,uuid#140,rider#141,driver#142,fare#143,city#144]
Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1
paths)[file:/tmp/lakes/hudi_table], PartitionFilters: [isnotnull(city#144),
(city#144 = san_francisco)], PushedFilters: [], ReadSchema:
struct<_hoodie_commit_time:string,_hoodie_commit_seqno:string,_hoodie_record_key:string,_hoodie_p...
{code}
For MERGE INTO (MIT) the plan shows much less information; the query planner
takes a different route. It seems that for MIT we have a fully customized
implementation in MergeIntoHoodieTableCommand, and we missed integrating
partition pruning.
{code:java}
spark-sql (default)> explain extended MERGE INTO hudi_table AS target
> USING merge_source AS source
> ON target.uuid = source.uuid
> and target.city = source.city
> WHEN MATCHED THEN UPDATE SET target.fare = target.fare;
== Parsed Logical Plan ==
'MergeIntoTable (('target.uuid = 'source.uuid) AND ('target.city =
'source.city)), [updateaction(None, assignment('target.fare, 'target.fare))]
:- 'SubqueryAlias target
: +- 'UnresolvedRelation [hudi_table], [__required_write_privileges__=UPDATE],
false
+- 'SubqueryAlias source
+- 'UnresolvedRelation [merge_source], [], false
== Analyzed Logical Plan ==
MergeIntoHoodieTableCommand MergeIntoTable ((uuid#140 = uuid#146) AND (city#144
= city#148)), [updateaction(None, assignment(fare#143, fare#143))]
== Optimized Logical Plan ==
MergeIntoHoodieTableCommand MergeIntoTable ((uuid#140 = uuid#146) AND (city#144
= city#148)), [updateaction(None, assignment(fare#143, fare#143))]
== Physical Plan ==
Execute MergeIntoHoodieTableCommand
+- MergeIntoHoodieTableCommand MergeIntoTable ((uuid#140 = uuid#146) AND
(city#144 = city#148)), [updateaction(None, assignment(fare#143, fare#143))]
{code}
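To illustrate the missing step, here is a hypothetical sketch (not Hudi's actual code) of the conjunct-splitting a fix would need: take the conjuncts of the MERGE ON condition, and separate out predicates that constrain a target partition column with a literal, since only those can seed static file pruning the way PartitionFilters does for a plain SELECT. The function and tuple encoding below are invented for illustration.

```python
# Hypothetical sketch of ON-clause conjunct splitting for partition pruning.
# Conjuncts are encoded as (left_col, op, right) tuples; any right-hand side
# starting with "source." is treated as a source column reference.

def split_partition_predicates(conjuncts, partition_cols):
    """Return (partition_preds, other_preds)."""
    partition_preds, other_preds = [], []
    for left, op, right in conjuncts:
        # A conjunct is usable for static pruning only if it constrains a
        # target partition column with a literal, not a source column.
        if left in partition_cols and not str(right).startswith("source."):
            partition_preds.append((left, op, right))
        else:
            other_preds.append((left, op, right))
    return partition_preds, other_preds

# The ON clause from the MERGE above: uuid join plus a partition-column join.
conjuncts = [
    ("target.uuid", "=", "source.uuid"),
    ("target.city", "=", "source.city"),
]
part, rest = split_partition_predicates(conjuncts, {"target.city"})
print(part)  # [] -- target.city = source.city joins against the source,
             # so it cannot prune statically; a literal predicate such as
             # target.city = 'san_francisco' would land in part instead.
print(rest)
```

Note that for a join-style predicate like `target.city = source.city`, pruning would still be possible at runtime by collecting the distinct partition values from the source side (similar to Spark's dynamic partition pruning), but that is a separate mechanism from the static PartitionFilters shown in the SELECT plan.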
> MIT not doing partition pruning when using partition columns
> ------------------------------------------------------------
>
> Key: HUDI-9088
> URL: https://issues.apache.org/jira/browse/HUDI-9088
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark-sql
> Reporter: Aditya Goenka
> Priority: Critical
> Fix For: 0.16.0, 1.0.2
>
>
> MIT is not doing partition pruning. Reproducible code -
> https://gist.github.com/ad1happy2go/584e0ce3731ab8be5093bbc2c86a002d
--
This message was sent by Atlassian Jira
(v8.20.10#820010)