[
https://issues.apache.org/jira/browse/HUDI-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931557#comment-17931557
]
Davis Zhang edited comment on HUDI-9088 at 2/28/25 6:36 PM:
------------------------------------------------------------
For a vanilla SELECT query, partition pruning is visible as "PartitionFilters:
[isnotnull(city#144), (city#144 = san_francisco)]", so the underlying
partition pruning works:
{code:java}
spark-sql (default)> explain extended select * from hudi_table where city =
'san_francisco';
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('city = san_francisco)
+- 'UnresolvedRelation [hudi_table], [], false
== Analyzed Logical Plan ==
_hoodie_commit_time: string, _hoodie_commit_seqno: string, _hoodie_record_key:
string, _hoodie_partition_path: string, _hoodie_file_name: string, ts: bigint,
uuid: string, rider: string, driver: string, fare: double, city: string
Project [_hoodie_commit_time#134, _hoodie_commit_seqno#135,
_hoodie_record_key#136, _hoodie_partition_path#137, _hoodie_file_name#138,
ts#139L, uuid#140, rider#141, driver#142, fare#143, city#144]
+- Filter (city#144 = san_francisco)
+- SubqueryAlias spark_catalog.default.hudi_table
+- Relation
spark_catalog.default.hudi_table[_hoodie_commit_time#134,_hoodie_commit_seqno#135,_hoodie_record_key#136,_hoodie_partition_path#137,_hoodie_file_name#138,ts#139L,uuid#140,rider#141,driver#142,fare#143,city#144]
parquet
== Optimized Logical Plan ==
Filter (isnotnull(city#144) AND (city#144 = san_francisco))
+- Relation
spark_catalog.default.hudi_table[_hoodie_commit_time#134,_hoodie_commit_seqno#135,_hoodie_record_key#136,_hoodie_partition_path#137,_hoodie_file_name#138,ts#139L,uuid#140,rider#141,driver#142,fare#143,city#144]
parquet
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet
spark_catalog.default.hudi_table[_hoodie_commit_time#134,_hoodie_commit_seqno#135,_hoodie_record_key#136,_hoodie_partition_path#137,_hoodie_file_name#138,ts#139L,uuid#140,rider#141,driver#142,fare#143,city#144]
Batched: true, DataFilters: [], Format: Parquet, Location: HoodieFileIndex(1
paths)[file:/tmp/lakes/hudi_table], PartitionFilters: [isnotnull(city#144),
(city#144 = san_francisco)], PushedFilters: [], ReadSchema:
struct<_hoodie_commit_time:string,_hoodie_commit_seqno:string,_hoodie_record_key:string,_hoodie_p...
{code}
For MERGE INTO (MIT) the plan shows much less information; the query planner
takes a different route. It seems that for MIT we have a fully customized
implementation in MergeIntoHoodieTableCommand, and we missed integrating
partition pruning.
{code:java}
spark-sql (default)> explain extended MERGE INTO hudi_table AS target
> USING merge_source AS source
> ON target.uuid = source.uuid
> and target.city = source.city
> WHEN MATCHED THEN UPDATE SET target.fare = target.fare;
== Parsed Logical Plan ==
'MergeIntoTable (('target.uuid = 'source.uuid) AND ('target.city =
'source.city)), [updateaction(None, assignment('target.fare, 'target.fare))]
:- 'SubqueryAlias target
: +- 'UnresolvedRelation [hudi_table], [__required_write_privileges__=UPDATE],
false
+- 'SubqueryAlias source
+- 'UnresolvedRelation [merge_source], [], false
== Analyzed Logical Plan ==
MergeIntoHoodieTableCommand MergeIntoTable ((uuid#140 = uuid#146) AND (city#144
= city#148)), [updateaction(None, assignment(fare#143, fare#143))]
== Optimized Logical Plan ==
MergeIntoHoodieTableCommand MergeIntoTable ((uuid#140 = uuid#146) AND (city#144
= city#148)), [updateaction(None, assignment(fare#143, fare#143))]
== Physical Plan ==
Execute MergeIntoHoodieTableCommand
+- MergeIntoHoodieTableCommand MergeIntoTable ((uuid#140 = uuid#146) AND
(city#144 = city#148)), [updateaction(None, assignment(fare#143, fare#143))]
{code}
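To illustrate the missing step, here is a hypothetical sketch (not Hudi's actual code) of the conjunct-splitting a fix would need: take the conjuncts of the MERGE ON condition, and separate out predicates that constrain a target partition column with a literal, since only those can seed static file pruning the way PartitionFilters does for a plain SELECT. The function and tuple encoding below are invented for illustration.

```python
# Hypothetical sketch of ON-clause conjunct splitting for partition pruning.
# Conjuncts are encoded as (left_col, op, right) tuples; any right-hand side
# starting with "source." is treated as a source column reference.

def split_partition_predicates(conjuncts, partition_cols):
    """Return (partition_preds, other_preds)."""
    partition_preds, other_preds = [], []
    for left, op, right in conjuncts:
        # A conjunct is usable for static pruning only if it constrains a
        # target partition column with a literal, not a source column.
        if left in partition_cols and not str(right).startswith("source."):
            partition_preds.append((left, op, right))
        else:
            other_preds.append((left, op, right))
    return partition_preds, other_preds

# The ON clause from the MERGE above: uuid join plus a partition-column join.
conjuncts = [
    ("target.uuid", "=", "source.uuid"),
    ("target.city", "=", "source.city"),
]
part, rest = split_partition_predicates(conjuncts, {"target.city"})
print(part)  # [] -- target.city = source.city joins against the source,
             # so it cannot prune statically; a literal predicate such as
             # target.city = 'san_francisco' would land in part instead.
print(rest)
```

Note that for a join-style predicate like `target.city = source.city`, pruning would still be possible at runtime by collecting the distinct partition values from the source side (similar to Spark's dynamic partition pruning), but that is a separate mechanism from the static PartitionFilters shown in the SELECT plan.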
> MIT not doing partition pruning when using partition columns
> ------------------------------------------------------------
>
> Key: HUDI-9088
> URL: https://issues.apache.org/jira/browse/HUDI-9088
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark-sql
> Reporter: Aditya Goenka
> Priority: Critical
> Fix For: 0.16.0, 1.0.2
>
>
> MIT is not doing partition pruning. Reproducible code -
> https://gist.github.com/ad1happy2go/584e0ce3731ab8be5093bbc2c86a002d
--
This message was sent by Atlassian Jira
(v8.20.10#820010)