[
https://issues.apache.org/jira/browse/HUDI-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17933206#comment-17933206
]
Y Ethan Guo commented on HUDI-9088:
-----------------------------------
I ran the SQL statements locally and modified the INSERT INTO slightly so that
the target table has 301 partitions for easier debugging:
{code:java}
INSERT INTO hudi_table
SELECT
1695115999911 AS timestamp, -- Creating unique timestamps based on the counter
uuid() AS uuid,
CONCAT('rider-', CAST(65 + (counter % 26) AS STRING)) AS rider,
CONCAT('driver-', CAST(75 + (counter % 26) AS STRING)) AS driver,
ROUND(rand() * (100 - 10) + 10, 2) AS amount, -- Random fare between 10 and 100
concat('p', CAST((counter % 300) AS STRING)) AS city
FROM (SELECT explode(sequence(1, 100000)) AS counter) A; {code}
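As a quick sanity check on the partition count, the following Python sketch mirrors the `city` expression from the statement above and counts the distinct values it yields. It only reproduces the arithmetic, not Spark itself, and the helper name `city_for` is made up for illustration. The INSERT alone yields 300 distinct values (p0 through p299), so the 301 figure presumably includes a partition that already existed in the table; that last point is an assumption, not something stated in the comment.

```python
def city_for(counter: int) -> str:
    """Mirror concat('p', CAST((counter % 300) AS STRING)) from the INSERT."""
    return "p" + str(counter % 300)


# sequence(1, 100000) in Spark SQL is inclusive on both ends.
cities = {city_for(c) for c in range(1, 100001)}
distinct_count = len(cities)
print(distinct_count)  # 300
```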
From the Spark jobs and their parallelism, we can see that MERGE INTO takes the
input data from the source table, and Hudi first does workload profiling on the
input to determine the affected partition(s) (Job 43). Subsequent jobs only
read data in the affected partitions for tagging input records. The log message
"Load latest base files from all partitions" may be misleading: the code logic
only looks at the affected partitions based on the input data and does not scan
all partitions.
"Listing all files in 301 partitions" needs to be investigated.
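The behavior described above can be illustrated with a small Python sketch. This is not Hudi's code: `affected_partitions`, `files_to_read`, and the in-memory `table_files` layout are hypothetical names, used only to show the idea of deriving the partitions from the incoming records first and then loading base files only for those partitions.

```python
from typing import Dict, Iterable, List, Set


def affected_partitions(records: Iterable[dict], partition_field: str) -> Set[str]:
    """Collect the distinct partition values present in the incoming records."""
    return {record[partition_field] for record in records}


def files_to_read(table_files: Dict[str, List[str]], partitions: Set[str]) -> List[str]:
    """Return base files only for the given partitions, skipping all others."""
    files: List[str] = []
    for partition in sorted(partitions):
        files.extend(table_files.get(partition, []))
    return files


# Hypothetical table layout: 300 partitions with one base file each.
table_files = {f"p{i}": [f"p{i}/base-0.parquet"] for i in range(300)}

# Hypothetical source records for a MERGE INTO that touch two partitions.
incoming = [
    {"uuid": "a", "city": "p7"},
    {"uuid": "b", "city": "p7"},
    {"uuid": "c", "city": "p42"},
]

parts = affected_partitions(incoming, "city")
selected = files_to_read(table_files, parts)
print(selected)
```

With this input, only the two base files under p7 and p42 are selected, even though the table has 300 partitions.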
!Screenshot 2025-03-06 at 20.40.10.png|width=1629,height=524!
!Screenshot 2025-03-06 at 22.38.04.png|width=1475,height=291!
!Screenshot 2025-03-06 at 22.38.14.png|width=777,height=342!
> MIT not doing partition pruning when using partition columns
> ------------------------------------------------------------
>
> Key: HUDI-9088
> URL: https://issues.apache.org/jira/browse/HUDI-9088
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: spark-sql
> Reporter: Aditya Goenka
> Assignee: Y Ethan Guo
> Priority: Critical
> Fix For: 0.16.0, 1.0.2
>
> Attachments: Screenshot 2025-03-06 at 20.40.10.png, Screenshot
> 2025-03-06 at 22.38.04.png, Screenshot 2025-03-06 at 22.38.14.png
>
> Original Estimate: 6h
> Remaining Estimate: 6h
>
> MIT not doing partition pruning. Reproducible code:
> https://gist.github.com/ad1happy2go/584e0ce3731ab8be5093bbc2c86a002d
--
This message was sent by Atlassian Jira
(v8.20.10#820010)