[ 
https://issues.apache.org/jira/browse/HUDI-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17933206#comment-17933206
 ] 

Y Ethan Guo commented on HUDI-9088:
-----------------------------------

I ran the SQL statements locally and modified the INSERT INTO slightly so that 
the target table has 301 partitions for easier debugging:

 
{code:sql}
INSERT INTO hudi_table
SELECT 
    1695115999911 AS timestamp,  -- Constant timestamp (identical for every row)
    uuid() AS uuid,
    CONCAT('rider-', CAST(65 + (counter % 26) AS STRING)) AS rider,
    CONCAT('driver-', CAST(75 + (counter % 26) AS STRING)) AS driver,
    ROUND(rand() * (100 - 10) + 10, 2) AS amount,  -- Random fare between 10 and 100
    concat('p', CAST((counter % 300) AS STRING)) AS city
FROM (SELECT explode(sequence(1, 100000)) AS counter) A;
{code}
From the Spark jobs and their parallelism, we can see that MERGE INTO takes 
the input data from the source table. Hudi first does workload profiling on 
the input to figure out the affected partition(s) (Job 43), and subsequent 
jobs only read data in the affected partition(s) to tag the input records. 
The log message "Load latest base files from all partitions" may be 
misleading: the code only looks at the partitions affected by the input data 
and does not scan all partitions.
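
To make the flow above concrete, here is a minimal Python sketch of the idea: 
profile the input to get the affected partitions, then look up base files only 
in those partitions. This is an illustration under stated assumptions, not 
Hudi's actual code; the function names, the file layout, and the sample source 
records are all hypothetical.

```python
def affected_partitions(input_records, partition_field="city"):
    """Profiling step: collect the distinct partition values seen in the input."""
    return {rec[partition_field] for rec in input_records}


def base_files_to_read(files_by_partition, partitions):
    """Look up base files only in the partitions the input touches."""
    return {p: files_by_partition.get(p, []) for p in sorted(partitions)}


# Hypothetical table layout: 301 partitions with one base file each.
files_by_partition = {f"p{i}": [f"p{i}/base-file.parquet"] for i in range(301)}

# Hypothetical MERGE INTO source: three records that fall into two partitions.
source = [
    {"uuid": "a", "city": "p1"},
    {"uuid": "b", "city": "p1"},
    {"uuid": "c", "city": "p7"},
]

touched = affected_partitions(source)
to_read = base_files_to_read(files_by_partition, touched)

print(sorted(touched))  # ['p1', 'p7']
print(len(to_read), "of", len(files_by_partition), "partitions read")  # 2 of 301 partitions read
```

In this sketch only 2 of the 301 partitions are consulted for tagging, which 
matches the behavior seen in the Spark jobs even though the log message 
mentions "all partitions".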

"Listing all files in 301 partitions" needs to be investigated.

!Screenshot 2025-03-06 at 20.40.10.png|width=1629,height=524!

!Screenshot 2025-03-06 at 22.38.04.png|width=1475,height=291!

!Screenshot 2025-03-06 at 22.38.14.png|width=777,height=342!

 

> MIT not doing partition pruning when using partition columns
> ------------------------------------------------------------
>
>                 Key: HUDI-9088
>                 URL: https://issues.apache.org/jira/browse/HUDI-9088
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: spark-sql
>            Reporter: Aditya Goenka
>            Assignee: Y Ethan Guo
>            Priority: Critical
>             Fix For: 0.16.0, 1.0.2
>
>         Attachments: Screenshot 2025-03-06 at 20.40.10.png, Screenshot 
> 2025-03-06 at 22.38.04.png, Screenshot 2025-03-06 at 22.38.14.png
>
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> MIT not doing partition pruning. Reproducible code - 
> https://gist.github.com/ad1happy2go/584e0ce3731ab8be5093bbc2c86a002d



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
