[ 
https://issues.apache.org/jira/browse/HUDI-9665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-9665:
---------------------------------
    Labels: pull-request-available  (was: )

> Repartition the write status RDD for MDT DAG to avoid long processing 
> durations
> -------------------------------------------------------------------------------
>
>                 Key: HUDI-9665
>                 URL: https://issues.apache.org/jira/browse/HUDI-9665
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Rajesh Mahindra
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.1.0
>
>
> After the data table write status is collected and prepared to be written to 
> the MDT, if the parallelism of the write status RDD is too high (for 
> instance, when hundreds of thousands of files were touched), the workload 
> profile stages for the MDT DAG can take tens of minutes. Put up a PR with a 
> small fix that repartitions the write status RDD down to a configurable 
> maximum number of partitions to reduce latency.
> For instance, in a POC we added the following code to execute() in 
> hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
> if (table.isMetadataTable()
>     && config.getProps().getInteger("hoodie.metadata.temp.repartition.parallelism", 0) > 0) {
>   inputRecords = inputRecords.repartition(
>       config.getProps().getInteger("hoodie.metadata.temp.repartition.parallelism", 16801));
> }
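The guard above repartitions only when the property is set to a positive value. A minimal stdlib sketch of that decision, with the class and method names being illustrative rather than Hudi API:

```java
public class MdtRepartitionSketch {
    // Mirrors the POC guard: when a positive cap is configured, the write
    // status RDD would be repartitioned to the cap; otherwise the existing
    // parallelism is kept unchanged.
    static int targetPartitions(int currentPartitions, int configuredCap) {
        return configuredCap > 0 ? configuredCap : currentPartitions;
    }

    public static void main(String[] args) {
        // Cap disabled (default 0): keep the RDD's existing parallelism.
        System.out.println(targetPartitions(100_000, 0));
        // Cap enabled: the MDT DAG runs with the configured parallelism.
        System.out.println(targetPartitions(100_000, 16_801));
    }
}
```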



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
