[
https://issues.apache.org/jira/browse/HUDI-9665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-9665:
---------------------------------
Labels: pull-request-available (was: )
> Repartition the write status RDD for MDT DAG to avoid long processing
> durations
> -------------------------------------------------------------------------------
>
> Key: HUDI-9665
> URL: https://issues.apache.org/jira/browse/HUDI-9665
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Rajesh Mahindra
> Assignee: sivabalan narayanan
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.1.0
>
>
> After the data table write statuses are collected and prepared for writing to
> the metadata table (MDT), the write status RDD can have very high parallelism
> (for instance, when hundreds of thousands of files were touched). In that case
> the workload profile stages of the MDT DAG can take tens of minutes. A PR with
> a small fix repartitions the write status RDD down to a configurable maximum
> number of partitions to reduce latency.
> For instance, this is what we did in a POC: we added the following code to
> execute() in
> hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
>
>   if (table.isMetadataTable()
>       && config.getProps().getInteger("hoodie.metadata.temp.repartition.parallelism", 0) > 0) {
>     inputRecords = inputRecords.repartition(
>         config.getProps().getInteger("hoodie.metadata.temp.repartition.parallelism", 16801));
>   }
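
The decision logic behind a "configurable max partitions" setting can be sketched in plain Java, without Spark or Hudi on the classpath. This is only an illustration under stated assumptions: the class RepartitionSketch and its methods configuredMax and targetPartitions are hypothetical names, the config key is the temporary one from the POC rather than a released Hudi option, and the sketch only lowers the partition count when it exceeds the cap, whereas the POC repartitions whenever the config is positive.

```java
import java.util.Properties;

public class RepartitionSketch {
    // Temporary key used in the POC; assumed here, not a released Hudi config.
    static final String KEY = "hoodie.metadata.temp.repartition.parallelism";

    /** Reads the configured cap; the default of 0 means "do not repartition". */
    static int configuredMax(Properties props) {
        return Integer.parseInt(props.getProperty(KEY, "0").trim());
    }

    /**
     * Returns the partition count to use for the write status RDD.
     * The current count is kept when the cap is disabled (<= 0) or not exceeded.
     */
    static int targetPartitions(int currentPartitions, int maxPartitions) {
        if (maxPartitions <= 0 || currentPartitions <= maxPartitions) {
            return currentPartitions;
        }
        return maxPartitions;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(KEY, "2000");
        // Hypothetical workload: 300,000 write status partitions, capped at 2,000.
        int target = targetPartitions(300_000, configuredMax(props));
        System.out.println("target partitions: " + target);
    }
}
```

In the executor, the value returned by targetPartitions would be the argument passed to inputRecords.repartition(...), and the call could be skipped entirely when the target equals the current partition count.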
--
This message was sent by Atlassian Jira
(v8.20.10#820010)