Rajesh Mahindra created HUDI-9665:
-------------------------------------
Summary: Repartition the write status RDD for MDT DAG to avoid
long processing durations
Key: HUDI-9665
URL: https://issues.apache.org/jira/browse/HUDI-9665
Project: Apache Hudi
Issue Type: Improvement
Reporter: Rajesh Mahindra
After the data table write statuses are collected and prepared for writing to the
metadata table (MDT), the write status RDD can have very high parallelism (for
instance, when hundreds of thousands of files were touched). In that case the
workload profile stages of the MDT DAG alone can take tens of minutes. The fix is
a small PR that repartitions the write status RDD down to a configurable maximum
number of partitions to reduce latency.
For instance, this is what we did in a POC. We added the following code to
execute() in
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java
    if (table.isMetadataTable()
        && config.getProps().getInteger("hoodie.metadata.temp.repartition.parallelism", 0) > 0) {
      inputRecords = inputRecords.repartition(
          config.getProps().getInteger("hoodie.metadata.temp.repartition.parallelism", 16801));
    }
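
The decision logic in the POC can be isolated from Spark as a rough sketch. The
class and method names below (MdtRepartitionSketch, targetPartitions) are
hypothetical and not part of Hudi. The sketch also assumes the setting acts as an
upper bound, so an input that already has fewer partitions is left alone; the POC
snippet itself repartitions whenever the value is positive. The config key is the
temporary one used in the POC, and the final name may differ in the PR.

```java
import java.util.Properties;

// Hypothetical helper, not Hudi API: decides how many partitions the MDT
// write status input should have before the workload profile stages run.
class MdtRepartitionSketch {

    // Temporary key taken from the POC snippet; the final name may differ.
    static final String REPARTITION_KEY = "hoodie.metadata.temp.repartition.parallelism";

    /**
     * Returns the partition count to use.
     * - Not the metadata table, or the key is unset / non-positive: keep the current count.
     * - Otherwise: cap the current count at the configured maximum (assumption:
     *   the setting is an upper bound, so smaller inputs are not expanded).
     */
    static int targetPartitions(Properties props, boolean isMetadataTable, int currentPartitions) {
        int configuredMax = Integer.parseInt(props.getProperty(REPARTITION_KEY, "0"));
        if (!isMetadataTable || configuredMax <= 0) {
            return currentPartitions;
        }
        return Math.min(currentPartitions, configuredMax);
    }

    /** True when the caller would actually need to repartition the input. */
    static boolean shouldRepartition(Properties props, boolean isMetadataTable, int currentPartitions) {
        return targetPartitions(props, isMetadataTable, currentPartitions) != currentPartitions;
    }
}
```

In execute(), the caller would invoke inputRecords.repartition(n) only when
shouldRepartition(...) returns true, with n taken from targetPartitions(...).
Reading the property once also avoids the two different defaults (0 and 16801)
that appear in the POC snippet.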