Prashant Wason created HUDI-6553:
------------------------------------
Summary: Speedup column stats and bloom index creation on large
datasets
Key: HUDI-6553
URL: https://issues.apache.org/jira/browse/HUDI-6553
Project: Apache Hudi
Issue Type: Improvement
Reporter: Prashant Wason
Assignee: Prashant Wason
During initialization of column_stats and bloom_filter MDT partitions, the code
which creates the records for these partitions is written as such:
# Create a Map of partitionName -> List of files in partition
# Parallelize the above Map
# Each executor handles a single partition
For large datasets the above design cause the following limitations:
# Each executor handles a single partition. So we cannot speed up by throwing
more executors.
# If one partitions has much larger number of files than other partitions,
then a single executor would be the bottleneck for the initialization
completion and other executors would be idle.
In this enhancement I am changing the parallelism to be at a file-level.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)