Prashant Wason created HUDI-6553:
------------------------------------

             Summary: Speedup column stats and bloom index creation on large 
datasets
                 Key: HUDI-6553
                 URL: https://issues.apache.org/jira/browse/HUDI-6553
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Prashant Wason
            Assignee: Prashant Wason


During initialization of the column_stats and bloom_filter MDT (metadata table) 
partitions, the code that creates the records for these partitions works as 
follows:
 # Create a Map of partitionName -> list of files in that partition.
 # Parallelize the above Map.
 # Each executor handles a single partition.

For large datasets, this design has the following limitations:
 # Each executor handles a single partition, so initialization cannot be sped 
up by adding more executors than there are partitions.
 # If one partition has a much larger number of files than the others, a 
single executor becomes the bottleneck for completing initialization while 
the other executors sit idle.
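
To make the skew concrete, here is a rough back-of-the-envelope sketch. It is 
not Hudi code: the class name, method names and partition names are 
hypothetical, and it assumes every file costs one equal unit of work and that 
tasks are spread perfectly evenly across executors.

```java
import java.util.Map;

// Hypothetical helper, not part of Hudi. Estimates how many "file units" of work
// the busiest executor handles, assuming every file costs the same.
class SkewEstimate {

    // One task per partition: the largest partition bounds the completion time.
    static long perPartitionBound(Map<String, Integer> filesPerPartition) {
        return filesPerPartition.values().stream()
                .mapToLong(Integer::longValue)
                .max()
                .orElse(0L);
    }

    // One task per file, spread evenly over the available executors.
    static long perFileBound(Map<String, Integer> filesPerPartition, int executors) {
        long total = filesPerPartition.values().stream()
                .mapToLong(Integer::longValue)
                .sum();
        return (total + executors - 1) / executors; // ceiling division
    }

    public static void main(String[] args) {
        // Made-up file counts: one large partition and two small ones.
        Map<String, Integer> files =
                Map.of("2023/07/01", 1000, "2023/07/02", 10, "2023/07/03", 10);
        System.out.println(perPartitionBound(files));  // 1000
        System.out.println(perFileBound(files, 10));   // 102
    }
}
```

With these made-up numbers, the executor that owns the large partition 
processes 1000 files on its own, while an even per-file split over 10 
executors would give each about 102 files.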

 

In this enhancement I am changing the parallelism to the file level, so that 
each file, rather than each partition, is an independent unit of work.
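
One way to picture file-level parallelism is to flatten the partition map 
into one (partitionName, fileName) pair per file before parallelizing. The 
sketch below uses plain Java streams only. The class name, method name, 
partition names and file names are hypothetical, and it is not the actual 
change made for this issue.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration, not Hudi code.
class FileLevelTasks {

    // Flattens partitionName -> files into one (partitionName, fileName) pair per file,
    // so that every file can be scheduled as its own task.
    static List<Map.Entry<String, String>> flatten(Map<String, List<String>> partitionToFiles) {
        return partitionToFiles.entrySet().stream()
                .flatMap(e -> e.getValue().stream()
                        .map(file -> Map.entry(e.getKey(), file)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, List<String>> partitionToFiles = Map.of(
                "2023/07/01", List.of("f1.parquet", "f2.parquet", "f3.parquet"),
                "2023/07/02", List.of("f4.parquet"));

        List<Map.Entry<String, String>> tasks = flatten(partitionToFiles);

        // Each pair is now independent. A parallel stream stands in here for
        // whatever engine-level parallelize call a real job would use.
        tasks.parallelStream()
                .forEach(t -> System.out.println(t.getKey() + " -> " + t.getValue()));
    }
}
```

Because the unit of work is now a single file, the number of tasks equals the 
number of files, so a partition with many files no longer pins all of its 
work to one executor.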

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)