[
https://issues.apache.org/jira/browse/HUDI-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Guo updated HUDI-6553:
------------------------------
Fix Version/s: 0.14.0
> Speedup column stats and bloom index creation on large datasets
> ---------------------------------------------------------------
>
> Key: HUDI-6553
> URL: https://issues.apache.org/jira/browse/HUDI-6553
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Major
> Labels: pull-request-available, release-0.14.0-blocker
> Fix For: 0.14.0
>
>
> During initialization of the column_stats and bloom_filter MDT partitions,
> the code that creates the records for these partitions works as follows:
> # Create a Map of partitionName -> List of files in partition
> # Parallelize the above Map
> # Each executor handles a single partition
> For large datasets, the above design causes the following limitations:
> # Each executor handles a single partition, so we cannot speed things up
> by adding more executors.
> # If one partition has a much larger number of files than the others, a
> single executor becomes the bottleneck for completing initialization
> while the other executors sit idle.
>
> In this enhancement I am changing the parallelism to be at the file level.
>
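The difference between the two schemes described above can be sketched in plain Python. This is only an illustration, not the Hudi code: the table layout, the file names, the executor count, the round-robin assignment and the helper busiest_executor_load are all assumptions made for the example, and a real engine schedules tasks differently.

```python
# Illustrative only: plain Python standing in for a distributed engine.

def busiest_executor_load(task_sizes, num_executors):
    """Assign tasks round-robin; return the file count on the busiest executor."""
    loads = [0] * num_executors
    for i, size in enumerate(task_sizes):
        loads[i % num_executors] += size
    return max(loads)

# Hypothetical table: one skewed partition and two small ones.
partition_to_files = {
    "2023/01": [f"2023/01/file_{i}.parquet" for i in range(1000)],
    "2023/02": [f"2023/02/file_{i}.parquet" for i in range(10)],
    "2023/03": [f"2023/03/file_{i}.parquet" for i in range(10)],
}
num_executors = 4  # assumed cluster size

# Partition-level: one task per partition, sized by its file count.
partition_tasks = [len(files) for files in partition_to_files.values()]
partition_level = busiest_executor_load(partition_tasks, num_executors)

# File-level: flatten the map to (partition, file) pairs, one task per file.
file_pairs = [(p, f) for p, files in partition_to_files.items() for f in files]
file_level = busiest_executor_load([1] * len(file_pairs), num_executors)

print(partition_level, file_level)  # prints "1000 255"
```

With partition-level tasks the busiest executor processes all 1000 files of the skewed partition while a fourth executor has nothing to do. With file-level tasks the 1020 files are spread evenly, and adding executors keeps reducing the per-executor load.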
--
This message was sent by Atlassian Jira
(v8.20.10#820010)