[ 
https://issues.apache.org/jira/browse/HUDI-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Y Ethan Guo updated HUDI-6553:
------------------------------
    Fix Version/s: 0.14.0

> Speedup column stats and bloom index creation on large datasets
> ---------------------------------------------------------------
>
>                 Key: HUDI-6553
>                 URL: https://issues.apache.org/jira/browse/HUDI-6553
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available, release-0.14.0-blocker
>             Fix For: 0.14.0
>
>
> During initialization of the column_stats and bloom_filter metadata table 
> (MDT) partitions, the code that creates the records for these partitions 
> works as follows:
>  # Create a Map of partitionName -> List of files in partition
>  # Parallelize the above Map 
>  # Each executor handles a single partition
> For large datasets, the above design has the following limitations:
>  # Each executor handles a single partition, so initialization cannot be 
> sped up by adding more executors.
>  # If one partition has many more files than the other partitions, a single 
> executor becomes the bottleneck for completing initialization while the 
> other executors sit idle.
>  
> In this enhancement I am changing the parallelism to be at the file level.
>  
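
The difference between partition-level and file-level parallelism described above can be illustrated with a minimal sketch. This is not Hudi's actual code: the listing, the function names, and build_index_record are hypothetical, and a local thread pool stands in for Spark executors. It only shows how flattening the partition -> files map into (partition, file) pairs removes the dependence on partition count and partition skew.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical listing: partition name -> files in that partition (skewed on purpose).
partition_to_files = {
    "2023/01/01": [f"a-{i}.parquet" for i in range(6)],
    "2023/01/02": ["b-0.parquet"],
    "2023/01/03": ["c-0.parquet"],
}


def build_index_record(partition, file_name):
    # Stand-in for reading one file and producing its column stats / bloom filter record.
    return (partition, file_name)


def partition_level_tasks(listing):
    # Design described in the issue: one task per partition,
    # so the number of tasks equals the number of partitions.
    return [[(p, f) for f in files] for p, files in listing.items()]


def file_level_tasks(listing):
    # Proposed direction: one task per (partition, file) pair.
    return [(p, f) for p, files in listing.items() for f in files]


def run_file_level(listing, parallelism):
    # Every file is an independent unit of work, so adding workers helps
    # even when one partition holds most of the files.
    tasks = file_level_tasks(listing)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(lambda t: build_index_record(*t), tasks))


# Largest single task under the partition-level design (the straggler).
largest_partition_task = max(len(t) for t in partition_level_tasks(partition_to_files))
records = run_file_level(partition_to_files, parallelism=4)
```

With the partition-level split, the skewed partition forms a single task covering six files, while the file-level split yields eight tasks of one file each that can be spread across any number of workers.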



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
