[GitHub] [hudi] prashantwason commented on pull request #9106: [HUDI-6118] Some fixes to improve the MDT and record index code base.

via GitHub Fri, 21 Jul 2023 09:30:52 -0700


prashantwason commented on PR #9106:
URL: https://github.com/apache/hudi/pull/9106#issuecomment-1645952466


   @danny0405 The max init values for the other indexes are too low (see 
HUDI-6553). Indexes are most useful for large datasets that have a large number 
of partitions and files. Assume a large dataset with 100K+ files. The default 
parallelism for index initialization in the code is around 200, at which the 
indexes would take HOURS to build. With a large parallelism:
   1. The actual used parallelism is min(number_of_operations, 100,000)
   2. So for small datasets, the lower value is used.
   3. For larger datasets 100K is used.
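
   To make the capping rule above concrete, here is a minimal sketch. It is not 
the Hudi implementation: the function names and the one-operation-per-file 
assumption are hypothetical and used only for illustration.

```python
import math

# Hypothetical cap, taken from the 100,000 figure discussed in this comment.
MAX_PARALLELISM = 100_000


def effective_parallelism(num_operations: int, max_parallelism: int = MAX_PARALLELISM) -> int:
    """Return min(num_operations, max_parallelism), but never less than 1."""
    return max(1, min(num_operations, max_parallelism))


def files_per_task(num_files: int, parallelism: int) -> int:
    """Rough number of files each task handles, assuming an even split."""
    return math.ceil(num_files / parallelism)


if __name__ == "__main__":
    # A small table uses the lower value; a large table is capped.
    print(effective_parallelism(50))         # small dataset
    print(effective_parallelism(6_000_000))  # large dataset

    # With 1M files, compare the work per task at 200 vs. the capped value.
    print(files_per_task(1_000_000, 200))
    print(files_per_task(1_000_000, effective_parallelism(1_000_000)))
```

   Under these assumptions, a 1M-file table at parallelism 200 gives each task 
roughly 5,000 files to process sequentially, versus about 10 per task at the 
capped value.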
   
   We routinely have datasets with over 1M files in them (as large as 6M 
files). I have tested with various parallelism values and it's not an exact 
science, but somewhere around 100,000 was where I got the fastest bootstrap of 
the indexes. Very large parallelism causes OOM and memory issues on Spark.
   
   If you leave the defaults at 200, many people would report timeouts 
building indexes on larger tables.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
