prashantwason commented on PR #9106: URL: https://github.com/apache/hudi/pull/9106#issuecomment-1645952466
@danny0405 The max init values for the other indexes are too low (see HUDI-6553). Indexes are most useful for large datasets with a large number of partitions and files. Assume a large dataset with 100K+ files. The default parallelism of index initialization in the code is around 200, at which building the indexes would take HOURS.

With a large parallelism:
1. The parallelism actually used is min(number_of_operations, 100,000).
2. So for small datasets, the lower value is used.
3. For larger datasets, 100K is used.

We routinely have datasets with over 1M files in them (as large as 6M files). I have tested various parallelism values, and it's not an exact science, but somewhere around 100,000 was where I got the fastest bootstrap of the indexes. Very large parallelism values cause OOM and memory issues on Spark.

If you leave the defaults at 200, many people will report timeouts building indexes on larger tables.
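To make the capping rule above concrete, here is a minimal sketch in Python. It is not Hudi code: the names `effective_parallelism` and `files_per_task` are hypothetical, and the even spread of files across tasks is an assumption made only to illustrate the arithmetic.

```python
import math


def effective_parallelism(num_operations: int, max_parallelism: int = 100_000) -> int:
    """Hypothetical helper: parallelism is capped by the configured maximum
    and by the number of operations, and is never below 1."""
    return max(1, min(num_operations, max_parallelism))


def files_per_task(num_files: int, max_parallelism: int) -> int:
    """Assumes files are spread evenly across tasks (an illustration only)."""
    return math.ceil(num_files / effective_parallelism(num_files, max_parallelism))


if __name__ == "__main__":
    # Compare a cap of 200 with a cap of 100,000 for a few table sizes.
    for num_files in (500, 100_000, 6_000_000):
        print(
            num_files,
            files_per_task(num_files, 200),
            files_per_task(num_files, 100_000),
        )
```

Under these assumptions, a 6M-file table capped at 200 leaves each task with 30,000 files to process, versus 60 files per task with a cap of 100,000, while a 500-file table uses only 500 tasks with the higher cap.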
