yihua opened a new pull request, #13284: URL: https://github.com/apache/hudi/pull/13284
### Change Logs

This PR improves bloom filter bucketizing in Spark. Currently, when the bloom index parallelism is not configured, the parallelism of bloom filter bucketizing is limited by the input parallelism. This becomes the bottleneck when the number of file groups and buckets (based on the number of keys per bucket) is much larger than the input parallelism.

To improve the performance of bucketized bloom filter checking, this PR adds a dynamic parallelism feature and the config `hoodie.bloom.index.bucketized.checking.with.dynamic.parallelism`. When the config is enabled and the bloom index parallelism is not configured, the index parallelism is determined dynamically from the number of file groups to look up and the number of keys per bucket, which is used to split the comparisons within a file group. In this case, the duration of each task is bounded by the latency of reading the bloom filter and checking one bucket's worth of keys against a single base file, so the skew is much more fine-grained and controllable.

### Impact

Improves bloom filter bucketizing by avoiding skew in task execution time, compared to the case where global sorting based on the fileId and key is enabled during key lookup.

### Risk level

None. The improvement is guarded by a flag and turned off by default.

### Documentation Update

Docs for the new config are added.

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
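
The parallelism selection described under Change Logs can be illustrated with a rough sketch. This is not the code from the PR: the function names, the parameter names, and the rule that the dynamic parallelism equals the total number of buckets are all assumptions made for illustration. The only details taken from the description are that an explicitly configured bloom index parallelism takes precedence, that the dynamic value depends on the file groups to look up and the keys per bucket, and that the input parallelism is the limit otherwise.

```python
# Illustrative sketch only; names and the exact formula are assumptions,
# not the implementation in this PR.

def bucket_count(keys_per_file_group, keys_per_bucket):
    """Count buckets when each file group's lookup keys are split into
    chunks of at most `keys_per_bucket` keys (ceiling division)."""
    return sum(
        -(-num_keys // keys_per_bucket)
        for num_keys in keys_per_file_group.values()
        if num_keys > 0
    )


def choose_parallelism(keys_per_file_group, keys_per_bucket,
                       configured_parallelism, input_parallelism,
                       dynamic_enabled):
    """Pick the parallelism for bucketized bloom filter checking."""
    # An explicitly configured bloom index parallelism always wins.
    if configured_parallelism > 0:
        return configured_parallelism
    # Assumed rule: with the new flag on, use one task per bucket.
    if dynamic_enabled:
        return max(1, bucket_count(keys_per_file_group, keys_per_bucket))
    # Otherwise the input parallelism is the limit.
    return input_parallelism


# Hypothetical lookup workload: number of keys to check per file group.
lookups = {"fg-1": 25_000, "fg-2": 4_000, "fg-3": 10_000}

print(choose_parallelism(lookups, 10_000, 0, 2, dynamic_enabled=True))
print(choose_parallelism(lookups, 10_000, 0, 2, dynamic_enabled=False))
```

With 10,000 keys per bucket, `fg-1` is split into three buckets and the other two file groups into one each, so under the assumed rule each task handles at most one bucket of keys for a single base file instead of sharing two input partitions.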
