yihua opened a new pull request, #13284:
URL: https://github.com/apache/hudi/pull/13284

   ### Change Logs
   
   This PR improves bloom filter bucketizing in Spark.  Currently, when the 
bloom index parallelism is not configured, the parallelism of bloom filter 
bucketizing is limited by the input parallelism.  This becomes the bottleneck 
when the number of file groups and buckets (derived from the number of keys per 
bucket) is much larger than the input parallelism.
   
   To speed up bucketized bloom filter checking, this PR adds a dynamic 
parallelism feature and the config 
`hoodie.bloom.index.bucketized.checking.with.dynamic.parallelism`.  
When the config is enabled and the bloom index parallelism is not configured, 
the index parallelism is determined dynamically from the number of file groups 
to look up and the number of keys per bucket, which splits the comparisons 
within a file group.  In this case, the duration of each task is bounded by the 
latency of reading the bloom filter and checking one bucket of keys against a 
single base file, so the skew is much more fine-grained and controllable.
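   
   As a rough illustration of the idea above (not the code in this PR), the 
dynamic parallelism can be thought of as the total number of buckets across all 
file groups being looked up.  The class and method names below are hypothetical, 
and the ceiling-based split is an assumption about how keys per bucket 
translates into buckets.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: names and the exact formula are assumptions, not Hudi APIs.
public class DynamicBloomParallelismSketch {

    // Buckets needed for one file group: ceil(candidateKeys / keysPerBucket).
    // A file group with no keys to look up contributes no buckets.
    static long bucketsForFileGroup(long candidateKeys, long keysPerBucket) {
        if (keysPerBucket <= 0) {
            throw new IllegalArgumentException("keysPerBucket must be positive");
        }
        if (candidateKeys <= 0) {
            return 0;
        }
        return (candidateKeys + keysPerBucket - 1) / keysPerBucket;
    }

    // Assumed parallelism: one task per bucket, summed over all file groups,
    // with a floor of 1 and a cap at Integer.MAX_VALUE.
    static int dynamicParallelism(Map<String, Long> keysPerFileGroup, long keysPerBucket) {
        long totalBuckets = 0;
        for (long keys : keysPerFileGroup.values()) {
            totalBuckets += bucketsForFileGroup(keys, keysPerBucket);
        }
        return (int) Math.max(1, Math.min(totalBuckets, Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        // Made-up counts of incoming keys that map to each file group.
        Map<String, Long> keysPerFileGroup = new LinkedHashMap<>();
        keysPerFileGroup.put("file-group-1", 25_000_000L); // 3 buckets
        keysPerFileGroup.put("file-group-2", 4_000_000L);  // 1 bucket
        keysPerFileGroup.put("file-group-3", 10_000_000L); // 1 bucket

        long keysPerBucket = 10_000_000L;
        System.out.println(dynamicParallelism(keysPerFileGroup, keysPerBucket)); // prints 5
    }
}
```

   With this shape, each task handles at most one bucket of keys for one file 
group, which is what keeps the per-task duration bounded regardless of the 
input parallelism.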
   
   ### Impact
   
   Improves bloom filter bucketizing so that task execution times are less 
skewed than when global sorting based on the fileId and key is enabled during 
key lookup.
   
   ### Risk level
   
   none.  The improvement is guarded by a flag and turned off by default.
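   
   A minimal sketch of opting in, assuming the new flag takes a boolean value 
and that the writer properties are assembled as a plain `java.util.Properties`. 
The class name is hypothetical, and whether 
`hoodie.bloom.index.bucketized.checking` must also be on is an assumption.  
`hoodie.bloom.index.parallelism` is deliberately left unset, since the dynamic 
parallelism only applies when it is not configured.

```java
import java.util.Properties;

// Hypothetical helper: only the property keys come from Hudi; the class is illustrative.
public class BloomIndexOptInSketch {

    static Properties bloomIndexProps() {
        Properties props = new Properties();
        // Use the bloom index and bucketized checking (assumed prerequisites).
        props.setProperty("hoodie.index.type", "BLOOM");
        props.setProperty("hoodie.bloom.index.bucketized.checking", "true");
        // New flag from this PR; off by default, so it has to be enabled explicitly.
        props.setProperty(
            "hoodie.bloom.index.bucketized.checking.with.dynamic.parallelism", "true");
        // hoodie.bloom.index.parallelism is intentionally not set here.
        return props;
    }

    public static void main(String[] args) {
        bloomIndexProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```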
   
   ### Documentation Update
   
   Docs for the new config are added.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

