vinothchandar commented on issue #2323:
URL: https://github.com/apache/hudi/issues/2323#issuecomment-744867989


   @kirkuz did you try plain `BLOOM` instead of `GLOBAL_BLOOM`? If we 
could compare GLOBAL_SIMPLE vs GLOBAL_BLOOM (or their non-global 
counterparts), that would help us understand. You can see that the amount of data 
shuffled is very different in the two cases. The expensive 
[stage](https://github.com/apache/hudi/blob/release-0.6.0/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L115)
 you see is the one that triggers the actual check against bloom filters/records.
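
   For the comparison, only the index type needs to change between runs. A 
minimal sketch, assuming the writes go through PySpark; the helper name 
`index_options` is hypothetical, and the rest of your write options (table 
name, record key, and so on) are assumed to stay exactly as they are today:

   ```python
   # Hypothetical helper: builds the one option that differs between runs.
   # "hoodie.index.type" is the Hudi index selector; everything else in the
   # write config is assumed to be unchanged and is not shown here.
   COMPARABLE_INDEX_TYPES = ("BLOOM", "GLOBAL_BLOOM", "SIMPLE", "GLOBAL_SIMPLE")


   def index_options(index_type):
       """Return the write option that selects the given index type."""
       if index_type not in COMPARABLE_INDEX_TYPES:
           raise ValueError("unexpected index type: " + index_type)
       return {"hoodie.index.type": index_type}


   # One option set per run, so the Spark UI stages can be compared side by side.
   for index_type in COMPARABLE_INDEX_TYPES:
       print(index_options(index_type))
   ```

   Each dict would be merged into the existing options, e.g. 
`df.write.format("hudi").options(**existing_options, **index_options("BLOOM"))`, 
where `existing_options` stands for whatever you already pass.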
   
   From the screenshots above, I suspect the bloom filter is not very effective, 
i.e. it produces lots of false positives, which leads to reading the entire set of 
data back for comparison (which is what the Simple index ends up doing, but without 
the overhead of checking filters, ranges, etc.). 
   
   To improve the bloom filter's false positive probability (fpp), you could use 
the dynamic bloom filter, which adjusts its size automatically based on the number 
of entries. 
   
   ```
   hoodie.bloom.index.filter.type=DYNAMIC_V0
   ```
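
   To see why a fixed-size filter can stop helping, here is a rough sketch 
using the standard bloom filter approximation `(1 - e^(-k*n/m))^k`. The 
numbers (60,000 design entries, 1e-9 target fpp) are illustrative 
assumptions, not values read from your table, and this is not Hudi's own 
implementation:

   ```python
   import math


   def optimal_bits(design_entries, target_fpp):
       """Bits needed so a filter holding design_entries keys meets target_fpp."""
       return math.ceil(-design_entries * math.log(target_fpp) / (math.log(2) ** 2))


   def optimal_hashes(bits, design_entries):
       """Number of hash functions that minimizes fpp for the given sizing."""
       return max(1, round(bits / design_entries * math.log(2)))


   def expected_fpp(bits, hashes, actual_entries):
       """Approximate false positive probability after actual_entries inserts."""
       return (1.0 - math.exp(-hashes * actual_entries / bits)) ** hashes


   # Illustrative sizing (assumed values): filter built for 60k keys per file.
   DESIGN_ENTRIES = 60_000
   TARGET_FPP = 1e-9
   BITS = optimal_bits(DESIGN_ENTRIES, TARGET_FPP)
   HASHES = optimal_hashes(BITS, DESIGN_ENTRIES)

   # The fpp rises sharply once a file holds far more keys than the filter
   # was sized for, which is the situation a dynamic filter is meant to avoid.
   for actual in (60_000, 600_000, 6_000_000):
       print(actual, expected_fpp(BITS, HASHES, actual))
   ```

   Once the fpp gets close to 1, nearly every lookup falls through to reading 
the file, so the filter check is pure overhead.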
   
   I initially suspected that reading all the file ranges etc. was what takes the 
time, but that's not it. 
   



