vinothchandar commented on issue #2323: URL: https://github.com/apache/hudi/issues/2323#issuecomment-744867989
@kirkuz did you try simply `BLOOM` instead of `GLOBAL_BLOOM`? If we could compare `GLOBAL_SIMPLE` vs `GLOBAL_BLOOM` (or their non-global counterparts), that would help us understand what is going on.

You can see that the amount of data shuffled is very different in the two cases. The expensive [stage](https://github.com/apache/hudi/blob/release-0.6.0/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java#L115) you see is the one that triggers the actual check against bloom filters/records. From the screenshots above, I suspect the bloom filter is not very effective, i.e. it produces lots of false positives, which leads to reading the entire set of data back for comparison (this is what the Simple index ends up doing anyway, but without the overhead of checking filters, ranges etc.).

To improve the bloom filter's false positive probability (fpp), you could use the dynamic bloom filter, which adjusts its size automatically based on the number of entries.

```
hoodie.bloom.index.filter.type=DYNAMIC_V0
```

I initially suspected that reading all the file ranges etc. was what takes the time, but that's not it.
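To make the false-positive suspicion concrete, here is a rough sketch using the textbook bloom filter formulas. It is not Hudi's implementation. The sizing numbers (`SIZED_FOR`, `TARGET_FPP`) and the helper names are illustrative assumptions, not values read from any Hudi configuration. It shows how quickly a fixed-size filter degrades once a file holds more keys than the filter was sized for.

```python
import math


def optimal_bits(n_entries: int, target_fpp: float) -> int:
    """Bits needed for a filter holding n_entries keys at target_fpp (standard formula)."""
    return math.ceil(-n_entries * math.log(target_fpp) / (math.log(2) ** 2))


def optimal_hashes(n_bits: int, n_entries: int) -> int:
    """Number of hash functions that minimises the fpp for this bits-per-key ratio."""
    return max(1, round(n_bits / n_entries * math.log(2)))


def actual_fpp(n_bits: int, n_hashes: int, n_inserted: int) -> float:
    """Approximate false positive probability after n_inserted keys were added."""
    return (1.0 - math.exp(-n_hashes * n_inserted / n_bits)) ** n_hashes


# Hypothetical sizing, chosen only for illustration.
SIZED_FOR = 60_000
TARGET_FPP = 1e-9

bits = optimal_bits(SIZED_FOR, TARGET_FPP)
hashes = optimal_hashes(bits, SIZED_FOR)

# The filter size stays fixed while the number of keys per file grows.
for inserted in (60_000, 120_000, 600_000, 6_000_000):
    fpp = actual_fpp(bits, hashes, inserted)
    print(f"{inserted:>9} keys -> fpp ~ {fpp:.3g}")
```

Under these assumptions, a filter that meets its target at the sized key count answers "maybe present" for almost every lookup once the file holds roughly ten times as many keys. At that point every candidate file has to be read anyway, which matches the behaviour described above. A filter that grows with the number of entries avoids this.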
