n3nash commented on issue #2620:
URL: https://github.com/apache/hudi/issues/2620#issuecomment-827321400


   @kimberlyamandalu @njalan @codejoyan There are a few potential problems 
when using BLOOM_INDEX:
   
   1. If the BLOOM_INDEX num_entries is not configured to match the number of 
entries in each parquet file, the bloom filter produces many false positives, 
and the bloom index spends more time looking up data. 
You can check the default bloom index entries here -> 
https://github.com/apache/hudi/blob/5be3997f70415e1752a0b5214f9398880fc8fd1f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java#L47.
 You can either increase this value or use the dynamic bloom filter. We are 
working on adding metrics that report how many such false positives occurred. 
   2. The BLOOM_INDEX step needs to perform a "listing" of the partitions to 
find the candidate files. On S3, this listing can be slow unless the metadata 
table is enabled (`hoodie.metadata.enable`). Enable that config to eliminate 
these file listings.
   3. Depending on your workload, BLOOM_INDEX may not be the ideal choice. 
For example, if your updates are spread across all partitions, SIMPLE_INDEX is 
better: the bloom index does some extra work first and then still does the 
work that SIMPLE_INDEX would have done anyway. 
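
To make the three points concrete, here is a minimal sketch rather than anything taken from the Hudi code base. The function uses the standard textbook bloom filter sizing formulas to estimate how the false positive rate grows when a filter sized for `num_entries` keys ends up holding more keys than that. The option dictionaries list the write configs I would expect to adjust. The key names (`hoodie.index.bloom.num_entries`, `hoodie.bloom.index.filter.type`, `hoodie.metadata.enable`, `hoodie.index.type`), the `DYNAMIC_V0` value, and the 60000 / 1e-9 defaults are assumptions on my part and should be checked against the `HoodieIndexConfig` of the Hudi version in use. The 600000 figure is purely illustrative.

```python
import math


def bloom_false_positive_rate(sized_for: int, target_fpp: float, actual_entries: int) -> float:
    """Estimate the false positive rate of a bloom filter that was sized for
    `sized_for` keys at `target_fpp` but actually holds `actual_entries` keys.

    Uses the standard textbook formulas, not Hudi's own implementation.
    """
    # Bits needed to hit target_fpp with sized_for keys.
    bits = math.ceil(-sized_for * math.log(target_fpp) / (math.log(2) ** 2))
    # Number of hash functions that is optimal for that bits-per-key ratio.
    hashes = math.ceil(bits / sized_for * math.log(2))
    # Expected false positive rate once actual_entries keys are inserted.
    return (1.0 - math.exp(-hashes * actual_entries / bits)) ** hashes


# Assumed defaults; verify against HoodieIndexConfig for your Hudi version.
ASSUMED_DEFAULT_NUM_ENTRIES = 60000
ASSUMED_DEFAULT_FPP = 1e-9

# Points 1 and 2: size the filter for the real file size (or use the dynamic
# filter) and enable the metadata table to avoid S3 listings.
tuned_bloom_options = {
    "hoodie.index.type": "BLOOM",
    "hoodie.index.bloom.num_entries": "600000",      # illustrative value
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",  # assumed name of the dynamic filter
    "hoodie.metadata.enable": "true",
}

# Point 3: when updates touch every partition, skip the bloom lookup entirely.
simple_index_options = {
    "hoodie.index.type": "SIMPLE",
}

if __name__ == "__main__":
    for actual in (60000, 120000, 600000):
        rate = bloom_false_positive_rate(
            ASSUMED_DEFAULT_NUM_ENTRIES, ASSUMED_DEFAULT_FPP, actual
        )
        print(f"{actual} keys in a filter sized for 60000: fpp ~ {rate:.3g}")
```

With a Spark DataFrame writer, either dictionary would be passed through `.options(**tuned_bloom_options)` alongside the usual Hudi write options.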

