n3nash commented on issue #2620: URL: https://github.com/apache/hudi/issues/2620#issuecomment-827321400
@kimberlyamandalu @njalan @codejoyan There are a few problems to watch for when using BLOOM_INDEX:

1. If the BLOOM_INDEX num_entries is not configured to match the number of entries in each parquet file, the filter produces many false positives, and the bloom index then spends more time looking up data. You can check the default bloom index entries here -> https://github.com/apache/hudi/blob/5be3997f70415e1752a0b5214f9398880fc8fd1f/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java#L47. You can either increase this value or use the dynamic bloom filter. We are working on adding metrics that report how many such false positives occurred.
2. The BLOOM_INDEX step needs to perform a "listing" of the partitions to find the candidate files. On S3, without `hoodie.metadata.table` enabled, this listing can take time. Enable the config to eliminate these file listings.
3. Depending on your workload, BLOOM_INDEX may not be the ideal choice. For example, if you have updates across all your partitions, SIMPLE_INDEX is better, since bloom will do some extra work and then do the work that SIMPLE_INDEX would have done anyway.
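
To make point 1 concrete, here is a rough sketch of how quickly the false-positive rate grows once a file holds more keys than the filter was sized for. It uses the textbook Bloom filter sizing formulas, not Hudi's actual code, and the entry count and target rate are illustrative assumptions rather than confirmed defaults. The option keys in `hudi_options` are also assumptions about how the settings mentioned in points 1 and 2 are spelled; check them against the configuration reference for your Hudi version.

```python
import math


def bloom_bits(num_entries: int, fpp: float) -> int:
    """Bits needed by a textbook Bloom filter sized for num_entries at target fpp."""
    return math.ceil(-num_entries * math.log(fpp) / (math.log(2) ** 2))


def bloom_hashes(num_bits: int, num_entries: int) -> int:
    """Near-optimal number of hash functions for the given size."""
    return max(1, round(num_bits / num_entries * math.log(2)))


def expected_fpp(num_bits: int, num_hashes: int, actual_entries: int) -> float:
    """Approximate false-positive rate after inserting actual_entries keys."""
    return (1.0 - math.exp(-num_hashes * actual_entries / num_bits)) ** num_hashes


# Illustrative assumptions, not values taken from the comment above.
configured_entries = 60_000
target_fpp = 1e-9

bits = bloom_bits(configured_entries, target_fpp)
hashes = bloom_hashes(bits, configured_entries)

# Compare a correctly sized filter with files holding 2x and 10x more keys.
for actual in (configured_entries, 2 * configured_entries, 10 * configured_entries):
    print(f"{actual:>8} keys per file -> fpp ~ {expected_fpp(bits, hashes, actual):.3g}")

# Hypothetical writer options; key names and values are assumptions to verify.
hudi_options = {
    "hoodie.index.type": "BLOOM",
    "hoodie.index.bloom.num_entries": str(10 * configured_entries),
    "hoodie.bloom.index.filter.type": "DYNAMIC_V0",  # dynamic bloom filter
    "hoodie.metadata.enable": "true",  # avoid partition listings on S3
}
```

In this sketch, a filter sized for its configured entry count stays near the target rate, while a file with ten times as many keys pushes the rate close to 1, at which point the index checks almost every candidate file anyway.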