vinothchandar commented on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-711466156


   >However what's the recommended approach in terms of indexing here ? I see 
various features are available out of the box.
   
   Been meaning to write a blog that walks through the options here. Interested 
in being a reviewer? That will also help explain this better for yourself :) 
   
   In short, you can pick options based on your workload (we intend to make 
dynamic_bloom default in 0.7.0 going forward, which should help with this 
issue?) 
   
   - If you have records where there are ordered keys (e.g. a timestamp prefix), 
then Bloom index with range pruning will do an excellent job. It will be able 
to quickly prune out a large number of files to compare against and just use 
bloom filters for the rest.
   - If you have records with no ordering in them (e.g. uuid), but the pattern 
is such that mostly the recent partitions are updated, with a long tail of 
updates/deletes to the older partitions, then bloom index will still be faster. 
But it is better to turn off range pruning, since it does not help and just 
incurs the cost of checking.
   
   - If your update patterns are totally random, i.e. each commit affects almost 
every file, you can use SIMPLE_INDEX (which will join against the entire table) 
or, if you have HBase, the HBASE index (you can write your own index as well, 
it's pluggable). We are working on adding a record level index natively within 
Hudi, hopefully in the next major release.
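
   To make the three cases above concrete, here is a rough sketch of how they 
could map to writer options. The helper `hudi_index_options` and its arguments 
are made up for illustration; the option keys (`hoodie.index.type`, 
`hoodie.bloom.index.prune.by.ranges`) are the ones I'd expect from the Hudi 
configs, but please verify them against the version you run.

```python
def hudi_index_options(keys_ordered, updates_random, have_hbase=False):
    """Hypothetical helper: map workload traits to Hudi index write options.

    keys_ordered   -- record keys carry an ordering (e.g. a timestamp prefix)
    updates_random -- each commit touches almost every file
    have_hbase     -- an HBase cluster is available for the index
    """
    if updates_random:
        # Bloom filters buy little when nearly every file is affected, so
        # join against the whole table (SIMPLE) or look keys up in HBase.
        # Note: the HBASE index needs further connection settings that are
        # not shown here.
        return {"hoodie.index.type": "HBASE" if have_hbase else "SIMPLE"}

    # Otherwise use the bloom index. Range pruning only pays off when the
    # keys are ordered; with random keys (e.g. uuid) it is pure overhead.
    return {
        "hoodie.index.type": "BLOOM",
        "hoodie.bloom.index.prune.by.ranges": "true" if keys_ordered else "false",
    }


if __name__ == "__main__":
    # uuid keys, updates mostly hitting recent partitions
    print(hudi_index_options(keys_ordered=False, updates_random=False))
```

   The returned dict is meant to be merged into the rest of your write options, 
e.g. passed along via `.options(**opts)` on a Spark DataFrame writer.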
   
   
   



