vinothchandar commented on issue #2323:
URL: https://github.com/apache/hudi/issues/2323#issuecomment-745748741


   @kirkuz 
   1. A GLOBAL index (either GLOBAL_BLOOM or GLOBAL_SIMPLE) with the config set 
to update the partition path will solve the problem for you. Indexing is not 
that well explained, to be honest, but there is a 
[draft](https://github.com/apache/hudi/pull/2245) of an upcoming blog post that 
can give you some context. At a high level, you use SIMPLE (or GLOBAL_SIMPLE) 
if you believe the updates spread uniformly across files, such that neither 
bloom filters nor range pruning will help much in cutting down the number 
of files inspected. 
   
   2. I think the issue has more to do with false positives and, as a result, 
a lot of data being shuffled. We can tune the cores/memory etc. down the line. 
Your comparison against SIMPLE was a bit apples-to-oranges, because SIMPLE only 
reads the 200-odd partitions that are actually affected by the write, whereas 
GLOBAL_BLOOM inspects all 4000-odd partitions. 
   
   3. GLOBAL_BLOOM works well when used with small dimension tables. For large 
tables, people either use HBase or tune the bloom index. We also plan to add 
record-level indexes (RFC-08) in the next release, which give HBase-like 
performance without the external system overhead, but that's not there at the 
moment. 
   
   4. No, I don't think so. Index lookup actually likes it when batched, since 
it can then amortize a lot of these costs. 
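
   To make point 1 concrete, here is a minimal sketch of the write options 
for a global index that moves a record when its partition path changes. The 
helper name `global_index_options` is hypothetical, and the config keys are 
the ones I recall from the Hudi configuration reference, so please verify them 
against the Hudi version you run.

```python
# Hypothetical helper: builds Hudi write options as a plain dict.
# Key names are assumed from the Hudi config reference; check your version.
UPDATE_PARTITION_PATH_KEYS = {
    "GLOBAL_BLOOM": "hoodie.bloom.index.update.partition.path",
    "GLOBAL_SIMPLE": "hoodie.simple.index.update.partition.path",
}


def global_index_options(index_type: str) -> dict:
    """Return options selecting a global index with partition path updates on."""
    if index_type not in UPDATE_PARTITION_PATH_KEYS:
        raise ValueError(f"not a global index type: {index_type}")
    return {
        "hoodie.index.type": index_type,
        # When a record key shows up under a new partition path, move the
        # record there instead of updating it in its old partition.
        UPDATE_PARTITION_PATH_KEYS[index_type]: "true",
    }


opts = global_index_options("GLOBAL_BLOOM")
```

   These would be merged into the options you already pass to the Hudi 
writer, e.g. `df.write.format("hudi").options(**opts)`.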
   
   As concrete next steps, I suggest turning on dynamic bloom filters and 
getting two runs: one with BLOOM and one with GLOBAL_BLOOM. We can go from 
there. 
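
   A rough sketch of how the two runs could be configured, assuming the 
dynamic bloom filter is selected via `hoodie.bloom.index.filter.type` set to 
`DYNAMIC_V0` and capped via `hoodie.bloom.index.filter.dynamic.max.entries` 
(key names as I recall them, so double-check against your Hudi version). The 
helper `bloom_run_options` and the `max_entries` value are illustrative only.

```python
# Hypothetical helper: one options dict per run, differing only in index type.
def bloom_run_options(index_type: str, max_entries: int = 100000) -> dict:
    """Return Hudi write options for a bloom-based index with dynamic filters."""
    if index_type not in ("BLOOM", "GLOBAL_BLOOM"):
        raise ValueError(f"not a bloom index type: {index_type}")
    return {
        "hoodie.index.type": index_type,
        # Assumed key/value for enabling dynamic bloom filters.
        "hoodie.bloom.index.filter.type": "DYNAMIC_V0",
        # Assumed key: upper bound on entries the dynamic filter may grow to.
        "hoodie.bloom.index.filter.dynamic.max.entries": str(max_entries),
    }


# Same input batch and cluster settings for both runs; only the index changes.
runs = {name: bloom_run_options(name) for name in ("BLOOM", "GLOBAL_BLOOM")}
```

   Keeping everything except `hoodie.index.type` identical makes the two 
runs directly comparable.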

