vinothchandar commented on issue #2323: URL: https://github.com/apache/hudi/issues/2323#issuecomment-745748741
@kirkuz

1. GLOBAL indexes with the config set to update the partition path will solve the problem for you: either GLOBAL_BLOOM or GLOBAL_SIMPLE. Indexing is not that well explained tbh, but there is a [draft](https://github.com/apache/hudi/pull/2245) of an upcoming blog, which can help you with some context. At a high level, you use SIMPLE (or GLOBAL_SIMPLE) if you believe the update pattern spreads uniformly across files, such that bloom filters and range pruning will not help much in cutting down the number of files inspected.
2. I think the issue has more to do with false positives and, as a result, a lot of data being shuffled. We can tune the cores/memory etc. down the line. Your comparison against SIMPLE was a bit apples-to-oranges, because SIMPLE only reads the 200-odd partitions that are actually affected by the write, as opposed to GLOBAL_BLOOM, which inspects all 4000-odd partitions.
3. GLOBAL_BLOOM works when used with small dimension tables. For large tables, people either use HBase or tune the bloom index. We also plan to add record-level indexes (RFC-08) in the next release, which gives HBase-like performance without the external system overhead, but it's not there atm.
4. No, I don't think so. Index lookup actually likes it when batched, so it can amortize a lot of these costs.

As concrete next steps, I suggest turning on dynamic bloom filters and getting two runs, one for BLOOM and one for GLOBAL_BLOOM, and we go from there.
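
For reference, here is a rough sketch of how the writer options for the two suggested runs might be assembled. It assumes the option keys `hoodie.index.type`, `hoodie.bloom.index.update.partition.path`, `hoodie.simple.index.update.partition.path` and `hoodie.bloom.index.filter.type` (with value `DYNAMIC_V0`) from the Hudi configuration reference; please check them against the Hudi version you run. The helper `hudi_index_options` is a hypothetical name used only for illustration and is not part of Hudi.

```python
# Sketch only: builds plain option dicts, it does not touch Spark or Hudi.
# The option keys are assumed from the Hudi configuration reference;
# verify them against your Hudi version before relying on them.

ALLOWED_INDEX_TYPES = {"BLOOM", "GLOBAL_BLOOM", "SIMPLE", "GLOBAL_SIMPLE"}


def hudi_index_options(index_type: str, dynamic_bloom: bool = True) -> dict:
    """Hypothetical helper: index-related writer options for one test run."""
    if index_type not in ALLOWED_INDEX_TYPES:
        raise ValueError(f"unsupported index type: {index_type}")

    opts = {"hoodie.index.type": index_type}

    # Global indexes: let an update move the record to its new partition path.
    if index_type == "GLOBAL_BLOOM":
        opts["hoodie.bloom.index.update.partition.path"] = "true"
    elif index_type == "GLOBAL_SIMPLE":
        opts["hoodie.simple.index.update.partition.path"] = "true"

    # Dynamic bloom filters only apply to the bloom-based index types.
    if dynamic_bloom and index_type in {"BLOOM", "GLOBAL_BLOOM"}:
        opts["hoodie.bloom.index.filter.type"] = "DYNAMIC_V0"

    return opts


# The two runs suggested above: one for BLOOM, one for GLOBAL_BLOOM.
bloom_run = hudi_index_options("BLOOM")
global_bloom_run = hudi_index_options("GLOBAL_BLOOM")
```

Each dict would then be merged with the usual table options (record key, partition path, table name and so on) and passed to the writer, for example via `df.write.format("hudi").options(**bloom_run)`.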
