nsivabalan edited a comment on issue #2178: URL: https://github.com/apache/hudi/issues/2178#issuecomment-711168471
I guess the small record size of 35 bytes throws it off, so let's see what we can do. Before I go further, let me recap the SIMPLE bloom. A bloom filter statically allocates its size based on numEntries and fpp. So, irrespective of whether you add 50k entries, 500k entries, or 1.5M entries (as per your config), the bloom size is going to be 10MB or so, but it will guarantee the 1 * 10^-9 false positive probability. That's how a typical bloom works: if you initialize it for 1.5M entries with 1 * 10^-9 fpp, it is going to allocate that many buckets up front. Once you exceed the configured number of entries, the fpp may no longer be guaranteed; in other words, lookups will return more false positives than 1 * 10^-9.

Coming back to the problem: with SIMPLE bloom, I guess we can't do much given the small record size. And yes, all the configs you mentioned will help reduce the time spent during index lookup. At a high level, these are the steps done during index lookup:
- Do a range lookup and filter out those data files whose [min, max] key range does not match the input record (for each input record).
- From the filtered set, do a bloom lookup to further trim down the data files to be checked.
- After all this filtering, look up each record key in the remaining data files and return the matched location if found.

So, a few things to note here:
- The range filtering will work only if your data is laid out such that each data file holds a distinct subset of the entire dataset's key space. If every data file's range has more or less the same min and max values, then this filtering may not help much.
- Bloom filter lookup: again, depending on the fpp it was initialized with, it will trim down most data files when the key is not found.

Having said all this, here is a rough idea of the bloom filter size for different values of numEntries and fpp.
| numEntries / fpp | 1 * 10^-6 | 1 * 10^-7 | 1 * 10^-8 | 1 * 10^-9 |
|-------|--------|---------|---------|-----------|
| 100k  | 400Kb  | 560Kb   | 640Kb   | 710Kb     |
| 250k  | 1.2Mb  | 1.4Mb   | 1.6Mb   | 1.8Mb     |
| 500k  | 2.3Mb  | 2.8Mb   | 3.1Mb   | 3.6Mb     |
| 750k  | 3.6Mb  | 4.1Mb   | 4.8Mb   | 5.4Mb     |
| 1M    | 4.8Mb  | 5.6Mb   | 6.4Mb   | 7.1Mb     |
| 1.25M | 6Mb    | 7Mb     | 8Mb     | 9Mb       |
| 1.5M  | 7.2Mb  | 8.4Mb   | 9.6Mb   | 10Mb      |

So, maybe you can try 250k/500k with 1 * 10^-6 or 1 * 10^-7. Or, a better option is to run some workloads and determine which settings best fit your case. May I know what perf impact you are seeing, btw?
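For reference, the sizes in the table above roughly follow the textbook bloom filter sizing formula. Here is a quick sketch you can use to estimate sizes for other numEntries/fpp combinations yourself — note this is the standard formula, which may differ slightly from the exact sizing in Hudi's internal (Hadoop-style) bloom filter implementation:

```python
import math

def bloom_size_bytes(num_entries: int, fpp: float) -> int:
    """Optimal bit-array size for a standard bloom filter, in bytes.
    m = -n * ln(p) / (ln 2)^2, converted from bits to bytes."""
    bits = -num_entries * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits / 8)

def optimal_hashes(num_entries: int, fpp: float) -> int:
    """Optimal number of hash functions: k = (m/n) * ln 2."""
    bits = -num_entries * math.log(fpp) / (math.log(2) ** 2)
    return max(1, round(bits / num_entries * math.log(2)))

for n in (100_000, 500_000, 1_500_000):
    for p in (1e-6, 1e-9):
        print(f"n={n:>9,} fpp={p:g}: "
              f"{bloom_size_bytes(n, p):>9,} bytes, "
              f"{optimal_hashes(n, p)} hashes")
```

For example, 100k entries at 1 * 10^-6 fpp works out to roughly 360KB by this formula, in the same ballpark as the 400Kb figure above.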

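The three index-lookup steps described earlier (range pruning, bloom check, then the actual key scan) can be sketched roughly as follows. Note the classes and sets here are simplified stand-ins for illustration, not Hudi's actual data structures — in particular, a plain `set` stands in for the real bloom filter:

```python
from dataclasses import dataclass, field

@dataclass
class DataFile:
    min_key: str
    max_key: str
    bloom: set = field(default_factory=set)  # stand-in for a real bloom filter
    keys: set = field(default_factory=set)   # actual record keys in the file

def candidate_files(files, key):
    # Step 1: range pruning -- keep only files whose [min, max] covers the key.
    # This helps only when files cover distinct slices of the key space.
    in_range = [f for f in files if f.min_key <= key <= f.max_key]
    # Step 2: bloom check -- a real bloom filter can return false positives,
    # so files that survive this step still need an actual lookup.
    return [f for f in in_range if key in f.bloom]

def locate(files, key):
    # Step 3: scan the surviving files for the actual key.
    for f in candidate_files(files, key):
        if key in f.keys:
            return f
    return None
```

A false positive in step 2 just means one extra file gets scanned in step 3 — correctness is unaffected, which is why a looser fpp only costs lookup time, not accuracy.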