nsivabalan edited a comment on issue #2178:
URL: https://github.com/apache/hudi/issues/2178#issuecomment-711168471


   I guess the small record size of 35 bytes throws it off. So, let's see what 
we can do. 
   Before I go further, let me recap the SIMPLE bloom filter. 
   The bloom filter statically allocates its size based on numEntries and 
fpp. So, irrespective of whether you add 50k, 500k, or 1.5M entries (as per 
your config), the bloom size is going to be around 10MB. But it will 
guarantee the 1 * 10^-9 false positive probability. That's how a typical bloom 
filter works. If you initialize it for 1.5M entries with 1 * 10^-9 as fpp, it is 
going to allocate that many buckets. 
   Once you exceed the number of entries, the fpp may no longer be guaranteed. 
In other words, lookups will return more false positives than the configured 
1 * 10^-9. 
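   To make the sizing concrete, here is a quick sketch using the standard 
bloom filter formula (this is the textbook math, not Hudi's actual 
implementation, which also pays serialization overhead on top of the raw bits):

   ```python
   import math

   def bloom_bits(num_entries: int, fpp: float) -> int:
       """Optimal bit-array size for a standard bloom filter:
       m = -n * ln(p) / (ln 2)^2."""
       return math.ceil(-num_entries * math.log(fpp) / (math.log(2) ** 2))

   # 1.5M entries at fpp = 1 * 10^-9:
   bits = bloom_bits(1_500_000, 1e-9)
   print(f"{bits / 8 / 2**20:.1f} MB")  # 7.7 MB of raw bits
   ```

   Note the size depends only on numEntries and fpp, which is why adding fewer 
entries than configured does not shrink the filter.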
   
   Coming back to the problem: with the SIMPLE bloom filter, I guess we can't 
do much given the small record size. And yes, all the configs you mentioned 
will help reduce the time spent in index lookup. 
   At a high level, these are the steps done during index lookup:
   - Range lookup: filter out those data files whose [min, max] key range does 
not match the input record (for each input record). 
   - Bloom lookup: from the filtered files, use the bloom filter to further 
trim down the data files to be looked up. 
   - After all this filtering, look up each record key in the remaining data 
files and return the matched location if found. 
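   The steps above can be sketched roughly as follows (FileMeta and the 
in-memory set standing in for a real bloom filter are simplified illustrations, 
not Hudi's actual classes):

   ```python
   from dataclasses import dataclass

   @dataclass
   class FileMeta:
       path: str
       min_key: str
       max_key: str
       bloom: set  # stand-in for a real bloom filter

   def candidate_files(record_key, files):
       # Step 1: range pruning - keep files whose [min, max] could hold the key.
       in_range = [f for f in files if f.min_key <= record_key <= f.max_key]
       # Step 2: bloom check - drop files whose filter says "definitely absent".
       return [f for f in in_range if record_key in f.bloom]
       # Step 3 (not shown): scan the surviving files for the exact location.

   files = [
       FileMeta("f1", "a", "m", {"apple", "cat"}),
       FileMeta("f2", "n", "z", {"pear"}),
   ]
   print([f.path for f in candidate_files("cat", files)])  # ['f1']
   ```

   A real bloom filter can return false positives (hence step 3), but never 
false negatives, so pruned files are safe to skip.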
   
   So, a few things to note here: 
   - Range filtering only helps if your data is laid out such that each data 
file holds a distinct subset of the key space. If every data file's range has 
more or less the same min and max values, this filtering may not help much. 
   - Bloom filter lookup: again, depending on the fpp it was initialized with, 
it will trim down most data files when the key is not present. 
   
   Having said all this, here is a rough idea of bloom filter sizes for 
different values of numEntries and fpp:
   
   numEntries / FPP | 1 * 10^-6 | 1 * 10^-7 | 1 * 10^-8 | 1 * 10^-9
   -------|--------|---------|---------|-----------
   100k | 400k | 560k | 640k | 710k
   250k | 1.2Mb | 1.4Mb | 1.6Mb | 1.8Mb
   500k | 2.3Mb | 2.8Mb | 3.1Mb | 3.6Mb
   750k | 3.6Mb | 4.1Mb | 4.8Mb | 5.4Mb
   1M | 4.8Mb | 5.6Mb | 6.4Mb | 7.2Mb
   1.25M | 6Mb | 7Mb | 8Mb | 9Mb
   1.5M | 7.2Mb | 8.4Mb | 9.6Mb | 10Mb
   
   So, maybe you can try 250k/500k with 1 * 10^-6 or 1 * 10^-7. 
   Or, better, run some workloads and determine which settings best fit 
your case. 
   
   By the way, may I know what perf impact you are seeing? 
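   For reference, here is roughly where those knobs would go in a PySpark 
write (a hypothetical sketch: `df` and `basePath` are assumed to exist, the 
table name and values are illustrative, and you should verify the config keys 
against your Hudi version):

   ```python
   # Illustrative Hudi write options; values are examples, not recommendations.
   (df.write.format("hudi")
       .option("hoodie.table.name", "my_table")
       .option("hoodie.index.type", "BLOOM")
       .option("hoodie.bloom.index.filter.type", "SIMPLE")
       .option("hoodie.index.bloom.num_entries", "250000")  # entries per filter
       .option("hoodie.index.bloom.fpp", "0.000001")        # 1 * 10^-6
       .mode("append")
       .save(basePath))
   ```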


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
