[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter
nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-565845894 @vinothchandar : the diff is ready. you are good to review. Sorry, forgot you ping. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter
nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-565650865 Fixed the check style issue. Pending items: - One nit in CleanerUtils - Have we add test classes for InternalDynamicBloomFilter and bounded filter classes.. We need some tests that checks the bounding & dynamic aspects? Will update you once done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter
nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-563003172 > @nsivabalan whats pending here still for final check and merge? Nothing much. just your final set of review and I need to fix the build failure. for some reason, my local build is succeeding but travis ci build fails with checkstyle error specifically with import ordering. Will work on it, but don't need to hold your review until I fix the build issue. Once I get your final set of comments, will fix all together. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter
nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-558927506 Here are the results from FP ratio testing. Exp params: Error ratio 1.0E-6. For HoodieDynamic Capped bloom filter, max number of entries = 5 * init numEntries InitNumEntries/EntriesAdded |Hadoop Dynamic(non capped) | HoodieDynamic(Capped) -- | -- | -- 5k/10k | 1.9E-6 | 1.8E-6 5k/50k | 9.1E-6| 0.0162626 10k/10k | 1.4E-6 | 1.1E-6 10k/50k | 3.7E-6 | 5.6E-6 10k/60k | 7.6E-6 | 5.7E-5 Serialized size in bytes InitNumEntries/EntriesAdded |Simple | Hadoop Dynamic(non capped) | HoodieDynamic(Capped) -- | -- | -- | -- 5k/10k | 23980 | 47996 | 47996 5k/50k | 23980 | 287796 | 119936 10k/10k | 47944 |47976 | 47976 10k/50k | 47944 | 287692 | 239748 10k/60k | 47944 | 575348 | 239748 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter
nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-558925795 I have added our own impl of DynamicBloom where in we cap the max number of entries. In other words, up until maxNumEntries, the FP ratio will be honored, after which there are no gaurantees. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter
nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-549160133 > @nsivabalan Here is the problem as I see it, w.r.t bounding size.. Currently we have a low default 60K, which comes out to reading ~400kb from the parquet footer. not too shabby a overhead.. My understanding is that the parquet footers are all read at once and even query engines would read the footer.. So if we don't bound the size of the dynamic bloom fitler to say 1MB or so, queries can pay a penalty? (I dont know how big this would be or if its okay) But we won't offer the user the choice to make tradeoffs.. IIUC we need our own impl of dynamic bloom if we were to limit the size.. correct? how doable is that? > > > I am trying to find ways to test the FP ratio. Not sure how would you test that. > > The way I have done it in the past, is to generate a lot of key and hold it in two lists : added, notAdded.. I add the ones from `added` to bloom filter and then check for false positives using notAdded list.. how much % of not added had a hit is your fp.. For this impl, we need ensure that the fp ratio remains the same, even as you increase the size of added/notAdded lists.. > > Other small points: > > * if we don't have a way to configure the bloom type to use, we should add one > * We should consider if the default here should be `the error rate 10^-9`. This will also help reduce the size.. we already have techniques like range pruning to reduce the amount of comparisons.. Assuming even a large 100M entries inserted into a partition, if the bloom filter had `10^-8`, it might be enough prevent false positives right.. I guess this will drop the storage needed considerably? Thanks Vinoth for the detailed response. - To bound the dynamic bloom, we have two options. a. I can google to find if we have some ready to use solution. b. If nothing exists, we can come up with our own DynamicBloom with max bounds on the no of entries. Just that until we reach the max no of entries, the FP ratio will be honored, after which the FP ratio may start to increase. I can get this done. - Wrt testing FP ratio, thanks for the idea. I had similar idea. - Here are the size differences between FP ratio 10^-8 and 10^-9. Key size considered is 50 bytes. Size | FP 10^-8 | FP 10^-9 -- | -- | -- 10k | 380k | 430k 100k | 760k | 860k 1M | 6.5MB | 7.3MB 10M | 64MB | 72MB This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter
nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-548883578 > Left some comments... can we also add a test to test the "dynamic" nature of the filter. e,g having more entries should result in larger filter with same fp ratio.. And also how are you enforcing a maximum dynamic bloom filter size. Can you share data on how big the bloom filter would be, if you say wrote 1M keys at fpp ratio 10^-9 Few questions/clarifications: - I guess you can't bound the size in dynamic bloom filter. Size will grow according to the number to entries added. Initialize number of entries passed will be used to set the min size. - I am trying to find ways to test the FP ratio. Not sure how would you test that. - I was able to verify that adding more entries to the filter than the initial size, increases the size of the bloom. - Here are the sizes of dynamic bloom filter with error rate 10^-9 and initial number of entries as 10k Size of bloom with 100 entries = 71940 bytes ~= 71kb Size of bloom with 1000 entries = 71940 bytes ~= 71kb Size of bloom with 1 entries = 71940 bytes ~= 71kb Size of bloom with 10 entries = 719088 bytes ~= 720kb Size of bloom with 100 entries = 7190568 bytes ~= 7.1 MB This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter
nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-547716257 Should I create a package for bloom.filter in org.apache.hudi.common? and move all BloomFilter related classes to it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter
nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-547458994 > @nsivabalan can you rebase against master and see if that helps the test pass Nope. I didn't import the checkstyle. will fix it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services