[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-15 Thread GitBox
nsivabalan commented on issue #976: [HUDI-106] Adding support for 
DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-565845894
 
 
   @vinothchandar : the diff is ready. you are good to review. Sorry, forgot 
you ping. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-13 Thread GitBox
nsivabalan commented on issue #976: [HUDI-106] Adding support for 
DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-565650865
 
 
   Fixed the check style issue. 
   Pending items:
   - One nit in CleanerUtils
   - Have we add test classes for InternalDynamicBloomFilter and bounded filter 
classes.. We need some tests that checks the bounding & dynamic aspects?
   
   Will update you once done. 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-08 Thread GitBox
nsivabalan commented on issue #976: [HUDI-106] Adding support for 
DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-563003172
 
 
   > @nsivabalan whats pending here still for final check and merge?
   
   Nothing much. just your final set of review and I need to fix the build 
failure. for some reason, my local build is succeeding but travis ci build 
fails with checkstyle error specifically with import ordering. Will work on it, 
but don't need to hold your review until I fix the build issue. Once I get your 
final set of comments, will fix all together. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-11-26 Thread GitBox
nsivabalan commented on issue #976: [HUDI-106] Adding support for 
DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-558927506
 
 
   Here are the results from FP ratio testing. 
   Exp params:  Error ratio 1.0E-6. For HoodieDynamic Capped bloom filter, max 
number of entries = 5 * init numEntries
   
   InitNumEntries/EntriesAdded |Hadoop Dynamic(non capped) | 
HoodieDynamic(Capped)
   -- | -- | --
   5k/10k | 1.9E-6 | 1.8E-6
   5k/50k | 9.1E-6| 0.0162626
   10k/10k | 1.4E-6 | 1.1E-6
   10k/50k | 3.7E-6 | 5.6E-6
   10k/60k | 7.6E-6 | 5.7E-5
   
   Serialized size in bytes
   
   InitNumEntries/EntriesAdded |Simple | Hadoop Dynamic(non capped) | 
HoodieDynamic(Capped)
   -- | -- | -- | --
   5k/10k |  23980 | 47996 | 47996
   5k/50k | 23980 | 287796 | 119936
   10k/10k | 47944 |47976 |  47976
   10k/50k | 47944 | 287692 | 239748
   10k/60k | 47944 | 575348 | 239748
   
   
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-11-26 Thread GitBox
nsivabalan commented on issue #976: [HUDI-106] Adding support for 
DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-558925795
 
 
   I have added our own impl of DynamicBloom where in we cap the max number of 
entries. In other words, up until maxNumEntries, the FP ratio will be honored, 
after which there are no gaurantees. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-11-03 Thread GitBox
nsivabalan commented on issue #976: [HUDI-106] Adding support for 
DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-549160133
 
 
   > @nsivabalan Here is the problem as I see it, w.r.t bounding size.. 
Currently we have a low default 60K, which comes out to reading ~400kb from the 
parquet footer. not too shabby a overhead.. My understanding is that the 
parquet footers are all read at once and even query engines would read the 
footer.. So if we don't bound the size of the dynamic bloom fitler to say 1MB 
or so, queries can pay a penalty? (I dont know how big this would be or if its 
okay) But we won't offer the user the choice to make tradeoffs.. IIUC we need 
our own impl of dynamic bloom if we were to limit the size.. correct? how 
doable is that?
   > 
   > > I am trying to find ways to test the FP ratio. Not sure how would you 
test that.
   > > The way I have done it in the past, is to generate a lot of key and hold 
it in two lists : added, notAdded.. I add the ones from `added` to bloom filter 
and then check for false positives using notAdded list.. how much % of not 
added had a hit is your fp.. For this impl, we need ensure that the fp ratio 
remains the same, even as you increase the size of added/notAdded lists..
   > 
   > Other small points:
   > 
   > * if we don't have a way to configure the bloom type to use, we should add 
one
   > * We should consider if the default here should be `the error rate 10^-9`. 
This will also help reduce the size.. we already have techniques like range 
pruning  to reduce the amount of comparisons.. Assuming even a large 100M 
entries inserted into a partition, if the bloom filter had `10^-8`, it might be 
enough prevent  false positives right.. I guess this will drop the storage 
needed considerably?
   
   Thanks Vinoth for the detailed response. 
   - To bound the dynamic bloom, we have two options. a. I can google to find 
if we have some ready to use solution. b. If nothing exists, we can come up 
with our own DynamicBloom with max bounds on the no of entries. Just that until 
we reach the max no of entries, the FP ratio will be honored, after which the 
FP ratio may start to increase. I can get this done. 
   - Wrt testing FP ratio, thanks for the idea. I had similar idea. 
   - Here are the size differences between FP ratio 10^-8 and 10^-9. Key size 
considered is 50 bytes. 
   
   Size | FP 10^-8 | FP 10^-9
   -- | -- | --
   10k | 380k | 430k
   100k | 760k | 860k
   1M | 6.5MB | 7.3MB
   10M | 64MB | 72MB
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-11-01 Thread GitBox
nsivabalan commented on issue #976: [HUDI-106] Adding support for 
DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-548883578
 
 
   > Left some comments... can we also add a test to test the "dynamic" nature 
of the filter. e,g having more entries should result in larger filter with same 
fp ratio.. And also how are you enforcing a maximum dynamic bloom filter size. 
Can you share data on how big the bloom filter would be, if you say wrote 1M 
keys at fpp ratio 10^-9
   
   Few questions/clarifications:
   - I guess you can't bound the size in dynamic bloom filter. Size will grow 
according to the number to entries added. Initialize number of entries passed 
will be used to set the min size. 
   - I am trying to find ways to test the FP ratio. Not sure how would you test 
that. 
   - I was able to verify that adding more entries to the filter than the 
initial size, increases the size of the bloom. 
   - Here are the sizes of dynamic bloom filter with error rate 10^-9 and 
initial number of entries as 10k
Size of bloom with 100 entries = 71940 bytes ~= 71kb
Size of bloom with 1000 entries = 71940 bytes ~= 71kb
Size of bloom with 1 entries = 71940 bytes ~= 71kb
Size of bloom with 10 entries = 719088 bytes ~= 720kb
Size of bloom with 100 entries  = 7190568 bytes ~= 7.1 MB


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-10-29 Thread GitBox
nsivabalan commented on issue #976: [HUDI-106] Adding support for 
DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-547716257
 
 
   Should I create a package for bloom.filter in org.apache.hudi.common? and 
move all BloomFilter related classes to it. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-10-29 Thread GitBox
nsivabalan commented on issue #976: [HUDI-106] Adding support for 
DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#issuecomment-547458994
 
 
   > @nsivabalan can you rebase against master and see if that helps the test 
pass
   
   Nope. I didn't import the checkstyle. will fix it. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services