Claudenw commented on issue #83: WIP: Initial bloom filter code contribution URL: https://github.com/apache/commons-collections/pull/83#issuecomment-544849307 Your statement is true. However you don't have 1 Bloom filter, you have at least 2. The one you describe (lets call it the target) is as you describe, The other (call it the candidate) is a Bloom filter comprised of the merge of several thousand (lets say 2,000) filters. You did not specify a probability but lets assume 1/1000. The candidate filter will have approx 4 * 2000 =-8000 bits enabled of which 4 may be collisions so 1996 at a minimum. So now you have a bit vector with 1996 integers to hold the merge (1996 * 32 = 63872 bits) . Your comparison now has to iterate through the list of integers in the target and see if they are in the candidate. The number of machine instructions instructions to perform the comparison assume 2 instructions on each filter (increment counter, retrieve) for each comparison will be approx (1996+4) * 2 = 4000 instructions to retrieve the info + up to 4 compare functions. If you were to encode the 1000 bits as longs you would have ceil(1000/64) = 16 longs. (16 * 64 = 1024 bits) To perform the comparison you have to perform 64 operations. Each filter requires 16 * 2 instructions to fetch the longs + 16 AND instructions + 16 compare instructions = 16*4 = 64 instructions to perform the entire calculation. So by using a bit vector you waste some space in the target filter but save significant space in the candidate and the speed of the comparison is much faster using the bit vectors than it is with the integers. Infact, it is the absolute speed of the bit vector comparison that makes indexing Bloom filters so hard. The overhead of the index often eats up the advantage of performing fewer checks. Claude On Tue, Oct 22, 2019 at 2:12 AM SilverNarcissus <[email protected]> wrote: > For checking whether an object is in the bloom filter. Assuming we hash > the object by 4 hash functions and there result is 1, 2000, 8888, 10000,. > We only need to check whether 1, 2000, 8888, 10000 bit location in the > array is 1. Creating a new bloom filter we need 10000 bits extra space and > we have to compare all bits of the two array. If we do the checking > operation 10000 times per second, may be there is an huge overhead. > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <https://github.com/apache/commons-collections/pull/83?email_source=notifications&email_token=AASTVHSVBJOSZZXAQJYQBQTQPZHRDA5CNFSM4IWGTGA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEB4ITVQ#issuecomment-544770518>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AASTVHSR4J5IT3BD65BNRNLQPZHRDANCNFSM4IWGTGAQ> > . > -- I like: Like Like - The likeliest place on the web <http://like-like.xenei.com> LinkedIn: http://www.linkedin.com/in/claudewarren
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
