[GitHub] [commons-collections] Claudenw commented on issue #83: WIP: Initial bloom filter code contribution

GitBox Tue, 22 Oct 2019 01:04:26 -0700

Claudenw commented on issue #83: WIP: Initial bloom filter code contribution
URL: 
https://github.com/apache/commons-collections/pull/83#issuecomment-544849307

   Your statement is true. However you don't have 1 Bloom filter, you have at
   least 2.  The one you describe (lets call it the target) is as you
   describe, The other (call it the candidate) is a Bloom filter comprised of
   the merge of several thousand (lets say 2,000) filters.  You did not
   specify a probability but lets assume 1/1000.  The candidate filter will
   have approx 4 * 2000 =-8000 bits enabled of which 4 may be collisions so
   1996 at a minimum.

   So now you have a bit vector with 1996 integers to hold the merge (1996 *
   32 = 63872 bits) .  Your comparison now has to iterate through the list of
   integers in the target and see if they are in the candidate.  The number of
   machine instructions instructions to perform the comparison assume 2
   instructions on each filter (increment counter, retrieve) for each
   comparison will be approx (1996+4) * 2 = 4000 instructions to retrieve the
   info +  up to 4 compare functions.

   If you were to encode the 1000 bits as longs you would have ceil(1000/64) =
   16 longs. (16 * 64 = 1024 bits)
   To perform the comparison you have to perform 64 operations. Each filter
   requires 16 * 2 instructions to fetch the longs + 16 AND instructions + 16
   compare instructions = 16*4 = 64 instructions to perform the entire
   calculation.

   So by using a bit vector you waste some space in the target filter but save
   significant space in the candidate and the speed of the comparison is much
   faster using the bit vectors than it is with the integers.  Infact, it is
   the absolute speed of the bit vector comparison that makes indexing Bloom
   filters so hard.  The overhead of the index often eats up the advantage of
   performing fewer checks.

   Claude

   On Tue, Oct 22, 2019 at 2:12 AM SilverNarcissus <[email protected]>
   wrote:

   > For checking whether an object is in the bloom filter. Assuming we hash
   > the object by 4 hash functions and there result is 1, 2000, 8888, 10000,.
   > We only need to check whether 1, 2000, 8888, 10000 bit location in the
   > array is 1. Creating a new bloom filter we need 10000 bits extra space and
   > we have to compare all bits of the two array. If we do the checking
   > operation 10000 times per second, may be there is an huge overhead.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > 
<https://github.com/apache/commons-collections/pull/83?email_source=notifications&email_token=AASTVHSVBJOSZZXAQJYQBQTQPZHRDA5CNFSM4IWGTGA2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEB4ITVQ#issuecomment-544770518>,
   > or unsubscribe
   > 
<https://github.com/notifications/unsubscribe-auth/AASTVHSR4J5IT3BD65BNRNLQPZHRDANCNFSM4IWGTGAQ>
   > .
   >

   -- 
   I like: Like Like - The likeliest place on the web
   <http://like-like.xenei.com>
   LinkedIn: http://www.linkedin.com/in/claudewarren


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

[GitHub] [commons-collections] Claudenw commented on issue #83: WIP: Initial bloom filter code contribution

Reply via email to