aherbert commented on PR #695:
URL: 
https://github.com/apache/commons-collections/pull/695#issuecomment-4804426754

   Short answer: Bloom filter operations are not atomic. Making them so is too 
expensive. Whatever option we choose here will be a partial fix and the filter 
is likely compromised in any case. The library should be used when you have 
control over the entire stack of code that generates indices. Then you will not 
have a problem. This is a bug for an atypical use case.
   
   ---
   
   I think that the library is mainly optimised for correct usage. If you 
deliberately use bad content then this lazy checking will result in the filter 
notifying the caller via an exception that it failed after it has modified its 
internal state. The filter should then be considered broken.
   
   Note that if you do early checking on the indices and raise an exception on 
the first bad index, you may have already merged some other indices and so have 
a broken filter anyway. Unfortunately making the filter merge operations atomic 
is far too expensive for practical use. So any merge will likely break the 
filter if an index is invalid, no matter what code changes we make here.
   
   The OP notes that index checking is performed in SimpleBloomFilter so that 
cannot have a corrupted state. But it would be corrupted if indices have been 
already merged. Note also in the same class the 
``SimpleBloomFilter.merge(BitMapExtractor)`` can merge bits above the size of 
the filter defined by the shape. It will raise an exception but not fix its own 
state (which it could do with masking out the bad bits before raising the 
exception). So that also has an issue with broken cardinality after a bad merge.
   
   We could modify the code to remove invalid indices before raising. Here the 
indices are in a TreeSet. So the indices could be removed with:
   ```java
   indices.tailSet(shape.getNumberOfBits()).clear();
   ```
   
   But this only partially corrects something that is broken. It is still 
broken, just less so. I think there are several solutions which require some 
discussion:
   
   1. Clean up bad indices before throwing. The filter could have partially 
merged indices that do not correspond to any real hashed object.
   2. Leave it as is and document that you should never ignore exceptions and 
then forever consider the filter compromised.
   3. Set an internal flag before throwing the exception. This internal state 
flag would be similar to the one used in the ``ArrayCountingBloomFilter``. It 
knows when it has been broken based on this state flag. You can check this 
using ``isValid()`` at any point.
   4. Make them atomic. You could go through all the indices and check them; 
them go through them all again and add them. This is double the computation 
cost for all correct usage of the classes to stop malicious usage which really 
should not be a problem anyway.
   
   Unfortunately the ``isValid`` flag is part of the ``CountingBloomFilter`` 
interface. Perhaps it should have been put into the ``BloomFilter`` interface 
instead. Adding it now would have to be a default method that returns true. We 
could then implement it correctly in all the filters in this codebase.
   
   Note: You cannot go through the indices again and remove them. You may 
remove indices that are from other object hashes added to the filter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to