Claudenw commented on PR #331: URL: https://github.com/apache/commons-collections/pull/331#issuecomment-1242671100
IndexProducers may return duplicates and make no order guarantees. (There used to be an order guarantee, but we removed it.)

Hasher-based IndexProducers, by their nature, generally return unordered and possibly duplicate values. There is a hasher method that produces an IndexProducer that guarantees uniqueness.

BloomFilter-based IndexProducers, by their nature, generally return ordered and unique values. I can think of implementations where the ordering would not hold, but we don't have one.

The default IndexProducer implementation uses a BitSet to simplify the code that produces the index list, so the uniqueness is an artefact of the implementation. If you have a fast implementation that can take the forEachIndex() output and convert it to an array without imposing the uniqueness constraint, then please implement that.

In short, I concur with your assessment.

The HasherCollection is intended to simplify creation of some filters. In practice it is the same as calling functions that take a hasher argument once for each hasher in the collection. Because of the differences between classes of Bloom filters (e.g. standard, counting, stable), there are times when the duplicates are required.

HasherCollections work well in distributed systems where an object is represented in a Bloom filter by hashes of multiple properties. For a query across the systems, a HasherCollection is constructed and passed to the endpoints. Each endpoint can then build filters based on the shape used at that endpoint. So the specific back-end systems do not have to agree on shape, but they do have to agree on the hash algorithm.
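
To illustrate the point about uniqueness being an artefact of the BitSet, here is a minimal, self-contained sketch. It does not use the commons-collections classes: `SimpleIndexProducer`, `fromValues`, and both `toArray...` helpers are hypothetical names made up for this example, and the only thing assumed from the discussion above is a `forEachIndex` callback that takes an `IntPredicate`. One helper collects through a BitSet (ordered, de-duplicated), the other appends to a growable array (producer order, duplicates kept).

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.function.IntPredicate;

public class IndexCollectors {

    /** Hypothetical stand-in for an index producer; NOT the library interface. */
    @FunctionalInterface
    public interface SimpleIndexProducer {
        /** Passes each index to the consumer; stops early if it returns false. */
        boolean forEachIndex(IntPredicate consumer);
    }

    /** Test helper (assumption): a producer that emits the given values as-is. */
    public static SimpleIndexProducer fromValues(int... values) {
        return consumer -> {
            for (int v : values) {
                if (!consumer.test(v)) {
                    return false;
                }
            }
            return true;
        };
    }

    /** BitSet-based collection: the result is sorted and unique as a side effect. */
    public static int[] toArrayViaBitSet(SimpleIndexProducer producer) {
        BitSet bits = new BitSet();
        producer.forEachIndex(i -> {
            bits.set(i);
            return true;
        });
        return bits.stream().toArray();
    }

    /** Growable-array collection: keeps the producer's order and any duplicates. */
    public static int[] toArrayKeepingDuplicates(SimpleIndexProducer producer) {
        GrowableIntArray sink = new GrowableIntArray();
        producer.forEachIndex(sink);
        return sink.toArray();
    }

    /** Minimal int buffer that doubles its capacity when full. */
    private static final class GrowableIntArray implements IntPredicate {
        private int[] data = new int[16];
        private int size;

        @Override
        public boolean test(int index) {
            if (size == data.length) {
                data = Arrays.copyOf(data, size * 2);
            }
            data[size++] = index;
            return true; // never asks the producer to stop
        }

        int[] toArray() {
            return Arrays.copyOf(data, size);
        }
    }

    public static void main(String[] args) {
        SimpleIndexProducer producer = fromValues(5, 1, 5, 3);
        System.out.println(Arrays.toString(toArrayViaBitSet(producer)));         // [1, 3, 5]
        System.out.println(Arrays.toString(toArrayKeepingDuplicates(producer))); // [5, 1, 5, 3]
    }
}
```

The growable-array version avoids allocating a BitSet sized to the largest index, at the cost of an occasional array copy. Whether that is actually faster for real hashers and shapes would need to be measured.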
