aherbert commented on PR #331: URL: https://github.com/apache/commons-collections/pull/331#issuecomment-1242655670
I have found the index sorting and duplicate elimination routines. I know how to fix the test. The question is whether the test is indicating a problem. I think it is. The default implementation of: - IndexProducer.forEach _will_ return duplicates - IndexProducer.asIndexArray _will not_ return duplicates Is this the functionality we wish to support? If so then it should be documented. Currently there is no indication in the javadoc that asIndexArray is either sorted or distinct. I can fix the test; I can add the documentation; but should this be done? Looking at the overrides for asIndexArray: - ArrayCountingBloomFilter: _will not_ return duplicates - EnhancedDoubleHasher: _will_ return duplicates - HasherCollection: _will_ return duplicates Note that the two hashers have a code comment that they **need** to return duplicates. So we have a contradictory functionality. BloomFilters and a default IndexProducer will produce sorted distinct indices for asIndexArray. Hashers will be unsorted and may contain duplicates. The HasherCollection can output a complete jumble of indices which make no practical sense. It could be modified to make the indices from each hasher unique, then duplicates will indicate multiple hashers produced the same index. However this is not supported in the forEach method and would be a performance overhead. Currently the HasherCollection thus acts as if one item but it is built from multiple hashers. This should be in the documentation to prevent the mistake that the collection will act as if it represents multiple items. So is the suggested fix: - document that the **default** asIndexArray method will output distinct indices (current behaviour, but no contract for overrides) - update the test in DefaultIndexProducerTest.testAsIndexArray to remove duplicates from the expected array. This will then enforce the **default** behaviour that the indices are distinct Note: this does not mandate that asIndexArray must **always** return no duplicates, only that the default implementation does. Or: - document the asIndexArray method to state that indices may be duplicates (current behaviour) - update the test in DefaultIndexProducerTest.testAsIndexArray to be robust to duplicates I would favour the second approach as it avoids a commitment to certain functionality and highlights that duplicates are possible. This has minimum impact on the rest of the code. I am happy to make the second change now and finish this merge. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
