aherbert commented on PR #331:
URL: 
https://github.com/apache/commons-collections/pull/331#issuecomment-1242655670

   I have found the index sorting and duplicate elimination routines. I know 
how to fix the test. The question is whether the test is indicating a problem. 
I think it is.
   
   The default implementation of:
   - IndexProducer.forEach _will_ return duplicates
   - IndexProducer.asIndexArray _will not_ return duplicates
   
   Is this the functionality we wish to support? If so then it should be 
documented. Currently there is no indication in the javadoc that asIndexArray 
is either sorted or distinct.
   
   I can fix the test; I can add the documentation; but should this be done?
   
   Looking at the overrides for asIndexArray:
   
   - ArrayCountingBloomFilter: _will not_ return duplicates
   - EnhancedDoubleHasher: _will_ return duplicates
   - HasherCollection: _will_ return duplicates
   
   Note that the two hashers have a code comment that they **need** to return 
duplicates.
   
   So we have a contradictory functionality. BloomFilters and a default 
IndexProducer will produce sorted distinct indices for asIndexArray. Hashers 
will be unsorted and may contain duplicates. The HasherCollection can output a 
complete jumble of indices which make no practical sense. It could be modified 
to make the indices from each hasher unique, then duplicates will indicate 
multiple hashers produced the same index. However this is not supported in the 
forEach method and would be a performance overhead. Currently the 
HasherCollection thus acts as if one item but it is built from multiple 
hashers. This should be in the documentation to prevent the mistake that the 
collection will act as if it represents multiple items.
   
   So is the suggested fix:
   
   - document that the **default** asIndexArray method will output distinct 
indices (current behaviour, but no contract for overrides)
   - update the test in DefaultIndexProducerTest.testAsIndexArray to remove 
duplicates from the expected array. This will then enforce the **default** 
behaviour that the indices are distinct
   
   Note: this does not mandate that asIndexArray must **always** return no 
duplicates, only that the default implementation does.
   
   Or:
   
   - document the asIndexArray method to state that indices may be duplicates 
(current behaviour)
   - update the test in DefaultIndexProducerTest.testAsIndexArray to be robust 
to duplicates
   
   I would favour the second approach as it avoids a commitment to certain 
functionality and highlights that duplicates are possible. This has minimum 
impact on the rest of the code. I am happy to make the second change now and 
finish this merge.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to