aherbert commented on issue #109: TEXT-155: Add a generic 
IntersectionSimilarity measure
URL: https://github.com/apache/commons-text/pull/109#issuecomment-470882543
 
 
   I'd like to change this a bit to support the `Bag` functionality. One big 
nit I had was how duplicates are handled:
   
   ```
   "aa" vs "aa"       == 1
   "aa" vs "aaaaaaaa" == 1            (with no duplicates)
   "aa" vs "aaaaaaaa" == 2 / 8 = 0.25 (with duplicates)
   ```
   
   I will update the API so that the `Function` becomes a 
`Function<CharSequence, Collection<T>>`. The user then just needs to split the 
`CharSequence` into a collection. If it is a `Set`, the assumption is that the 
user wants an intersection without duplicates. Anything else can be converted 
internally to a `Bag` representation. I'll add the bag functionality as a 
private class and only borrow what is needed from commons-collections. The 
amount of code will be minimal, as it is just a `HashMap<T, MutableInteger>` 
so that the count of each item can be adjusted.
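   
   As a rough illustration of the idea (not the code that will land in the PR), 
the sketch below counts items in a `HashMap<T, MutableInteger>` and takes the 
smaller count of each shared item as the intersection. The class name 
`BagIntersectionSketch`, the two character splitters, and the `ratio` method are 
hypothetical names made up for this example. Reading the ratios above as 
intersection over union is also an assumption, chosen because it reproduces the 
three values.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch only; none of these names are commons-text API. */
final class BagIntersectionSketch {

    /** Mutable counter so a count can be adjusted in place. */
    private static final class MutableInteger {
        int value;
    }

    /** Splits into characters and keeps duplicates (bag semantics). */
    static final Function<CharSequence, Collection<Character>> CHARS_AS_LIST = cs -> {
        List<Character> out = new ArrayList<>();
        for (int i = 0; i < cs.length(); i++) {
            out.add(cs.charAt(i));
        }
        return out;
    };

    /** Splits into characters and drops duplicates (set semantics). */
    static final Function<CharSequence, Collection<Character>> CHARS_AS_SET =
        cs -> new HashSet<>(CHARS_AS_LIST.apply(cs));

    /** Counts occurrences of each item: a tiny stand-in for a Bag. */
    private static <T> Map<T, MutableInteger> toBag(Collection<T> items) {
        Map<T, MutableInteger> bag = new HashMap<>();
        for (T item : items) {
            bag.computeIfAbsent(item, k -> new MutableInteger()).value++;
        }
        return bag;
    }

    /** Size of the intersection; duplicates count unless both inputs are Sets. */
    static <T> int intersection(Collection<T> a, Collection<T> b) {
        if (a instanceof Set && b instanceof Set) {
            Set<T> common = new HashSet<>(a);
            common.retainAll(b);
            return common.size();
        }
        Map<T, MutableInteger> bagA = toBag(a);
        Map<T, MutableInteger> bagB = toBag(b);
        int total = 0;
        for (Map.Entry<T, MutableInteger> e : bagA.entrySet()) {
            MutableInteger other = bagB.get(e.getKey());
            if (other != null) {
                // A shared item contributes the smaller of its two counts.
                total += Math.min(e.getValue().value, other.value);
            }
        }
        return total;
    }

    /** Intersection over union (assumed measure for the example values). */
    static <T> double ratio(Function<CharSequence, Collection<T>> splitter,
                            CharSequence left, CharSequence right) {
        Collection<T> a = splitter.apply(left);
        Collection<T> b = splitter.apply(right);
        int shared = intersection(a, b);
        int union = a.size() + b.size() - shared;
        // Treating two empty inputs as identical is an arbitrary choice here.
        return union == 0 ? 1.0 : (double) shared / union;
    }

    public static void main(String[] args) {
        System.out.println(ratio(CHARS_AS_LIST, "aa", "aa"));       // 1.0
        System.out.println(ratio(CHARS_AS_SET, "aa", "aaaaaaaa"));  // 1.0
        System.out.println(ratio(CHARS_AS_LIST, "aa", "aaaaaaaa")); // 0.25
    }
}
```

   Whether the caller returns a `Set` or any other `Collection` is the only 
switch between the two behaviours, so no extra flag is needed in this sketch.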
   
   This will get it working. If a dependency on commons-collections turns out 
not to be a problem, the internal class can be removed later without breaking 
binary compatibility, since it is private.
   
   I'll also change F1Score to SorensenDiceCoefficient so the name matches the 
parallel work in #103.
   
