aherbert commented on issue #109: TEXT-155: Add a generic IntersectionSimilarity measure
URL: https://github.com/apache/commons-text/pull/109#issuecomment-470882543

I'd like to change this a bit to support the `Bag` functionality. One big nit I had was that:

```
"aa" vs "aa" == 1
"aa" vs "aaaaaaaa" == 1 (with no duplicates)
"aa" vs "aaaaaaaa" == 2 / 8 = 0.25 (with duplicates)
```

I will update the API so that the `Function` becomes a `Function<CharSequence, Collection<T>>`. The user then just needs to split the `CharSequence` into a collection. If it is a `Set`, the assumption is that the user wants an intersection without duplicates. Anything else can be converted internally to a `Bag` representation.

I'll add the bag functionality as a private class and only borrow what is needed from commons-collections. The amount of code will be minimal, as it is just a `HashMap<T, MutableInteger>` so that the count of each item can be adjusted.

This will get it working. If a dependency on commons-collections is not a problem, the internal class can be removed later without breaking binary compatibility.

I'll also rename F1Score to SorensenDiceCoefficient so the name matches the parallel work in #103.
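To make the set-versus-bag distinction concrete, here is a minimal sketch, not the code from the PR. The class name `BagIntersectionSketch`, the `TO_LIST` and `TO_SET` converters, and the `jaccard` method are all hypothetical. It assumes the score is intersection divided by union, which is consistent with the `2 / 8` figure above but is not stated in the comment, and it uses a plain `HashMap<T, Integer>` in place of the `HashMap<T, MutableInteger>` mentioned.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class BagIntersectionSketch {

    /** Hypothetical converter: split into characters, keeping duplicates. */
    static final Function<CharSequence, Collection<Character>> TO_LIST = cs -> {
        List<Character> out = new ArrayList<>();
        for (int i = 0; i < cs.length(); i++) {
            out.add(cs.charAt(i));
        }
        return out;
    };

    /** Hypothetical converter: split into characters, dropping duplicates. */
    static final Function<CharSequence, Collection<Character>> TO_SET =
        cs -> new HashSet<>(TO_LIST.apply(cs));

    /** Count occurrences of each item (a very small stand-in for a Bag). */
    static <T> Map<T, Integer> toBag(Collection<T> items) {
        Map<T, Integer> bag = new HashMap<>();
        for (T item : items) {
            bag.merge(item, 1, Integer::sum);
        }
        return bag;
    }

    /** Intersection over union; sets ignore duplicates, anything else is a bag. */
    static <T> double jaccard(CharSequence a, CharSequence b,
                              Function<CharSequence, Collection<T>> converter) {
        Collection<T> ca = converter.apply(a);
        Collection<T> cb = converter.apply(b);
        int intersection = 0;
        if (ca instanceof Set && cb instanceof Set) {
            Set<T> common = new HashSet<>(ca);
            common.retainAll(cb);
            intersection = common.size();
        } else {
            Map<T, Integer> bagA = toBag(ca);
            Map<T, Integer> bagB = toBag(cb);
            // Each shared item contributes the smaller of its two counts.
            for (Map.Entry<T, Integer> e : bagA.entrySet()) {
                intersection += Math.min(e.getValue(), bagB.getOrDefault(e.getKey(), 0));
            }
        }
        // |A union B| = |A| + |B| - |A intersect B| holds for sets and for bags.
        int union = ca.size() + cb.size() - intersection;
        // Assumption: two empty inputs are treated as identical.
        return union == 0 ? 1.0 : (double) intersection / union;
    }

    public static void main(String[] args) {
        System.out.println(jaccard("aa", "aa", TO_LIST));       // bag: 2 / 2
        System.out.println(jaccard("aa", "aaaaaaaa", TO_SET));  // set: 1 / 1
        System.out.println(jaccard("aa", "aaaaaaaa", TO_LIST)); // bag: 2 / 8
    }
}
```

Passing a converter that returns a `Set` selects the duplicate-free path, while any other `Collection` is counted as a bag, mirroring the dispatch described in the comment.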
