aherbert commented on issue #109: TEXT-155: Add a generic IntersectionSimilarity measure
URL: https://github.com/apache/commons-text/pull/109#issuecomment-470733159

This class could be used internally by the `JaccardSimilarity`. It could also be used by a new Sørensen-Dice similarity (although I prefer the name F1Score). **Note that the new implementation of the Sørensen-Dice similarity uses bigrams, not single characters.**

That said, I do not see why it should be package private. The functionality is generic and may be of use to someone. I am fine with either option, though.

My motivation was to unify the idea of splitting a `CharSequence` into `things` and then computing the intersection of two sets of `things`. If this class is public, the user can write their own method to create the `Set` from the `CharSequence`; the rest of the similarity metric is already implemented.

Alternatively, we hide it at the package level and create a `SorensonDiceSimilarity` equivalent to the `JaccardSimilarity`. In that case I'd suggest adding parameterised constructors to specify the N-gram size, with a default of 1, and then adding a package-level utils class that creates the `Set<N-gram>` for whatever N-gram size is requested. However, limiting the API to N-grams stops someone from computing a `Jaccard` using words (cf. the `CosineSimilarity`, which splits on whitespace). If the class were public, this would be possible.

The API link you gave (thanks) just uses functions already in Google's Guava for working with `Set`, plus `Multiset`, Guava's own version of the `commons-collections` `Bag`. It would be easy to extend this helper class to use a `Bag` approach, but I did not do so because `commons-text` does not depend on `commons-collections`.
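To make the "split into `things`, then intersect" idea concrete, here is a minimal sketch. It is not the code in this PR: the class name `IntersectionSketch`, its methods, and the two converter helpers are hypothetical names chosen for illustration, and the return value for two empty inputs is an arbitrary assumption.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch only; names and behaviour are assumptions, not the PR's API. */
final class IntersectionSketch<T> {
    /** User-supplied function that splits a CharSequence into a set of "things". */
    private final Function<CharSequence, Set<T>> converter;

    IntersectionSketch(Function<CharSequence, Set<T>> converter) {
        this.converter = converter;
    }

    /** Returns {size of left set, size of right set, size of intersection}. */
    int[] sizes(CharSequence left, CharSequence right) {
        Set<T> a = converter.apply(left);
        Set<T> b = converter.apply(right);
        Set<T> common = new HashSet<>(a);
        common.retainAll(b); // keep only elements present in both sets
        return new int[] {a.size(), b.size(), common.size()};
    }

    /** Jaccard index: |A and B| / |A or B|. */
    double jaccard(CharSequence left, CharSequence right) {
        int[] s = sizes(left, right);
        int union = s[0] + s[1] - s[2];
        // Assumption: two empty sets are treated as identical.
        return union == 0 ? 1.0 : (double) s[2] / union;
    }

    /** Sorensen-Dice coefficient (F1 score): 2 |A and B| / (|A| + |B|). */
    double f1(CharSequence left, CharSequence right) {
        int[] s = sizes(left, right);
        int total = s[0] + s[1];
        // Assumption: two empty sets are treated as identical.
        return total == 0 ? 1.0 : 2.0 * s[2] / total;
    }

    /** Example converter: the set of character N-grams of size n. */
    static Set<String> nGrams(CharSequence cs, int n) {
        Set<String> set = new HashSet<>();
        for (int i = 0; i + n <= cs.length(); i++) {
            set.add(cs.subSequence(i, i + n).toString());
        }
        return set;
    }

    /** Example converter: the set of whitespace-separated words. */
    static Set<String> words(CharSequence cs) {
        Set<String> set = new HashSet<>();
        for (String w : cs.toString().trim().split("\\s+")) {
            if (!w.isEmpty()) {
                set.add(w);
            }
        }
        return set;
    }
}
```

With a public class along these lines, a bigram-based F1 score would be `new IntersectionSketch<String>(cs -> IntersectionSketch.nGrams(cs, 2)).f1(a, b)`, and a word-based Jaccard would only need `IntersectionSketch::words` as the converter. A package-private class limited to N-grams would rule out the second case.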
