aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity algoritham
URL: https://github.com/apache/commons-text/pull/103#issuecomment-470311464

Hi @ameyjadiye,

The code is now a good working implementation.

However, it should be noted that the Sorensen-Dice similarity is a binary scoring metric. It can be applied to anything where you have two sets to be matched. It is closely related to the Jaccard score.

This PR brings in a new algorithm that computes matches using pairs of characters (bigrams). This conflicts with the existing similarity measures in the package that implement `SimilarityScore<R>`:

JaccardSimilarity - uses `Set<Character>` for matching single characters to produce `Double`
JaroWinklerSimilarity - uses single character matching to produce an edit distance which is then converted to a similarity `Double`
LongestCommonSubsequence - uses single character matching for sub-sequences to produce `Integer`

The rest of the package implements `EditDistance<R>`, which is an extension of `SimilarityScore<R>`:

CosineDistance - uses a `RegexTokenizer` on whitespace to match words to produce `Double`
HammingDistance - uses single character matching to produce `Integer`
JaccardDistance - just inverts the JaccardSimilarity to produce `Double`
JaroWinklerDistance - complement of the Jaro-Winkler similarity to produce `Double`
LevenshteinDetailedDistance - single character changes to produce `LevenshteinResults`
LevenshteinDistance - single character changes to produce `Integer`
LongestCommonSubsequenceDistance - single character based substring to produce `Double`

**So all the other measures use single characters or words** (CosineDistance).

If you want a Sorensen-Dice score using single characters then just use the `JaccardSimilarity` and map its score `J`:

`S = 2J / (1 + J)`

This class is effectively a Jaccard similarity computed on `bigrams`, but with a mapped output score.
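To illustrate the mapping, here is a minimal sketch in plain Java. It does not use the commons-text classes: `DiceFromJaccardSketch`, `jaccard` and `diceFromJaccard` are hypothetical names made up for this example, and returning 1.0 for two empty inputs is an assumption rather than documented library behaviour.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of commons-text.
final class DiceFromJaccardSketch {

    // Jaccard similarity on the sets of single characters:
    // size of the intersection divided by size of the union.
    static double jaccard(CharSequence left, CharSequence right) {
        Set<Character> a = chars(left);
        Set<Character> b = chars(right);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 1.0; // assumption: two empty inputs count as identical
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    // Maps a Jaccard similarity J to the Sorensen-Dice similarity S = 2J / (1 + J).
    static double diceFromJaccard(double j) {
        return 2.0 * j / (1.0 + j);
    }

    // Collects the distinct characters of the input.
    private static Set<Character> chars(CharSequence text) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < text.length(); i++) {
            set.add(text.charAt(i));
        }
        return set;
    }
}
```

For "night" and "nacht" the two character sets share 3 of 7 distinct characters, so J = 3/7 and the mapped score is S = 0.6, the same value as computing 2 * 3 / (5 + 5) directly.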
One suggestion would be to rename this to something like `BigramJaccardSimilarity` and compute the Jaccard using bigrams, then add new classes for `SorensenDiceSimilarity` and `BiSorensenDiceSimilarity`.

However, the current `JaccardDistance` uses `Set<Character>`, which has space and efficiency advantages over `Set<String>`. Once you are using strings there is no reason to be limited to bigrams.

My preference would be to add a new class `TokenizerJaccardSimilarity` (or something nicer) that accepts a `Tokenizer<CharSequence>` in the constructor. It would tokenise the two inputs into two sets and compute the Jaccard score from the intersection and union of those sets. Your `bigram` case can be fulfilled by writing a new `Tokenizer<CharSequence>` that returns the bigrams. This new class would be flexible enough to compute the Jaccard for words, bigrams, trigrams, n-grams, or whatever else you can tokenise a `CharSequence` into.

Then you add a `TokenSorensenDiceSimilarity` that just maps the Jaccard score. You can see an example of this kind of mapping in the `JaccardDistance` class.

@kinow Opinions on a more flexible approach?
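For reference, the tokenizer-based proposal could be sketched roughly as follows. This is only an outline under stated assumptions: `TokenJaccardSketch` and its methods are hypothetical names, and a `java.util.function.Function<CharSequence, Set<String>>` stands in for the package's tokenizer type so that the example is self-contained.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch; class and method names are illustrative, not commons-text API.
final class TokenJaccardSketch {

    // Stand-in for a tokenizer: turns the input into a set of tokens.
    private final Function<CharSequence, Set<String>> tokenizer;

    TokenJaccardSketch(Function<CharSequence, Set<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    // Example tokenizer: the distinct overlapping character pairs of the input.
    static Set<String> bigrams(CharSequence text) {
        Set<String> tokens = new HashSet<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.subSequence(i, i + 2).toString());
        }
        return tokens;
    }

    // Jaccard similarity of the two token sets: intersection size over union size.
    double jaccard(CharSequence left, CharSequence right) {
        Set<String> a = tokenizer.apply(left);
        Set<String> b = tokenizer.apply(right);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 1.0; // assumption: no tokens on either side counts as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    // Sorensen-Dice similarity obtained by mapping the Jaccard score: S = 2J / (1 + J).
    double sorensenDice(CharSequence left, CharSequence right) {
        double j = jaccard(left, right);
        return 2.0 * j / (1.0 + j);
    }
}
```

With `new TokenJaccardSketch(TokenJaccardSketch::bigrams)`, "night" and "nacht" share only the bigram "ht" out of 7 distinct bigrams, giving a Jaccard of 1/7 and a Sorensen-Dice score of 0.25. Swapping in a different tokenizer function (words, trigrams, n-grams) changes the token sets without touching the scoring code.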
