aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity 
algoritham
URL: https://github.com/apache/commons-text/pull/103#issuecomment-470311464
 
 
   Hi @ameyjadiye,
   
   The code is now a good working implementation. However it should be noted 
that the Sorensen-Dice similarity is a binary scoring metric. It can be applied 
to anything where you have two sets to be matched. It is closely related to the 
Jaccard score.
   
   This PR brings in a new algorithm to compute matches using pairs of 
characters (bigrams). This is a conflict with existing similarity measures in 
the package implementing `SimilarityScore<R>`:
   
   JaccardSimilarity - uses `Set<Character>` for matching single characters to 
produce `Double`
   JaroWinklerSimilarity - uses single character matching to produce an edit 
distance which is then converted to a similarity `Double`
   LongestCommonSubsequence - uses single character matching for sub-sequences 
to produce `Integer`
   
   The rest of the package is for an extension of `SimilarityScore<R>` which is 
`EditDistance<R>`:
   
   CosineDistance - uses a `RegexTokenizer` on whitespace to match words to 
produce `Double`
   HammingDistance - uses single character matching to produce `Integer`
   JaccardDistance - just inverts the JaccardSimilarity to produce `Double`
   JaroWinklerDistance - complementary of Jaro-Winkler similarity to produce 
`Double`
   LevenshteinDetailedDistance - single character changes to produce 
`LevenshteinResults`
   LevenshteinDistance - single character changes to produce `Integer`
   LongestCommonSubsequenceDistance - single character based substring to 
produce `Double`
   
   **So all the other measures use single characters or words** 
(CosineDistance).
   
   If you want a Sorensen-Dice distance score using single characters then just 
use the `JaccardDistance` and map it: `S = 2J / (1 + J)`
   
   This class is effectively a JaccardDistance with `bigrams` but with a mapped 
output score.
   
   One suggestion would be a change to be something like 
`BigramJaccardSimilarity` and then compute the Jaccard using bigrams. Then put 
new classes in for `SorensenDiceSimilarity` and `BiSorensenDiceSimilarity`.
   
   However the current JaccardDistance uses Set<Character> which has space and 
efficiency advantages over Set<String>. Once you are using strings then there 
is no reason to have only bigrams.
   
   My preference would be to add a new class `TokenizerJaccardSimilarity` (or 
something nicer) that accepts a `Tokenizer<CharSequence>` in the constructor. 
This then tokenises the input into two sets and computes the union of the two 
sets.
   
   Your `bigram` case can be fulfilled by writing a new 
`Tokenizer<CharSequence>` that returns the bigrams. This new class would be 
flexible for computing the Jaccard for words, bigrams, trigrams, n-grams, or 
whatever else you can tokenise a `CharSequence` into.
   
   Then you add a `TokenSorensonDiceSimilarity` that just maps the Jaccard 
score. You can see this for example in the `JaccardDistance` class.
   
   @kinow Opinions on a more flexible approach?
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to