[GitHub] [commons-text] aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity algoritham

GitBox Sat, 09 Mar 2019 05:16:50 -0800

aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity 
algoritham
URL: https://github.com/apache/commons-text/pull/103#issuecomment-471176120
 
 
   I'll try and summarise where we are:
   
   1. It is not clear what the result of 0 / 0 should be
   
   This is the `"" vs ""` case.
   
   Possibilities:
   ```
   0 / 0 = n / n = 1
   0 / 0 = n / 0 = NaN
   0 / 0 = 0 / n = 0
   ```
   
   The library I found (python distance library) deals with this by returning 
NaN. Other libraries return 1, because the two inputs are the same, even if 
both of them contain nothing.
   
   The current `JaccardSimilarity` has decided to return 0.
   
   The new `SorensenDiceSimilarity` returns 1.
   
   So this discrepancy in our own library should be resolved.
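   
   To make the three options concrete, here is a minimal sketch of a character-set Jaccard index with an explicit choice for the 0 / 0 case. The class, enum and method names are hypothetical and chosen for illustration only; this is not the existing `JaccardSimilarity` code.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; not the commons-text implementation. */
final class EmptyCaseJaccard {

    /** The three candidate results for 0 / 0 listed above. */
    enum ZeroByZero { ONE, NAN, ZERO }

    /** Splits a sequence into its set of distinct characters. */
    static Set<Character> chars(CharSequence cs) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    /** Jaccard index |A intersect B| / |A union B| with a chosen 0 / 0 result. */
    static double jaccard(CharSequence left, CharSequence right, ZeroByZero policy) {
        Set<Character> a = chars(left);
        Set<Character> b = chars(right);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            // Both inputs are empty, so the ratio is 0 / 0 and the policy decides.
            switch (policy) {
                case ONE:
                    return 1.0;
                case ZERO:
                    return 0.0;
                default:
                    return Double.NaN;
            }
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }
}
```

   With this sketch, `jaccard("", "", ZeroByZero.ZERO)` corresponds to what `JaccardSimilarity` does today and `ZeroByZero.ONE` to what the new `SorensenDiceSimilarity` does.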
   
   2. characters or bigrams
   
   How do you split up a `CharSequence` to build a `Set`? There are so many 
ways. Or do you build a `List`?
   
   There are library examples for splitting a string into a set of characters 
(python distance library) and for using bigrams. I note that @kinow's example, 
the python `textdistance` library, supports using a `Set` or `List`, and 
supports variable length n-grams:
   
   ```
   >>> textdistance.Sorensen(qval=1, as_set=1)("aaaba","aab")
   1.0
   >>> textdistance.Sorensen(qval=1, as_set=0)("aaaba","aab")
   0.75
   >>> textdistance.Sorensen(qval=2, as_set=1)("aaaba","aab")
   0.8
   >>> textdistance.Sorensen(qval=2, as_set=0)("aaaba","aab")
   0.6666666666666666
   ```
   
   This problem is solved by #109, which allows the user to define how to 
split the sequence with a function.
   
   The current `JaccardSimilarity` has decided to split into single characters.
   
   The new `SorensenDiceSimilarity` splits into bigrams.
   
   Both use sets. So this discrepancy should be resolved.
   
   I think the solution is to support an n-gram size argument for the measure 
with a default of 1. Then support a `useSet` flag with a default of `true`. This is 
all possible using #109 to do the intersect computation. The key is to not 
break compatibility for anyone already using the `JaccardSimilarity`.
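   
   As a rough sketch of what an n-gram size plus a `useSet` flag could look like, the following computes a Sorensen-Dice score over n-gram counts. It does not use the API from #109; `NGramDice`, `ngrams` and `score` are hypothetical names, and the `List` case is modelled as a multiset (n-gram counts), which is an assumption on my part about how duplicates should be matched.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; names and signatures are illustrative only. */
final class NGramDice {

    /** Counts the n-grams of a sequence. With useSet, every count is capped at 1. */
    static Map<String, Integer> ngrams(CharSequence cs, int n, boolean useSet) {
        if (n < 1) {
            throw new IllegalArgumentException("n-gram size must be at least 1");
        }
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= cs.length(); i++) {
            String gram = cs.subSequence(i, i + n).toString();
            if (useSet) {
                counts.put(gram, 1);
            } else {
                counts.merge(gram, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Sorensen-Dice: 2 * |A intersect B| / (|A| + |B|). Returns NaN when both sides have no n-grams. */
    static double score(CharSequence left, CharSequence right, int n, boolean useSet) {
        Map<String, Integer> a = ngrams(left, n, useSet);
        Map<String, Integer> b = ngrams(right, n, useSet);
        int sizeA = 0;
        int sizeB = 0;
        int common = 0;
        for (int count : a.values()) {
            sizeA += count;
        }
        for (int count : b.values()) {
            sizeB += count;
        }
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            // Multiset intersection: an n-gram is shared as often as it occurs on both sides.
            common += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
        }
        return 2.0 * common / (sizeA + sizeB);
    }
}
```

   For `"aaaba"` vs `"aab"` this gives the same four values as the `textdistance` session above: 1.0 and 0.75 for single characters (set, list), and 0.8 and 0.666... for bigrams (set, list).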
   
   3. How to score a `CharSequence` using bigrams with only 1 character
   
   ```
   "a" vs "a"  = 0   or   1?
   "a" vs "b"  = 0   or   NaN?
   ```
   
   This is similar to point 1 where you are actually scoring 0/0 again since 
you have no bigrams.
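   
   To spell out why: a sequence of length L has max(0, L - n + 1) n-grams, so a single character yields no bigrams at all. A tiny sketch (hypothetical helper names) shows what a naive implementation would then compute. In Java the floating-point ratio silently becomes NaN, whereas an integer 0 / 0 would throw an `ArithmeticException`.

```java
/** Hypothetical sketch; helper names are illustrative only. */
final class ShortSequenceCase {

    /** Number of n-grams in a sequence; zero when the sequence is shorter than n. */
    static int ngramCount(CharSequence cs, int n) {
        return Math.max(0, cs.length() - n + 1);
    }

    /** Naive Sorensen-Dice ratio with no guard for empty inputs. */
    static double naiveDice(int common, int sizeA, int sizeB) {
        // With sizeA == sizeB == 0 this is 0.0 / 0, which evaluates to NaN.
        return 2.0 * common / (sizeA + sizeB);
    }
}
```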
   
   4. Handling null
   
   Currently this just throws an exception. However, `null` is similar to `null`.
   
   So if the library decides not to support `null` because it does not exist, 
where does it stand on a zero-length `CharSequence`? This also does not exist, 
but it does have an established meaning in the context of strings.
   
   Currently the stance in `JaccardSimilarity` is that it is a programming 
error to pass around `null` and expect comparisons, but it is perfectly 
reasonable to pass around empty sequences.
   
   
   **Discussion**
   
   To consolidate points 1, 3 and 4 perhaps we can change the library to state that:
   
   - Similarity is undefined for `null` and throws an exception
   - Similarity between two empty sequences is either: (a) exception; (b) 
perfect; or (c) nothing
   - Similarity between equal sequences that do not satisfy the criteria for 
the comparison algorithm is either: (a) perfect; or (b) nothing
   
   Then support n-gram size and both the `Set` and `List` approaches to 
defining the two collections of n-grams to compare.
   
   This involves some decisions.
   
   My vote is:
   
   - Add n-gram and useSet parameters to the `JaccardSimilarity`
   - Add n-gram and useSet parameters to the `SorensenDiceSimilarity`
   - Make similarity 1 if the sequence does not satisfy the criteria for the 
algorithm but is identical. This is less destructive than throwing an exception 
or returning NaN
   
   As long as the Javadoc explains the chosen behaviour, I would be happy with 
any of the options. But my preference would be for equal sequences to be 
treated as similar:
   
   ```
   "" vs "" == 1 (using a character comparison algorithm)
   "a" vs "a" == 1 (using a bigram algorithm)
   ```
   
   This is in line with other similarity measures in the library which state 
equal sequences are perfectly similar.
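   
   One way this preference could be expressed is as a thin wrapper around any underlying measure. This is only a sketch: `EqualIsSimilar` and `similarity` are hypothetical names, the measure is passed in as a plain function rather than through the #109 API, and returning 0 for unequal inputs that the measure cannot score (e.g. `"a"` vs `"b"` with bigrams) is my assumption, since that case is still open.

```java
import java.util.function.ToDoubleBiFunction;

/** Hypothetical sketch; not part of commons-text. */
final class EqualIsSimilar {

    /**
     * Rejects null, scores equal sequences as 1, and otherwise defers to the measure.
     * An undefined (NaN) score for unequal inputs is mapped to 0 (assumption).
     */
    static double similarity(CharSequence left, CharSequence right,
                             ToDoubleBiFunction<CharSequence, CharSequence> measure) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("Input cannot be null");
        }
        if (left.toString().contentEquals(right)) {
            // Covers "" vs "" and "a" vs "a" even when the measure has nothing to compare.
            return 1.0;
        }
        double score = measure.applyAsDouble(left, right);
        return Double.isNaN(score) ? 0.0 : score;
    }
}
```

   The equality check runs before the measure, so the result for identical inputs does not depend on the n-gram size or on the `Set` versus `List` choice.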
   
   Implementing a more generic approach can be done using methods in #109. So 
perhaps hold this PR until that is resolved.
   
   Alex


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
