aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity algoritham URL: https://github.com/apache/commons-text/pull/103#issuecomment-471176120 I'll try and summarise where we are: 1. It is not clear what the result should be of 0 / 0 This is the `"" vs ""` case. Possibles: ``` 0 / 0 = n / n = 1 0 / 0 = n / 0 = NaN 0 / 0 = 0 / n = 0 ``` The library I found (python distance library) dealt with this by returning NaN. Others have been found that return 1 (because they are the same, even if they are the same of nothing). The current `JaccardSimilarity` has decided to return 0. The new `SorensenDiceSimilarity` returns 1. So this discrepancy in our own library should be resolved. 2. characters or bigrams How do you split up a `CharSequence` to build a `Set`? There are so many ways. Or do you build a `List`? There are library examples for splitting a string into a set of characters (python distance library) and for using bigrams. I note that @kinow example, the python `textdistance` library, supports using a `Set` or `List`, and supports variable length n-grams: ``` >>> textdistance.Sorensen(qval=1, as_set=1)("aaaba","aab") 1.0 >>> textdistance.Sorensen(qval=1, as_set=0)("aaaba","aab") 0.75 >>> textdistance.Sorensen(qval=2, as_set=1)("aaaba","aab") 0.8 >>> textdistance.Sorensen(qval=2, as_set=0)("aaaba","aab") 0.6666666666666666 ``` This problem is solved by #109 which allows the user to define how to split the sequence with a function. The current `JaccardSimilarity` has decided to split into single characters. The new `SorensenDiceSimilarity` splits into bigrams. Both use sets. So this discrepancy should be resolved. I think the solution is to support an n-gram size argument for the measure with default of 1. Then support a `useSet` flag with default of `true`. This is all possible using #109 to do the intersect computation. The key is to not break compatibility for anyone already using the `JaccardSimilarity`. 3. How to score a `CharSequence` using bigrams with only 1 character ``` "a" vs "a" = 0 or 1? "a" vs "b" = 0 or NaN? ``` This is similar to point 1 where you are actually scoring 0/0 again since you have no bigrams. 4. Handling null Currently this just throws an exception. However `null` is similar to `null`. So if the library decides to not support `null` because it does not exists where does it stand on a zero length `CharSequence`. This does also not exist, but does have established meaning in the context of strings. Currently the stance in the Jaccard is it is a programming error to pass around `null` and expect comparisons but it is perfectly reasonable to pass around empty stuff. **Discussion** To consolidate 0, 1 and 4 perhaps we can change the library to state that: - Similarity is undefined for `null` and throw an exception - Similarity between two empty sequences is either: (a) exception; (b) perfect; or (c) nothing - Similarity between equal sequences that do not satisfy the criteria for the comparison algorithm is either: (a) perfect; or (b) nothing Then support n-gram size and both the `Set` or `List` approach to defining the two collections of n-grams to compare. This involves some decisions. My vote is: - Add n-gram and useSet parameters to the `JaccardSimilarity` - Add n-gram and useSet parameters to the `SorensenDiceSimilarity` - Make similarity 1 if the sequence does not satisfy the criteria for the algorithm but is identical. This is less destructive than throwing an exception or returning NaN As long as the Javadoc explains the chosen functionality then the I would be happy with any of the options. But my preference would be for equality being similar: ``` "" vs "" == 1 (using a character comparison algorithm) "a" vs "a" == 1 (using a bigram algorithm) ``` This is in-line with other similarity measures in the library which state equal sequences are perfectly similar. Implementing a more generic approach can be done using methods in #109. So perhaps hold this PR until that is resolved. Alex
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
