[GitHub] [commons-text] aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity algoritham

GitBox Sun, 24 Mar 2019 08:06:06 -0700

aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity 
algoritham
URL: https://github.com/apache/commons-text/pull/103#issuecomment-475967738
 
 
   Overall this class now works nicely. But I think that my initial 
reservations have not been met.
   
   There is no reason that this should use bigrams, or allow duplicate bigrams.
   
   If this goes through then you end up with a library that can compute the 
Jaccard with unique single characters (by using a `Set`) and the Sorensen with 
bigrams including duplicates (by using a `List` converted to a `Bag`). Since 
the two scores are related by the equations:
   
   ```
   J = S / (2 - S)
   S = 2J / (1 + J)
   ```
   
   It really does not make sense to compute one with a different method from 
the other. 
   
   I would actually state we have:
   
   - J1U and S2D
   
   Where the number is the number of characters (n-gram), U is for unique 
(non-duplicates), and D is for duplicates.
   
   We actually have possibilities for the following in the library:
   
   - J1U
   - J1D
   - J2U
   - J2D
   - S1U
   - S1D
   - S2U
   - S2D
   
   Or generically: [JS]N[UD]
   
   Putting in an S2D metric to complement a J1U metric makes a confusing and 
inconsistent library.
   
   BTW. I believe bigrams is better and duplicates is better. So this is the 
best algorithm. It just totally contradicts the current Jaccard implementation.
   
   A cleaner design is to have both metrics support single or bigrams (or even 
n-grams) and also allow or prevent duplicates via constructor arguments.
   
   The current Jaccard can be modified to do so by allowing the default 
constructor to default to single, non-duplicate mode and overloaded 
constructors added to it.
   
   This idea may be best achieved by pushing shared functionality into a 
`NGramIntersectionResult` class that accepts the size of the n-gram and the 
duplicates option and computes the IntersectionResult. This can internally use 
an `IntersectionSimilarity` class to do the computation.
   
   ```
   class NGramIntersectionResult {
   
       NGramIntersectionResult(int ngram, boolean duplicates) {
            // ...
       }
   
       IntersectionResult compute(CharSequence cs1, CharSequence cs2) {
            // Use IntersectionSimilarity to compute the IntersectionResult:
            // Pass to the IntersectionSimilarity the function to split the 
CharSequence into 
            // the appropriate collection (Set or List) using n-grams
            //
            // All the handling of the generic type (Character, Integer, 
String) used for the n-gram
            // should be done in here and hidden from the API.
       }
   }
   ```
   
   This can be made more generic if necessary. But this then leads to the 
library that @kinow identified 
([Simetrics](https://github.com/Simmetrics/simmetrics)) which can support 
tokenisation of input, case handling, splitting to n-grams, handling 
duplicates, extended unicode character handling, etc. 
   
   This level of generalisation is out of scope for the current similarity 
package. But n-grams and duplicates should be in scope because this is the 
difference between this algorithm for S2D and the current library J1U algorithm.
   
   @ameyjadiye Would you like to add support for n-grams and unique/duplicate 
matching to this or attempt to add a package scope class that can be shared 
with the JaccardSimilarity?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

[GitHub] [commons-text] aherbert commented on issue #103: TEXT-126: Adding Sorensen-Dice similarity algoritham

Reply via email to