aherbert commented on issue #109: TEXT-155: Add a generic 
IntersectionSimilarity measure
URL: https://github.com/apache/commons-text/pull/109#issuecomment-471171172
 
 
   Hi Bruno,
   
   I'll cover most things in this comment.
   
   1. Name
   
   Since this class computes the intersection, I called it 
`IntersectionSimilarity`. However, it also computes the union. A separate class 
to compute the union does not make sense since the two are intertwined. It is 
a classic [Venn diagram](https://en.wikipedia.org/wiki/Venn_diagram) of two 
sets: if you know the size of each set and the intersection, you can derive 
the union (union = size A + size B - intersection), and vice versa.
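
   As a rough illustration of that relationship (a sketch only, not code from 
the PR; the class name `UnionFromIntersection` and the sample sets are made up 
for the example), the union size obtained by building the union matches the 
value derived from the two sizes and the intersection:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; not part of the PR.
class UnionFromIntersection {

    /** Union size derived from the two set sizes and the intersection size. */
    static int unionSize(int sizeA, int sizeB, int intersection) {
        return sizeA + sizeB - intersection;
    }

    public static void main(String[] args) {
        // Made-up sample data: character bigrams of two short strings.
        Set<String> a = new HashSet<>(Arrays.asList("ab", "bc", "cd"));
        Set<String> b = new HashSet<>(Arrays.asList("bc", "cd", "de", "ef"));

        // Intersection computed explicitly.
        Set<String> intersect = new HashSet<>(a);
        intersect.retainAll(b);

        // Union computed explicitly, for comparison only.
        Set<String> union = new HashSet<>(a);
        union.addAll(b);

        // Both lines print the same value.
        System.out.println(union.size());
        System.out.println(unionSize(a.size(), b.size(), intersect.size()));
    }
}
```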
   
   How about `OverlapSimilarity`?
   
   2. Computing the Jaccard + Sorensen-Dice
   
   I had added it mainly during testing to check against known results. But 
there are a lot more metrics that can be computed using the true positives 
(intersect), false positives (size A - intersect) and false negatives (size B 
- intersect). Since we are not going to add them all, I suggest changing to:
   
   `OverlapResult` - defines set size A, B, intersect and union (defined using 
the other 3 values).
   
   It could also define falsePositives (using A) and falseNegatives (using B) 
as utility methods. Or leave them out as we do not need them at the moment.
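
   A minimal sketch of what such a result type might look like. This is an 
assumption about its shape, not the final API: the accessor names, the argument 
validation and the use of `long` for the union are all illustrative choices.

```java
// Hypothetical sketch of the proposed result type; names are assumptions.
final class OverlapResult {
    private final int sizeA;
    private final int sizeB;
    private final int intersection;

    OverlapResult(int sizeA, int sizeB, int intersection) {
        if (sizeA < 0 || sizeB < 0) {
            throw new IllegalArgumentException("Set sizes must not be negative");
        }
        if (intersection < 0 || intersection > Math.min(sizeA, sizeB)) {
            throw new IllegalArgumentException("Invalid intersection: " + intersection);
        }
        this.sizeA = sizeA;
        this.sizeB = sizeB;
        this.intersection = intersection;
    }

    int getSizeA() {
        return sizeA;
    }

    int getSizeB() {
        return sizeB;
    }

    int getIntersection() {
        return intersection;
    }

    /** Union defined using the other three values; long avoids int overflow. */
    long getUnion() {
        return (long) sizeA + sizeB - intersection;
    }

    /** Optional utility: elements of A that are not in the intersection. */
    int getFalsePositives() {
        return sizeA - intersection;
    }

    /** Optional utility: elements of B that are not in the intersection. */
    int getFalseNegatives() {
        return sizeB - intersection;
    }
}
```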
   
   3. Shared functionality
   
   This class can then be used within existing similarity scores in the library 
(Jaccard) and for the new Sorensen-Dice similarity. I thought that was the 
intention.
   
   It will be up to those classes to appropriately use the `OverlapResult` to 
compute their score.
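
   For illustration, the two scores could be derived from the three values an 
`OverlapResult` would carry roughly as follows. The helper class 
`OverlapScores` is hypothetical and only stands in for the library's own 
Jaccard and Sorensen-Dice classes; returning 1.0 when both sets are empty is 
an assumed convention, not something decided in this PR.

```java
// Hypothetical helper; stands in for the library's own similarity classes.
final class OverlapScores {

    private OverlapScores() {
        // Static methods only.
    }

    /** Jaccard = |A n B| / |A u B|, with the union derived from the sizes. */
    static double jaccard(int sizeA, int sizeB, int intersection) {
        long union = (long) sizeA + sizeB - intersection;
        // Assumed convention: two empty sets are treated as identical.
        return union == 0 ? 1.0 : (double) intersection / union;
    }

    /** Sorensen-Dice = 2 |A n B| / (|A| + |B|). */
    static double sorensenDice(int sizeA, int sizeB, int intersection) {
        long total = (long) sizeA + sizeB;
        // Assumed convention: two empty sets are treated as identical.
        return total == 0 ? 1.0 : 2.0 * intersection / total;
    }
}
```

   With size A = 3, size B = 4 and an intersection of 2, this gives a Jaccard 
score of 2/5 and a Sorensen-Dice score of 4/7.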
   
   Alex
   
   
   
