aherbert commented on issue #109: TEXT-155: Add a generic IntersectionSimilarity measure
URL: https://github.com/apache/commons-text/pull/109#issuecomment-471171172

Hi Bruno,

I'll cover most things in this comment.

1. Name

Since this class computes the intersect I called it `IntersectionSimilarity`. However, it also computes the union. A separate class to compute the union does not make sense since the two things are intertwined. It is a classic [Venn diagram](https://en.wikipedia.org/wiki/Venn_diagram) of two sets: if you know the size of each set, then computing the intersect gives you the union, and vice versa.

How about `OverlapSimilarity`?

2. Computing the Jaccard + Sorensen-Dice

I had added these mainly during testing to check against known results. But there are a lot more metrics that can be computed using the true positives (intersect), false positives (size A - intersect) and false negatives (size B - intersect).

Since we are not going to add them all, I propose changing to:

`OverlapResult` - defines set size A, set size B, intersect and union (the union being derived from the other 3 values).

It could also define falsePositives (using A) and falseNegatives (using B) as utility methods, or we could leave them out as we do not need them at the moment.

3. Shared functionality

This class can then be used within existing similarity scores in the library (Jaccard) and for the new Sorensen-Dice similarity. I thought that was the intention. It will be up to those classes to use the `OverlapResult` appropriately to compute their score.

Alex
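
To make the proposal concrete, here is a minimal sketch of what an `OverlapResult` could look like and how Jaccard and Sorensen-Dice could be derived from it. This is not the code in the pull request: the method names, the `of` factory, the `OverlapScores` helper and the convention of returning 1.0 for two empty sets are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the proposed result type; not the class from the PR. */
final class OverlapResult {
    private final int sizeA;
    private final int sizeB;
    private final int intersection;

    OverlapResult(int sizeA, int sizeB, int intersection) {
        if (sizeA < 0 || sizeB < 0 || intersection < 0
                || intersection > Math.min(sizeA, sizeB)) {
            throw new IllegalArgumentException("Inconsistent sizes");
        }
        this.sizeA = sizeA;
        this.sizeB = sizeB;
        this.intersection = intersection;
    }

    /** Assumed convenience factory: counts the elements common to two sets. */
    static <T> OverlapResult of(Set<T> a, Set<T> b) {
        Set<T> common = new HashSet<>(a);
        common.retainAll(b);
        return new OverlapResult(a.size(), b.size(), common.size());
    }

    int getSizeA() { return sizeA; }
    int getSizeB() { return sizeB; }
    int getIntersection() { return intersection; }

    /** The union is derived from the other three values. */
    int getUnion() { return sizeA + sizeB - intersection; }

    /** Elements of A that are not in B. */
    int getFalsePositives() { return sizeA - intersection; }

    /** Elements of B that are not in A. */
    int getFalseNegatives() { return sizeB - intersection; }
}

/** Hypothetical consumers: each score decides how to use the result. */
final class OverlapScores {
    private OverlapScores() { }

    /** Jaccard index: |A n B| / |A u B|. Returns 1.0 for two empty sets (a choice made here). */
    static double jaccard(OverlapResult r) {
        int union = r.getUnion();
        return union == 0 ? 1.0 : (double) r.getIntersection() / union;
    }

    /** Sorensen-Dice coefficient: 2|A n B| / (|A| + |B|). Same empty-set choice as above. */
    static double sorensenDice(OverlapResult r) {
        int total = r.getSizeA() + r.getSizeB();
        return total == 0 ? 1.0 : 2.0 * r.getIntersection() / total;
    }
}
```

For A = {a, b, c} and B = {b, c, d, e} the intersect is 2 and the union is 3 + 4 - 2 = 5, so Jaccard is 2/5 and Sorensen-Dice is 4/7. Keeping the scores outside `OverlapResult` matches the idea that each similarity class decides how to use the counts.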
