[jira] [Work logged] (TEXT-155) Add a generic SetSimilarity measure

ASF GitHub Bot (JIRA) Sat, 09 Mar 2019 04:01:40 -0800


     [ 
https://issues.apache.org/jira/browse/TEXT-155?focusedWorklogId=210522&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-210522
 ]


ASF GitHub Bot logged work on TEXT-155:
---------------------------------------

                Author: ASF GitHub Bot
            Created on: 09/Mar/19 12:00
            Start Date: 09/Mar/19 12:00
    Worklog Time Spent: 10m 
      Work Description: aherbert commented on issue #109: TEXT-155: Add a 
generic IntersectionSimilarity measure
URL: https://github.com/apache/commons-text/pull/109#issuecomment-471171172
 
 
   Hi Bruno,
   
   I'll cover most things in this comment.
   
   1. Name
   
   Since this class computes the intersection I called it 
`IntersectionSimilarity`. However, it also computes the union. A separate class 
to compute the union does not make sense since the two are intertwined. It is 
a classic [Venn diagram](https://en.wikipedia.org/wiki/Venn_diagram) of two 
sets: if you know the size of each set, then computing the intersection gives 
you the union, and vice versa.
   
   How about `OverlapSimilarity`?
   
   2. Computing the Jaccard + Sorensen-Dice
   
   I had added it mainly during testing to check against known results. But 
there are many more metrics that can be computed from the true positives 
(intersect), false positives (size A - intersect) and false negatives 
(size B - intersect). Since we are not going to add them all, I suggest 
changing to:
   
   `OverlapResult` - defines set size A, B, intersect and union (defined using 
the other 3 values).
   
   It could also define falsePositives (using A) and falseNegatives (using B) 
as utility methods. Or leave them out as we do not need them at the moment.
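   
   Roughly, such a result holder could look like the sketch below. Only the 
name `OverlapResult` comes from the proposal above; the constructor, accessor 
names and validation are assumptions made for illustration:
   
```java
// Hypothetical sketch of the proposed result holder. Accessor names and
// the argument checks are assumptions, not the PR's actual API.
final class OverlapResult {
    private final int sizeA;
    private final int sizeB;
    private final int intersection;

    OverlapResult(int sizeA, int sizeB, int intersection) {
        if (sizeA < 0 || sizeB < 0) {
            throw new IllegalArgumentException("Set sizes must not be negative");
        }
        if (intersection < 0 || intersection > Math.min(sizeA, sizeB)) {
            throw new IllegalArgumentException("Invalid intersection: " + intersection);
        }
        this.sizeA = sizeA;
        this.sizeB = sizeB;
        this.intersection = intersection;
    }

    int getSizeA() { return sizeA; }

    int getSizeB() { return sizeB; }

    int getIntersection() { return intersection; }

    // The union is defined using the other three values.
    int getUnion() { return sizeA + sizeB - intersection; }

    // Optional utility methods: elements only in A, and elements only in B.
    int getFalsePositives() { return sizeA - intersection; }

    int getFalseNegatives() { return sizeB - intersection; }

    public static void main(String[] args) {
        OverlapResult result = new OverlapResult(3, 4, 2);
        System.out.println(result.getUnion());
    }
}
```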
   
   3. Shared functionality
   
   This class can then be used within existing similarity scores in the library 
(Jaccard) and for the new Sorensen-Dice similarity. I thought that was the 
intention.
   
   It will be up to those classes to appropriately use the `OverlapResult` to 
compute their score.
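   
   For example, both scores need only the two set sizes and the intersection. 
The sketch below uses the standard definitions (Jaccard = intersection / union, 
Sorensen-Dice = 2 * intersection / (size A + size B)); the class name and the 
choice to score two empty sets as 1 are assumptions of this sketch, not 
behaviour taken from the library:
   
```java
// Hypothetical helper showing how a score class might consume the overlap sizes.
class OverlapScores {

    // Jaccard index: |A ∩ B| / |A ∪ B|.
    static double jaccard(int sizeA, int sizeB, int intersection) {
        int union = sizeA + sizeB - intersection;
        // Assumption: two empty sets are treated as identical.
        return union == 0 ? 1.0 : (double) intersection / union;
    }

    // Sorensen-Dice coefficient (F1-score): 2|A ∩ B| / (|A| + |B|).
    static double sorensenDice(int sizeA, int sizeB, int intersection) {
        int total = sizeA + sizeB;
        // Assumption: two empty sets are treated as identical.
        return total == 0 ? 1.0 : 2.0 * intersection / total;
    }

    public static void main(String[] args) {
        // Sizes for two illustrative sets with 3 and 4 elements sharing 2.
        System.out.println(jaccard(3, 4, 2));
        System.out.println(sorensenDice(3, 4, 2));
    }
}
```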
   
   Alex
   
   
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 210522)
    Time Spent: 3h 10m  (was: 3h)

> Add a generic SetSimilarity measure
> -----------------------------------
>
>                 Key: TEXT-155
>                 URL: https://issues.apache.org/jira/browse/TEXT-155
>             Project: Commons Text
>          Issue Type: New Feature
>    Affects Versions: 1.6
>            Reporter: Alex D Herbert
>            Priority: Minor
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> The {{SimilarityScore<T>}} interface can be used to compute a generic result. 
> I propose to add a class that can compute the intersection between two sets 
> formed from the characters. The sets must be formed from the {{CharSequence}} 
> input to the {{apply}} method using a {{Function<CharSequence, Set<T>>}} to 
> convert the {{CharSequence}}. This function can be passed to the 
> {{SimilarityScore<T>}} during construction.
> The computed result then holds the size of each set and the size of the 
> intersection.
> I have created an implementation that can compute the equivalent of the 
> {{JaccardSimilarity}} class by creating {{Set<Character>}} and also the 
> F1-score using bigrams (pairs of characters) by creating {{Set<String>}}. 
> This relates to 
> [Text-126|https://issues.apache.org/jira/projects/TEXT/issues/TEXT-126] which 
> suggested an algorithm for the Sorensen-Dice similarity, also known as the 
> F1-score.
> Here is an example:
> {code:java}
> // Match the functionality of the JaccardSimilarity class
> Function<CharSequence, Set<Character>> converter = (cs) -> {
>     final Set<Character> set = new HashSet<>();
>     for (int i = 0; i < cs.length(); i++) {
>         set.add(cs.charAt(i));
>     }
>     return set;
> };
> IntersectionSimilarity<Character> similarity = new IntersectionSimilarity<>(converter);
> IntersectionResult result = similarity.apply("something", "something else");
> {code}
> The result has the size of set A, set B and the intersection between them.
> This class was inspired by my look through the various similarity 
> implementations. All of them except the {{CosineSimilarity}} perform single 
> character matching between the input {{CharSequence}}s. The 
> {{CosineSimilarity}} tokenises using whitespace to create words.
> This more generic type of implementation will allow a user to determine how 
> to divide the {{CharSequence}} to create the sets that are compared, e.g. 
> single characters, words, bigrams, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
