[jira] [Work logged] (TEXT-155) Add a generic SetSimilarity measure

ASF GitHub Bot (JIRA) Thu, 07 Mar 2019 15:19:38 -0800


     [ 
https://issues.apache.org/jira/browse/TEXT-155?focusedWorklogId=209823&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-209823
 ]


ASF GitHub Bot logged work on TEXT-155:
---------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Mar/19 22:54
            Start Date: 07/Mar/19 22:54
    Worklog Time Spent: 10m 
      Work Description: aherbert commented on issue #109: TEXT-155: Add a 
generic IntersectionSimilarity measure
URL: https://github.com/apache/commons-text/pull/109#issuecomment-470733159
 
 
   This class could be used internally by the `JaccardSimilarity`. It could 
also be used by a new Sorensen-Dice similarity (although I prefer F1Score). 
**Note: the new implementation of the Sorensen-Dice similarity uses bigrams, 
not single characters.**
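
   As a rough sketch (not the code in the PR), both scores can be derived from 
the three sizes that the intersection produces. The class name `SetScores` and 
the handling of two empty sets are assumptions made for illustration only.

```java
/** Hypothetical helper for illustration; not part of commons-text. */
final class SetScores {
    private SetScores() {}

    /** Jaccard index: |A ∩ B| / |A ∪ B|, using |A ∪ B| = |A| + |B| - |A ∩ B|. */
    static double jaccard(int sizeA, int sizeB, int intersection) {
        final int union = sizeA + sizeB - intersection;
        // Assumption: two empty sets score 1.0; a real implementation may choose 0.
        return union == 0 ? 1.0 : (double) intersection / union;
    }

    /** Sorensen-Dice coefficient (F1-score): 2 |A ∩ B| / (|A| + |B|). */
    static double dice(int sizeA, int sizeB, int intersection) {
        final int total = sizeA + sizeB;
        // Assumption: two empty sets score 1.0; a real implementation may choose 0.
        return total == 0 ? 1.0 : 2.0 * intersection / total;
    }
}
```

   For example, the bigram sets of "night" and "nacht" each have 4 elements and 
share only "ht", so `SetScores.dice(4, 4, 1)` gives 0.25 and 
`SetScores.jaccard(4, 4, 1)` gives 1/7.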
   
   But I do not see why it should be package private. This functionality is 
generic and may be of use to someone. However I am fine with either. My 
motivation was to try and unify the idea of splitting a `CharSequence` into 
`things` and then computing an intersection of two sets of `things`. 
   
   If this class is public then the user can write their own method to create 
the `Set` from the `CharSequence`. The rest of the similarity metric is 
implemented. Alternatively we hide it at the package level and create the 
`SorensonDiceSimilarity` equivalent to the `JaccardSimilarity`. I'd suggest 
adding parameterised constructors to denote what size N-gram to use. The 
default can be 1. Then add a package level utils class to handle creating the 
`Set<N-gram>` for whatever size N-gram is requested.
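
   A minimal sketch of what such a package level utility might look like is 
given here. The name `NGramSets` and the method `ofSize` are hypothetical, and 
the window slides over `char` values, so supplementary code points are not 
treated specially.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical package-level helper; the name and API are illustrative only. */
final class NGramSets {
    private NGramSets() {}

    /** Returns a converter that splits a CharSequence into its distinct N-grams. */
    static Function<CharSequence, Set<String>> ofSize(final int n) {
        if (n < 1) {
            throw new IllegalArgumentException("N-gram size must be at least 1: " + n);
        }
        return cs -> {
            final Set<String> set = new HashSet<>();
            // Slide a window of n chars over the input.
            // An input shorter than n yields an empty set.
            for (int i = 0; i + n <= cs.length(); i++) {
                set.add(cs.subSequence(i, i + n).toString());
            }
            return set;
        };
    }
}
```

   With this, `NGramSets.ofSize(1)` would correspond to the single character 
default and `NGramSets.ofSize(2)` to bigrams.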
   
   However, limiting to N-grams stops someone from computing a `Jaccard` 
using words, ref. the `CosineSimilarity` which splits using whitespace. If the 
class were public this would be possible.
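
   To illustrate, a user-supplied word converter could be as small as the 
sketch below. `WordSets` is a hypothetical name, the whitespace split is only 
one possible tokenisation, and returning 0 when both inputs are blank is an 
assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical example of a user-defined converter; not part of commons-text. */
final class WordSets {
    private WordSets() {}

    /** Splits on runs of whitespace and collects the distinct words. */
    static final Function<CharSequence, Set<String>> WORDS = cs -> {
        final Set<String> set =
            new HashSet<>(Arrays.asList(cs.toString().trim().split("\\s+")));
        // Splitting a blank string yields a single empty token; drop it.
        set.remove("");
        return set;
    };

    /** Jaccard index over word sets: |A ∩ B| / |A ∪ B|. */
    static double jaccard(CharSequence left, CharSequence right) {
        final Set<String> a = WORDS.apply(left);
        final Set<String> b = WORDS.apply(right);
        final Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        final int union = a.size() + b.size() - common.size();
        // Assumption: two blank inputs score 0.
        return union == 0 ? 0.0 : (double) common.size() / union;
    }
}
```

   For "the quick brown fox" and "the lazy brown dog" the shared words are 
"the" and "brown" out of 6 distinct words, so the score is 1/3.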
   
   The API link you gave (thanks) just uses functions already in Google's Guava 
for working with `Set` and their own version of `commons-collections` `Bag` 
named `Multiset`. It would be easy to extend this helper class to use a `Bag` 
approach but I did not do it because `commons-text` did not depend on 
`commons-collections`.
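
   For reference, a bag style intersection can be sketched with plain 
`java.util` maps, so no extra dependency is needed. This is only an 
illustration under the assumption that a bag is modelled as a map from element 
to count; `BagIntersection` is a hypothetical name and is not in the PR.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a multiset (bag) intersection; not part of commons-text. */
final class BagIntersection {
    private BagIntersection() {}

    /** Counts how many times each character occurs in the input. */
    static Map<Character, Integer> counts(CharSequence cs) {
        final Map<Character, Integer> bag = new HashMap<>();
        for (int i = 0; i < cs.length(); i++) {
            bag.merge(cs.charAt(i), 1, Integer::sum);
        }
        return bag;
    }

    /** Size of the bag intersection: the sum over elements of the smaller count. */
    static <T> int intersectionSize(Map<T, Integer> a, Map<T, Integer> b) {
        int total = 0;
        for (final Map.Entry<T, Integer> e : a.entrySet()) {
            total += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
        }
        return total;
    }
}
```

   Unlike a `Set`, this counts repeats: "aab" and "abb" share one `a` and one 
`b`, giving an intersection of size 2.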
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 209823)
    Time Spent: 50m  (was: 40m)

> Add a generic SetSimilarity measure
> -----------------------------------
>
>                 Key: TEXT-155
>                 URL: https://issues.apache.org/jira/browse/TEXT-155
>             Project: Commons Text
>          Issue Type: New Feature
>    Affects Versions: 1.6
>            Reporter: Alex D Herbert
>            Priority: Minor
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> The {{SimilarityScore<T>}} interface can be used to compute a generic result. 
> I propose to add a class that can compute the intersection between two sets 
> formed from the characters. The sets must be formed from the {{CharSequence}} 
> input to the {{apply}} method using a {{Function<CharSequence, Set<T>>}} to 
> convert the {{CharSequence}}. This function can be passed to the 
> {{SimilarityScore<T>}} during construction.
> The result then contains the size of each set and the size of their 
> intersection.
> I have created an implementation that can compute the equivalent of the 
> {{JaccardSimilarity}} class by creating {{Set<Character>}} and also the 
> F1-score using bigrams (pairs of characters) by creating {{Set<String>}}. 
> This relates to 
> [TEXT-126|https://issues.apache.org/jira/projects/TEXT/issues/TEXT-126] which 
> suggested an algorithm for the Sorensen-Dice similarity, also known as the 
> F1-score.
> Here is an example:
> {code:java}
> // Match the functionality of the JaccardSimilarity class
> Function<CharSequence, Set<Character>> converter = (cs) -> {
>     final Set<Character> set = new HashSet<>();
>     for (int i = 0; i < cs.length(); i++) {
>         set.add(cs.charAt(i));
>     }
>     return set;
> };
> IntersectionSimilarity<Character> similarity = new IntersectionSimilarity<>(converter);
> IntersectionResult result = similarity.apply("something", "something else");
> {code}
> The result has the size of set A, set B and the intersection between them.
> This class was inspired by my look through the various similarity 
> implementations. All of them except the {{CosineSimilarity}} perform single 
> character matching between the input {{CharSequence}}s. The 
> {{CosineSimilarity}} tokenises using whitespace to create words.
> This more generic type of implementation will allow a user to determine how 
> to divide the {{CharSequence}} to create the sets that are compared, e.g. 
> single characters, words, bigrams, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
