Alex D Herbert created TEXT-155:
-----------------------------------

             Summary: Add a generic SetSimilarity measure
                 Key: TEXT-155
                 URL: https://issues.apache.org/jira/browse/TEXT-155
             Project: Commons Text
          Issue Type: New Feature
    Affects Versions: 1.6
            Reporter: Alex D Herbert


The {{SimilarityScore<T>}} interface can be used to compute a generic result. I 
propose to add a class that computes the intersection between two sets 
formed from the input. Each set is built from a {{CharSequence}} argument of 
the {{apply}} method using a {{Function<CharSequence, Set<T>>}} that performs 
the conversion. This function is passed to the {{SimilarityScore<T>}} 
implementation during construction.

The result then holds the size of each set and the size of their 
intersection.
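
To make the idea concrete, here is a minimal standalone sketch of such a class. It is not the implementation attached to this issue: the names {{IntersectionSketch}}, {{Result}} and {{characters}} are hypothetical, and it deliberately does not implement {{SimilarityScore<T>}} so that it compiles without Commons Text on the classpath.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical stand-in for the proposed class; not the Commons Text API. */
final class IntersectionSketch<T> {

    /** Holds the size of set A, set B and their intersection. */
    static final class Result {
        final int sizeA;
        final int sizeB;
        final int intersection;

        Result(int sizeA, int sizeB, int intersection) {
            this.sizeA = sizeA;
            this.sizeB = sizeB;
            this.intersection = intersection;
        }
    }

    /** Converts each input sequence into the set of elements to compare. */
    private final Function<CharSequence, Set<T>> converter;

    IntersectionSketch(Function<CharSequence, Set<T>> converter) {
        this.converter = converter;
    }

    /** Builds both sets and counts the elements they share. */
    Result apply(CharSequence left, CharSequence right) {
        final Set<T> a = converter.apply(left);
        final Set<T> b = converter.apply(right);
        // Copy before retainAll so the converter's set is not modified.
        final Set<T> common = new HashSet<>(a);
        common.retainAll(b);
        return new Result(a.size(), b.size(), common.size());
    }

    /** Example converter: the set of distinct characters in the sequence. */
    static Set<Character> characters(CharSequence cs) {
        final Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }
}
```

Keeping the converter as a constructor argument means the same counting logic can be reused for any element type without subclassing.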

I have created an implementation that can compute the equivalent of the 
{{JaccardSimilarity}} class by creating {{Set<Character>}} and also the F1-score 
using bigrams (pairs of characters) by creating {{Set<String>}}. This relates 
to [TEXT-126|https://issues.apache.org/jira/projects/TEXT/issues/TEXT-126] 
which suggested an algorithm for the Sorensen-Dice similarity, also known as 
the F1-score.
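
For reference, both scores follow from the three sizes alone: Jaccard is |A ∩ B| / (|A| + |B| - |A ∩ B|) and Sorensen-Dice is 2|A ∩ B| / (|A| + |B|). The helper below is only a sketch; the class name {{SetMetrics}} is hypothetical, and returning 0 when both sets are empty is my assumption, not necessarily what the existing classes do.

```java
/** Hypothetical helper deriving similarity scores from set sizes. */
final class SetMetrics {

    private SetMetrics() {
        // Static utility, no instances.
    }

    /** Jaccard index: intersection size divided by union size. */
    static double jaccard(int sizeA, int sizeB, int intersection) {
        final int union = sizeA + sizeB - intersection;
        // Assumption: two empty sets score 0 rather than 1.
        return union == 0 ? 0.0 : (double) intersection / union;
    }

    /** Sorensen-Dice coefficient (F1-score): 2|A n B| / (|A| + |B|). */
    static double sorensenDice(int sizeA, int sizeB, int intersection) {
        final int total = sizeA + sizeB;
        // Assumption: two empty sets score 0 rather than 1.
        return total == 0 ? 0.0 : 2.0 * intersection / total;
    }
}
```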

Here is an example:
{code:java}
// Match the functionality of the JaccardSimilarity class
Function<CharSequence, Set<Character>> converter = (cs) -> {
    final Set<Character> set = new HashSet<>();
    for (int i = 0; i < cs.length(); i++) {
        set.add(cs.charAt(i));
    }
    return set;
};
IntersectionSimilarity<Character> similarity = new IntersectionSimilarity<>(converter);
IntersectionResult result = similarity.apply("something", "something else");
{code}
The result has the size of set A, set B and the intersection between them.

This class was inspired by my look through the various similarity 
implementations. All of them except the {{CosineSimilarity}} perform single 
character matching between the input {{CharSequence}}s. The 
{{CosineSimilarity}} tokenises using whitespace to create words.

This more generic type of implementation will allow a user to determine how to 
divide the {{CharSequence}} to create the sets that are compared, e.g. 
single characters, words, bigrams, etc.
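
As an illustration of such alternative divisions, the converters below produce bigram and word sets. This is a sketch only: the {{Converters}} class and its field names are hypothetical, the bigrams are overlapping pairs of adjacent characters, and splitting words on runs of whitespace is an assumption about what a user might want.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical converters; names are illustrative only. */
final class Converters {

    private Converters() {
        // Static utility, no instances.
    }

    /** Overlapping pairs of adjacent characters, e.g. "abc" gives {ab, bc}. */
    static final Function<CharSequence, Set<String>> BIGRAMS = cs -> {
        final Set<String> set = new HashSet<>();
        for (int i = 0; i + 1 < cs.length(); i++) {
            set.add(cs.subSequence(i, i + 2).toString());
        }
        return set;
    };

    /** Distinct words, split on runs of whitespace. */
    static final Function<CharSequence, Set<String>> WORDS = cs -> {
        final Set<String> set = new HashSet<>();
        for (final String word : cs.toString().trim().split("\\s+")) {
            // Splitting an empty string yields one empty token; skip it.
            if (!word.isEmpty()) {
                set.add(word);
            }
        }
        return set;
    };
}
```

Either function has the {{Function<CharSequence, Set<String>>}} shape described above, so it could be supplied at construction in place of the single-character converter.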



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
