[jira] [Work logged] (TEXT-155) Add a generic SetSimilarity measure

ASF GitHub Bot (JIRA) Fri, 08 Mar 2019 20:18:00 -0800


     [ 
https://issues.apache.org/jira/browse/TEXT-155?focusedWorklogId=210455&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-210455
 ]


ASF GitHub Bot logged work on TEXT-155:
---------------------------------------

                Author: ASF GitHub Bot
            Created on: 09/Mar/19 04:16
            Start Date: 09/Mar/19 04:16
    Worklog Time Spent: 10m 
      Work Description: kinow commented on issue #109: TEXT-155: Add a generic 
IntersectionSimilarity measure
URL: https://github.com/apache/commons-text/pull/109#issuecomment-471143821
 
 
   >The new API using Collection is done. The class can now support duplicates.
   >
   >I have added a test to show the class can produce the same result as a case 
insensitive word bigram algorithm documented here: How to Strike a Match.
   
   Does this mean that the code in this pull request can be used to calculate 
the Jaccard index, the F1 score/Sørensen-Dice coefficient, and Sørensen-Dice 
with bigrams? If so, we can decide later what to do with #103.
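
   For reference, both measures can be derived from the three sizes that an 
intersection result carries (size of A, size of B, size of the intersection). 
The following is a minimal sketch using only the JDK, not the code from the 
pull request. The class name `OverlapMetrics`, its method names, and the choice 
to return 1.0 for two empty sets are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Commons Text.
final class OverlapMetrics {

    /** Jaccard index: |A ∩ B| / |A ∪ B|, where |A ∪ B| = |A| + |B| - |A ∩ B|. */
    static double jaccard(int sizeA, int sizeB, int intersection) {
        int union = sizeA + sizeB - intersection;
        // Assumption: two empty sets are treated as identical.
        return union == 0 ? 1.0 : (double) intersection / union;
    }

    /** Sorensen-Dice coefficient (F1-score): 2|A ∩ B| / (|A| + |B|). */
    static double dice(int sizeA, int sizeB, int intersection) {
        int total = sizeA + sizeB;
        // Assumption: two empty sets are treated as identical.
        return total == 0 ? 1.0 : 2.0 * intersection / total;
    }

    /** Collects the distinct characters of the input. */
    static Set<Character> chars(CharSequence cs) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    public static void main(String[] args) {
        Set<Character> a = chars("night"); // {n, i, g, h, t}
        Set<Character> b = chars("nacht"); // {n, a, c, h, t}
        Set<Character> common = new HashSet<>(a);
        common.retainAll(b);               // {n, h, t}
        System.out.println(jaccard(a.size(), b.size(), common.size())); // 3/7
        System.out.println(dice(a.size(), b.size(), common.size()));    // 6/10
    }
}
```

   Swapping the character sets for bigram sets gives the bigram variant of 
Sørensen-Dice without changing either formula.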
   
   >Note: Somewhere between switching computers the git history broke and 
causes a conflict when trying to rebase. It is only 4 files so when finished 
(merge or not) I'll drop the branch and redo with the final files.
   
   All good :+1: 
   
   Added a few comments. Thanks for the link to the _How to Strike a Match_ 
article. Very interesting! For that problem, the only solution I currently 
know of uses a more complete NLP library and something like [word 
embedding](https://en.wikipedia.org/wiki/Word_embedding) combined with a 
machine learning algorithm to train (which requires a lot of data, and still 
gives weird results). Having an edit distance that does something similar 
sounds quite useful for prototyping or even as a simpler solution.
   
   Recently - digressing - I needed to remove contractions in Python, and the 
best out-of-the-box solution I found was 
[pycontractions](https://github.com/ian-beaver/pycontractions), which does not 
scale well to thousands or millions of requests (maybe not even hundreds) and 
takes a long time to initialize some models. Plus there are some catches with 
one of its approaches for matching regexes for things like _U.S._. So I had to 
build a simpler solution for my own case, but that won't work in other projects.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 210455)
    Time Spent: 3h  (was: 2h 50m)

> Add a generic SetSimilarity measure
> -----------------------------------
>
>                 Key: TEXT-155
>                 URL: https://issues.apache.org/jira/browse/TEXT-155
>             Project: Commons Text
>          Issue Type: New Feature
>    Affects Versions: 1.6
>            Reporter: Alex D Herbert
>            Priority: Minor
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> The {{SimilarityScore<T>}} interface can be used to compute a generic result. 
> I propose to add a class that can compute the intersection between two sets 
> formed from the characters. The sets must be formed from the {{CharSequence}} 
> input to the {{apply}} method using a {{Function<CharSequence, Set<T>>}} to 
> convert the {{CharSequence}}. This function can be passed to the 
> {{SimilarityScore<T>}} during construction.
> The result can then report the size of each set and the size of their 
> intersection.
> I have created an implementation that can compute the equivalent of the 
> {{JaccardSimilarity}} class by creating {{Set<Character>}} and also the 
> F1-score using bigrams (pairs of characters) by creating {{Set<String>}}. 
> This relates to 
> [TEXT-126|https://issues.apache.org/jira/projects/TEXT/issues/TEXT-126] which 
> suggested an algorithm for the Sorensen-Dice similarity, also known as the 
> F1-score.
> Here is an example:
> {code:java}
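> import java.util.HashSet;
> import java.util.Set;
> import java.util.function.Function;
>
> // Match the functionality of the JaccardSimilarity class:
> // convert each input into the set of its distinct characters
> Function<CharSequence, Set<Character>> converter = (cs) -> {
>     final Set<Character> set = new HashSet<>();
>     for (int i = 0; i < cs.length(); i++) {
>         set.add(cs.charAt(i));
>     }
>     return set;
> };
> // The converter is supplied at construction and applied to both inputs
> IntersectionSimilarity<Character> similarity = new IntersectionSimilarity<>(converter);
> IntersectionResult result = similarity.apply("something", "something else");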
> {code}
> The result has the size of set A, set B and the intersection between them.
> This class was inspired by my look through the various similarity 
> implementations. All of them except the {{CosineSimilarity}} perform single 
> character matching between the input {{CharSequence}}s. The 
> {{CosineSimilarity}} tokenises using whitespace to create words.
> This more generic type of implementation will allow a user to determine how 
> to divide the {{CharSequence}} to create the sets that are compared, e.g. 
> single characters, words, bigrams, etc.
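
To illustrate the description above, here is a minimal JDK-only sketch of two
alternative converters (character bigrams and whitespace-separated words) and
of the set sizes they produce. It does not use the proposed
IntersectionSimilarity class. The names ConverterSketch, BIGRAMS, WORDS and
sizes are hypothetical and were chosen only for this example.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration, not part of Commons Text.
final class ConverterSketch {

    /** Splits the input into its distinct character bigrams. */
    static final Function<CharSequence, Set<String>> BIGRAMS = cs -> {
        Set<String> set = new HashSet<>();
        for (int i = 0; i + 1 < cs.length(); i++) {
            set.add(cs.subSequence(i, i + 2).toString());
        }
        return set;
    };

    /** Splits the input into its distinct whitespace-separated words. */
    static final Function<CharSequence, Set<String>> WORDS = cs -> {
        Set<String> set = new HashSet<>();
        for (String word : cs.toString().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                set.add(word);
            }
        }
        return set;
    };

    /** Returns {|A|, |B|, |A ∩ B|} for the sets built by the converter. */
    static <T> int[] sizes(Function<CharSequence, Set<T>> converter,
                           CharSequence left, CharSequence right) {
        Set<T> a = converter.apply(left);
        Set<T> b = converter.apply(right);
        Set<T> common = new HashSet<>(a);
        common.retainAll(b); // keep only the elements present in both sets
        return new int[] {a.size(), b.size(), common.size()};
    }

    public static void main(String[] args) {
        int[] byWord = sizes(WORDS, "something", "something else");
        System.out.println(byWord[0] + " " + byWord[1] + " " + byWord[2]);
        int[] byBigram = sizes(BIGRAMS, "night", "nacht");
        System.out.println(byBigram[0] + " " + byBigram[1] + " " + byBigram[2]);
    }
}
```

Only the converter changes between the two calls; the intersection logic is
the same, which is the point of making the splitting step a parameter.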



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
