kinow commented on issue #109: TEXT-155: Add a generic IntersectionSimilarity measure
URL: https://github.com/apache/commons-text/pull/109#issuecomment-471143821

>The new API using Collection is done. The class can now support duplicates.
>
>I have added a test to show the class can produce the same result as a case insensitive word bigram algorithm documented here: How to Strike a Match.

Does that mean the code in this pull request can be used to calculate the Jaccard index, the F1 score/Sørensen-Dice coefficient, and Sørensen-Dice with bigrams? If so, we can decide later what to do with #103.

>Note: Somewhere between switching computers the git history broke and causes a conflict when trying to rebase. It is only 4 files so when finished (merge or not) I'll drop the branch and redo with the final files.

All good :+1: Added a few comments.

Thanks for the link to the _How to Strike a Match_ article. Very interesting! For that problem, the only solution I know of at the moment uses a more complete NLP library and something like [word embedding](https://en.wikipedia.org/wiki/Word_embedding) combined with a machine learning algorithm that has to be trained (which requires a lot of data, and still gives weird results). Having an edit distance that does something similar sounds quite useful for prototyping, or even as a simpler solution.

Recently (digressing a bit) I needed to remove contractions in Python, and the best out-of-the-box solution I found was [pycontractions](https://github.com/ian-beaver/pycontractions), which does not scale well to thousands or millions of requests (maybe not even hundreds) and takes a long time to initialize some models. There are also some catches with one of its approaches, which matches regexes, for things like _U.S._. So I had to build a simpler solution for my own case, but that won't work in other projects.
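
To make the question about the Jaccard index and Sørensen-Dice concrete, here is a minimal sketch of how a single multiset intersection count can feed both measures, using the per-word character bigrams from the _How to Strike a Match_ example. This is not the API from this pull request: the class `IntersectionSketch` and its methods `wordBigrams`, `intersectionSize`, `dice` and `jaccard` are hypothetical names made up for illustration, and returning 1.0 for two empty inputs is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the IntersectionSimilarity class from the PR.
final class IntersectionSketch {

    /** Lower-cases the text, splits it on whitespace, and collects the character bigrams of each word. */
    static List<String> wordBigrams(String text) {
        List<String> bigrams = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            for (int i = 0; i + 1 < word.length(); i++) {
                bigrams.add(word.substring(i, i + 2));
            }
        }
        return bigrams;
    }

    /** Multiset intersection size: a duplicated element counts min(countInA, countInB) times. */
    static int intersectionSize(List<String> a, List<String> b) {
        Map<String, Integer> remaining = new HashMap<>();
        for (String s : a) {
            remaining.merge(s, 1, Integer::sum);
        }
        int intersection = 0;
        for (String s : b) {
            Integer count = remaining.get(s);
            if (count != null && count > 0) {
                remaining.put(s, count - 1); // consume one occurrence from a
                intersection++;
            }
        }
        return intersection;
    }

    /** Sørensen-Dice (F1): 2|A∩B| / (|A| + |B|). Two empty inputs score 1.0 by assumption. */
    static double dice(int sizeA, int sizeB, int intersection) {
        return sizeA + sizeB == 0 ? 1.0 : 2.0 * intersection / (sizeA + sizeB);
    }

    /** Jaccard index: |A∩B| / |A∪B|, with |A∪B| = |A| + |B| - |A∩B|. */
    static double jaccard(int sizeA, int sizeB, int intersection) {
        int union = sizeA + sizeB - intersection;
        return union == 0 ? 1.0 : (double) intersection / union;
    }

    public static void main(String[] args) {
        List<String> a = wordBigrams("FRANCE");
        List<String> b = wordBigrams("french");
        int shared = intersectionSize(a, b);
        System.out.println("dice    = " + dice(a.size(), b.size(), shared));
        System.out.println("jaccard = " + jaccard(a.size(), b.size(), shared));
    }
}
```

For "FRANCE" and "french" the shared bigrams are `fr` and `nc`, so the sketch gives a Dice score of 2·2/10 = 0.4 and a Jaccard index of 2/8 = 0.25. Both scores come from the same three numbers (the two sizes and the intersection), which is why one intersection-based class could plausibly cover both.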
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
