Hi Benedikt,

After playing more with [text] and some edit distances, I think we can resume this conversation and hopefully fix SANDBOX-488 [1]. I've created a branch SANDBOX-488 in git [2] with the following modifications:
* The StringMetric interface has been renamed to EditDistance
* We have the following edit distances available: Levenshtein, JaroWinkler, Hamming ([lang]) and Cosine. Others might be added in the future, such as Jaccard and QGram
* When an edit distance returns 0, it means both strings are identical or at least very similar. Conversely, returning 1, or higher values, means that the strings are less close to each other
* There are other classes that can be used for text similarity, such as the FuzzyScore ([lang]), and the CosineSimilarity (used by the Cosine edit distance). Others might be added later, such as the Jaccard Index. The behaviour of each of these classes varies

I think it is simpler, and users will quickly understand the API. Once one understands what an edit distance is, s/he can guess the behaviour of any of its implementations.

What do you think? If you agree I'd like to merge the branch and fix the issue.

TL;DR: the similarity package contains code to work on text similarity, such as edit distances, but also scores / indexes and other algorithms. The StringMetric interface has been renamed to EditDistance, and only edit distances implement it

TIA
Bruno

[1] https://issues.apache.org/jira/browse/SANDBOX-488
[2] https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=tree;f=src/main/java/org/apache/commons/text/similarity;h=a2de9f0196b543f50c6d2c28376feb311f46eeda;hb=refs/heads/SANDBOX-488

From: Benedikt Ritter <brit...@apache.org>
To: Commons Developers List <dev@commons.apache.org>; Bruno P. Kinoshita <brunodepau...@yahoo.com.br>
Sent: Friday, December 19, 2014 2:35 AM
Subject: Re: [TEXT] Distance vs. Metric vs. Similarity

2014-12-14 23:10 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:

> Sounds good, although I'm not sure I understand where you are going with
> the marker interface. What is its purpose?

Let's then keep the StringMetric interface and update its Javadoc.
Thinking again, that other marker interface seems to be unnecessary.

> Okay, but we need to make sure all algorithms really return a distance
> then. As I said, FuzzyDistance currently really returns a similarity score.
> An algorithm returning a distance should return a higher number for higher
> distances.

I had a look at the code, and I think I understand what you are saying now. In FuzzyDistance, the higher the score, the closer the strings are. This differs from what the other algorithms return.

I believe I found why I named that package similarity. Probably it was because I saw that in the stringmetric library [1]. There, Levenshtein, Jaccard and other algorithms are suffixed with "Metric".

How about we keep the package as similarity and simply rename the classes to [Algo]Metric too? This way we will be able to accommodate other metrics such as the Sorensen-Dice coefficient, where the higher the coefficient, the more similar two strings are.

WDYT?

Hey Bruno,

yes we can do it that way. What I want to avoid is that users have to check the JavaDoc every time they use an algorithm. To me it would make sense to have a number of distance algorithms and they all return a distance. Or we have Similarity algorithms and they all return a similarity. That way users can swap out the underlying algorithms without changing their code.

Benedikt

Cheers
Bruno

[1] https://github.com/rockymadden/stringmetric

From: Benedikt Ritter <brit...@apache.org>
To: Commons Developers List <dev@commons.apache.org>; Bruno P. Kinoshita <brunodepau...@yahoo.com.br>
Sent: Sunday, December 14, 2014 6:45 PM
Subject: Re: [TEXT] Distance vs. Metric vs. Similarity

Hi Bruno,

2014-12-14 21:37 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:

> Hello Benedikt!
>
> > Metric feels like it's something more general, but I'm not sure.
>
> You're right. Metric was supposed to be a general interface,
> representing the String Metric from the Wikipedia article.
> > and the interface from StringMetric to StringDistance.
>
> I'm reading the Myers paper, and already have a local branch with the
> Myers algorithm from [collections] ported to [text].
> Perhaps we could move the StringMetric interface to o.a.c.text package,
> and create StringDistance or EditDistance interface in o.a.c.text.distance.
> This way we can have String Metrics as in Wikipedia, as being a way of
> giving a value for comparing two strings. We would have the edit distances
> in the distance package, and the diff algorithms in another diff package.
> All of them being String Metrics.
> What do you think?

Sounds good, although I'm not sure I understand where you are going with the marker interface. What is its purpose?

> > I think we should consider renaming everything to distance, since the
> > implemented algorithms all end in *Distance. So we would change the
> > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > interface from StringMetric to StringDistance.
> >
> > Looking at the code again, it seems like the algorithms all really return a
> > similarity score and not a distance. For example, the FuzzyDistance JavaDoc
> > states: "A higher score indicates a higher similarity". If this is the case,
> > maybe it makes more sense to rename everything to Similarity?
>
> I'm in favor of dropping score and similarity, and adopting distance in
> the package, classes and javadocs, as it is used in other tools (e.g. Solr,
> Talend, Informatica IIR, etc).

Okay, but we need to make sure all algorithms really return a distance then. As I said, FuzzyDistance currently really returns a similarity score. An algorithm returning a distance should return a higher number for higher distances.

Benedikt

> All the best,
> Bruno
>
> From: Benedikt Ritter <brit...@apache.org>
> To: Commons Developers List <dev@commons.apache.org>
> Sent: Sunday, December 14, 2014 6:20 PM
> Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
>
> 2014-12-14 21:08 GMT+01:00 Benedikt Ritter <brit...@apache.org>:
>
> > Hi,
> >
> > currently the wording in commons text is a bit confusing. We have the
> > three terms:
> >
> > - distance
> > - similarity
> > - metric
> >
> > Distance and similarity seem to be just opposites of the same thing. A
> > great distance indicates a small similarity between two character
> > sequences. Metric feels like it's something more general, but I'm not
> > sure.
> >
> > I think we should consider renaming everything to distance, since the
> > implemented algorithms all end in *Distance. So we would change the
> > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > interface from StringMetric to StringDistance.
>
> Looking at the code again, it seems like the algorithms all really return a
> similarity score and not a distance. For example, the FuzzyDistance JavaDoc
> states: "A higher score indicates a higher similarity". If this is the case,
> maybe it makes more sense to rename everything to Similarity?
>
> > WDYT?
> >
> > Benedikt
> >
> > --
> > http://people.apache.org/~britter/
> > http://www.systemoutprintln.de/
> > http://twitter.com/BenediktRitter
> > http://github.com/britter
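
To make the contract discussed in this thread concrete (an edit distance returns 0 for identical strings and larger values as the strings get further apart, so implementations can be swapped without changing the calling code), here is a minimal Java sketch. It is not the code from the SANDBOX-488 branch: the int-returning EditDistance signature and the class names HammingSketch, LevenshteinSketch and EditDistanceSketch are assumptions made for illustration only.

```java
import java.util.Objects;

/** Hypothetical contract: 0 means identical; larger values mean further apart. */
interface EditDistance {
    int apply(CharSequence left, CharSequence right);
}

/** Hamming distance: number of positions at which two equal-length inputs differ. */
final class HammingSketch implements EditDistance {
    @Override
    public int apply(CharSequence left, CharSequence right) {
        Objects.requireNonNull(left, "left");
        Objects.requireNonNull(right, "right");
        if (left.length() != right.length()) {
            throw new IllegalArgumentException("inputs must have the same length");
        }
        int distance = 0;
        for (int i = 0; i < left.length(); i++) {
            if (left.charAt(i) != right.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }
}

/** Levenshtein distance: minimum number of single-character insertions, deletions and substitutions. */
final class LevenshteinSketch implements EditDistance {
    @Override
    public int apply(CharSequence left, CharSequence right) {
        Objects.requireNonNull(left, "left");
        Objects.requireNonNull(right, "right");
        // Two rolling rows of the dynamic-programming table.
        int[] previous = new int[right.length() + 1];
        int[] current = new int[right.length() + 1];
        for (int j = 0; j <= right.length(); j++) {
            previous[j] = j; // distance from the empty prefix of left
        }
        for (int i = 1; i <= left.length(); i++) {
            current[0] = i; // distance to the empty prefix of right
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[right.length()];
    }
}

final class EditDistanceSketch {
    public static void main(String[] args) {
        // The caller depends only on the interface, so algorithms are interchangeable.
        EditDistance distance = new LevenshteinSketch();
        System.out.println(distance.apply("kitten", "sitting")); // prints 3
        distance = new HammingSketch();
        System.out.println(distance.apply("karolin", "kathrin")); // prints 3
    }
}
```

Because both classes follow the same "higher means further apart" rule, the caller in main only changes which implementation it instantiates. A similarity score such as FuzzyScore would not fit this interface, since there a higher value means a closer match.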