Re: [TEXT] Distance vs. Metric vs. Similarity

Bruno P. Kinoshita Wed, 15 Apr 2015 03:19:33 -0700

Hi Benedikt,
After playing more with [text] and some edit distances, I think we can resume 
this conversation and hopefully fix SANDBOX-488 [1].
I've created a branch SANDBOX-488 in git [2] with the following modifications:


* The StringMetric interface has been renamed to EditDistance
* We have the following edit distances available: Levenshtein, JaroWinkler, 
Hamming ([lang]) and Cosine. Others might be added in the future, such as 
Jaccard and QGram
* When an edit distance returns 0, it means both strings are identical or at 
least very similar. Conversely, returning 1 or a higher value means that the 
strings are less close to each other
* There are other classes that can be used for text similarity, such as the 
FuzzyScore ([lang]), and the CosineSimilarity (used by the Cosine edit 
distance). Others might be added later, such as the Jaccard Index. The 
behaviour of each of these classes varies
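
To illustrate the contract described in the list above (0 for identical strings, 
larger values for strings that are further apart), here is a minimal sketch. 
It is not the code in the SANDBOX-488 branch: the method name `distance`, the 
`int` return type and the class name `LevenshteinSketch` are assumptions made 
for this example only.

```java
// Hypothetical shape of the interface; the real signature in the branch may differ.
interface EditDistance {
    // Returns 0 for identical inputs; larger values mean the inputs are further apart.
    int distance(CharSequence left, CharSequence right);
}

// Plain two-row dynamic-programming Levenshtein distance, for illustration only.
final class LevenshteinSketch implements EditDistance {
    @Override
    public int distance(CharSequence left, CharSequence right) {
        int[] prev = new int[right.length() + 1];
        int[] curr = new int[right.length() + 1];
        // Turning the empty prefix of left into right[0..j) takes j insertions.
        for (int j = 0; j <= right.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= left.length(); i++) {
            curr[0] = i; // i deletions turn left[0..i) into the empty string
            for (int j = 1; j <= right.length(); j++) {
                int cost = left.charAt(i - 1) == right.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[right.length()];
    }
}

class EditDistanceDemo {
    public static void main(String[] args) {
        EditDistance levenshtein = new LevenshteinSketch();
        System.out.println(levenshtein.distance("kitten", "kitten"));  // prints 0
        System.out.println(levenshtein.distance("kitten", "sitting")); // prints 3
    }
}
```

A distance such as Cosine would presumably return a fractional value instead 
of an `int`, so a generic result type may be a better fit in practice.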

I think it is simpler, and users will quickly understand the API. Once one 
understands what an edit distance is, s/he can guess the behaviour of any of 
its implementations.

What do you think? If you agree I'd like to merge the branch and fix the issue.

TL;DR: the similarity package contains code to work on text similarity, such as 
edit distances, but also scores / indexes and other algorithms. The 
StringMetric interface has been renamed to EditDistance, and only edit 
distances implement it.

TIA
Bruno

[1] https://issues.apache.org/jira/browse/SANDBOX-488
[2] https://git1-us-west.apache.org/repos/asf?p=commons-text.git;a=tree;f=src/main/java/org/apache/commons/text/similarity;h=a2de9f0196b543f50c6d2c28376feb311f46eeda;hb=refs/heads/SANDBOX-488
 

      From: Benedikt Ritter <brit...@apache.org>
 To: Commons Developers List <dev@commons.apache.org>; Bruno P. Kinoshita 
<brunodepau...@yahoo.com.br> 
 Sent: Friday, December 19, 2014 2:35 AM
 Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
   


2014-12-14 23:10 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
> Sounds good, although I'm not sure I understand where you are going with
> the marker interface. What is its purpose?

Let's then keep the StringMetric interface and update its Javadoc. Thinking 
again, that other marker interface seems to be unnecessary.

> Okay, but we need to make sure all algorithms really return a distance
> then. As I said, FuzzyDistance currently really returns a similarity score.
> An algorithm returning a distance should return a higher number for higher
> distances.

I had a look at the code, and I think I understand what you are saying now. In 
FuzzyDistance, the higher the score, the closer the strings are. That differs 
from what the other algorithms return.

I believe I found why I named that package similarity. Probably it was because 
I saw that in the stringmetric library [1]. There, Levenshtein, Jaccard and 
other algorithms are suffixed with "Metric".
How about we keep the package as similarity and simply rename the classes to 
[Algo]Metric too? This way we will be able to accommodate other metrics such as 
the Sorensen-Dice coefficient, where the higher the coefficient, the more 
similar two strings are.
WDYT?



Hey Bruno,
yes, we can do it that way. What I want to avoid is that users have to 
check the JavaDoc every time they use an algorithm. To me it would make sense 
to have a number of distance algorithms and they all return a distance. Or we 
have Similarity algorithms and they all return a similarity. That way users can 
swap out the underlying algorithms without changing their code.
Benedikt 
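
The point about swapping algorithms can be sketched as follows. This is an 
illustration under stated assumptions, not code from [text]: the interface 
name `StringDistance` is the one proposed earlier in this thread, while the 
method name `distance`, the class `DistanceSwapSketch` and the helper 
`closest` are hypothetical. Because both implementations promise "lower means 
closer", the calling code does not change when the algorithm does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical common contract: a higher result means the strings are further apart.
interface StringDistance {
    int distance(CharSequence left, CharSequence right);
}

final class DistanceSwapSketch {

    // Hamming distance: number of differing positions; inputs must have equal length.
    static int hamming(CharSequence a, CharSequence b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal lengths");
        }
        int differences = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                differences++;
            }
        }
        return differences;
    }

    // Levenshtein distance using two rolling rows.
    static int levenshtein(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Caller code relies only on "lower is closer", so any StringDistance can be plugged in.
    static String closest(String query, List<String> candidates, StringDistance metric) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : candidates) {
            int d = metric.distance(query, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("karolin", "kathrin", "kerstin");
        // Same call site, different algorithm.
        System.out.println(closest("kathrim", names, DistanceSwapSketch::hamming));     // prints kathrin
        System.out.println(closest("kathrim", names, DistanceSwapSketch::levenshtein)); // prints kathrin
    }
}
```

If one of the implementations returned a similarity score instead, `closest` 
would silently pick the worst match, which is the situation the paragraph 
above wants to avoid.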
Cheers
Bruno
[1] https://github.com/rockymadden/stringmetric



      From: Benedikt Ritter <brit...@apache.org>
 To: Commons Developers List <dev@commons.apache.org>; Bruno P. Kinoshita 
<brunodepau...@yahoo.com.br>
 Sent: Sunday, December 14, 2014 6:45 PM
 Subject: Re: [TEXT] Distance vs. Metric vs. Similarity

Hi Bruno,



2014-12-14 21:37 GMT+01:00 Bruno P. Kinoshita <brunodepau...@yahoo.com.br>:
>
> Hello Benedikt!
> > Metric feels like it's something more general, but I'm not sure.
> You're right. Metric was supposed to be a general interface,
> representing the String Metric from the Wikipedia article.
> >  and the interface from StringMetric to StringDistance.
> I'm reading the Myers paper, and already have a local branch with the
> Myers algorithm from [collections] ported to [text].
> Perhaps we could move the StringMetric interface to o.a.c.text package,
> and create StringDistance or EditDistance interface in o.a.c.text.distance.
> This way we can have String Metrics as in Wikipedia, as being a way of
> giving a value for comparing two strings. We would have the edit distances
> in the distance package, and the diff algorithms in another diff package.
> All of them being String Metrics.
> What do you think?
>

Sounds good, although I'm not sure I understand where you are going with
the marker interface. What is its purpose?


> > > I think we should consider renaming everything to distance, since the
> > > implemented algorithms all end on *Distance. So we would change the
> > > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > > interface from StringMetric to StringDistance.
> >
> > Looking at the code again, it seems like the algorithms all really
> > return a similarity score and not a distance. For example FuzzyDistance
> > JavaDoc states: "A higher score indicates a higher similarity". If this
> > is the case, maybe it makes more sense to rename everything to Similarity?
> I'm in favor of dropping score and similarity, and adopting distance in
> the package, classes and javadocs, as it is used in other tools (e.g. Solr,
> Talend, Informatica IIR, etc).
>

Okay, but we need to make sure all algorithms really return a distance
then. As I said, FuzzyDistance currently really returns a similarity score.
An algorithm returning a distance should return a higher number for higher
distances.

Benedikt


> All the best,
> Bruno
>
>
>      From: Benedikt Ritter <brit...@apache.org>
>  To: Commons Developers List <dev@commons.apache.org>
>  Sent: Sunday, December 14, 2014 6:20 PM
>  Subject: Re: [TEXT] Distance vs. Metric vs. Similarity
>
> 2014-12-14 21:08 GMT+01:00 Benedikt Ritter <brit...@apache.org>:
> >
> > Hi,
> >
> > currently the wording in commons text is a bit confusing. We have the
> > three terms:
> >
> > - distance
> > - similarity
> > - metric
> >
> > Distance and similarity seem to be just opposites of the same thing. A
> > great distance indicates a small similarity between two character
> > sequences. Metric feels like it's something more general, but I'm not
> > sure.
> >
> > I think we should consider renaming everything to distance, since the
> > implemented algorithms all end on *Distance. So we would change the
> > package name from o.a.c.text.similarity to o.a.c.text.distance and the
> > interface from StringMetric to StringDistance.
> >
>
> Looking at the code again, it seems like the algorithms all really return a
> similarity score and not a distance. For example FuzzyDistance JavaDoc
> states: "A higher score indicates a higher similarity". If this is the case,
> maybe it makes more sense to rename everything to Similarity?
>
>
> >
> > WDYT?
> >
> > Benedikt
> >
> > --
> > http://people.apache.org/~britter/
> > http://www.systemoutprintln.de/
> > http://twitter.com/BenediktRitter
> > http://github.com/britter



