[
https://issues.apache.org/jira/browse/TEXT-158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167567#comment-17167567
]
Bruno P. Kinoshita commented on TEXT-158:
-----------------------------------------
>I would vote for an empty string to be perfectly similar to another empty
>string.
>This is in line with other similarity scores in the library, e.g.
>{{JaroWinklerSimilarity}}.
>So this is a bug.
+1 agreed. Let me see how hard it would be to fix this...
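
A minimal sketch of the behaviour agreed on above, not the actual patch. The Jaccard index |A ∩ B| / |A ∪ B| is 0/0 when both sets are empty, so a convention is needed. The sketch assumes the inputs are compared as sets of characters and returns 1.0 for two empty inputs. The class and method names (JaccardSketch, charSet, similarity, distance) are hypothetical and are not part of Commons Text.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; not the Commons Text implementation. */
final class JaccardSketch {

    private JaccardSketch() { }

    /** Collects the distinct characters of a sequence into a set. */
    static Set<Character> charSet(CharSequence cs) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    /** Jaccard similarity over character sets: |A n B| / |A u B|. */
    static double similarity(CharSequence left, CharSequence right) {
        if (left == null || right == null) {
            throw new IllegalArgumentException("Input cannot be null");
        }
        Set<Character> a = charSet(left);
        Set<Character> b = charSet(right);
        // Assumed convention: two empty inputs are identical, so avoid 0/0
        // and report perfect similarity.
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    /** Jaccard distance, defined as the complement of the similarity. */
    static double distance(CharSequence left, CharSequence right) {
        return 1.0 - similarity(left, right);
    }

    public static void main(String[] args) {
        System.out.println(similarity("", ""));    // prints 1.0
        System.out.println(distance("", ""));      // prints 0.0
        System.out.println(similarity("", "abc")); // prints 0.0
    }
}
```

With this convention only the case where both inputs are empty changes. One empty and one non-empty input still gives similarity 0.0 and distance 1.0.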
> Incorrect values for Jaccard similarity with empty strings
> ----------------------------------------------------------
>
> Key: TEXT-158
> URL: https://issues.apache.org/jira/browse/TEXT-158
> Project: Commons Text
> Issue Type: Bug
> Affects Versions: 1.6
> Reporter: Bruno P. Kinoshita
> Priority: Minor
> Fix For: 1.9.1
>
>
> In a discussion that was part of TEXT-126, it was
> [pointed out|https://github.com/apache/commons-text/pull/103#discussion_r263988298]
> that for two empty strings the Jaccard similarity returns 0.0 and the
> distance 1.0, whereas other libraries return the opposite value for each.
> {code:java}
> package br.eti.kinoshita.tests.text;
>
> import java.util.Collections;
>
> public class EditDistances {
>     public static void main(String[] args) {
>         System.out.println("Testing jaccard sim/dis with empty strings");
>         System.out.println("---");
>
>         org.simmetrics.metrics.Jaccard<String> j1 =
>                 new org.simmetrics.metrics.Jaccard<>();
>         float s1 = j1.compare(Collections.emptySet(), Collections.emptySet());
>         System.out.println("Simmetrics Jaccard similarity: " + s1);
>         float d1 = j1.distance(Collections.emptySet(), Collections.emptySet());
>         System.out.println("Simmetrics Jaccard distance: " + d1);
>
>         System.out.println("---");
>
>         info.debatty.java.stringsimilarity.Jaccard j2 =
>                 new info.debatty.java.stringsimilarity.Jaccard();
>         double s2 = j2.similarity("", "");
>         System.out.println("javastringsimilarity Jaccard similarity: " + s2);
>         double d2 = j2.distance("", "");
>         System.out.println("javastringsimilarity Jaccard distance: " + d2);
>
>         System.out.println("---");
>
>         org.apache.commons.text.similarity.JaccardSimilarity j3_1 =
>                 new org.apache.commons.text.similarity.JaccardSimilarity();
>         double s3 = j3_1.apply("", "");
>         System.out.println("commons-text Jaccard similarity: " + s3);
>         org.apache.commons.text.similarity.JaccardDistance j3_2 =
>                 new org.apache.commons.text.similarity.JaccardDistance();
>         double d3 = j3_2.apply("", "");
>         System.out.println("commons-text Jaccard distance: " + d3);
>     }
> }
> {code}
> Produces:
> {noformat}
> Testing jaccard sim/dis with empty strings
> ---
> Simmetrics Jaccard similarity: 1.0
> Simmetrics Jaccard distance: 0.0
> ---
> javastringsimilarity Jaccard similarity: 1.0
> javastringsimilarity Jaccard distance: 0.0
> ---
> commons-text Jaccard similarity: 0.0
> commons-text Jaccard distance: 1.0{noformat}
> We need to confirm the correct output for similarity and distance with
> empty strings, and then either document why we return the current values
> or fix it as a bug for the next release.