[ 
https://issues.apache.org/jira/browse/TEXT-157?focusedWorklogId=210107&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-210107
 ]

ASF GitHub Bot logged work on TEXT-157:
---------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Mar/19 12:16
            Start Date: 08/Mar/19 12:16
    Worklog Time Spent: 10m 
      Work Description: aherbert commented on pull request #111: TEXT-157: Remove rounding from JaccardSimilarity and Distance
URL: https://github.com/apache/commons-text/pull/111#discussion_r263758068
 
 

 ##########
 File path: src/test/java/org/apache/commons/text/similarity/JaccardDistanceTest.java
 ##########
 @@ -36,21 +36,23 @@ public static void setUp() {
 
     @Test
     public void testGettingJaccardDistance() {
-        assertEquals(1.00d, classBeingTested.apply("", ""), 0.00000000000000000001d);
-        assertEquals(1.00d, classBeingTested.apply("left", ""), 0.00000000000000000001d);
-        assertEquals(1.00d, classBeingTested.apply("", "right"), 0.00000000000000000001d);
-        assertEquals(0.25d, classBeingTested.apply("frog", "fog"), 0.00000000000000000001d);
-        assertEquals(1.00d, classBeingTested.apply("fly", "ant"), 0.00000000000000000001d);
-        assertEquals(0.78d, classBeingTested.apply("elephant", "hippo"), 0.00000000000000000001d);
-        assertEquals(0.36d, classBeingTested.apply("ABC Corporation", "ABC Corp"), 0.00000000000000000001d);
-        assertEquals(0.24d, classBeingTested.apply("D N H Enterprises Inc", "D & H Enterprises, Inc."),
-                0.00000000000000000001d);
-        assertEquals(0.11d, classBeingTested.apply("My Gym Children's Fitness Center", "My Gym. Childrens Fitness"),
-                0.00000000000000000001d);
-        assertEquals(0.10d, classBeingTested.apply("PENNSYLVANIA", "PENNCISYLVNIA"), 0.00000000000000000001d);
-        assertEquals(0.87d, classBeingTested.apply("left", "right"), 0.00000000000000000001d);
-        assertEquals(0.87d, classBeingTested.apply("leettteft", "ritttght"), 0.00000000000000000001d);
-        assertEquals(0.0d, classBeingTested.apply("the same string", "the same string"), 0.00000000000000000001d);
+        // Results generated using the python distance library using:
+        // distance.jaccard(seq1, seq2)
+        assertEquals(1.0, classBeingTested.apply("", ""));
+        assertEquals(1.0, classBeingTested.apply("left", ""));
+        assertEquals(1.0, classBeingTested.apply("", "right"));
+        assertEquals(0.25, classBeingTested.apply("frog", "fog"));
+        assertEquals(1.0, classBeingTested.apply("fly", "ant"));
+        assertEquals(0.7777777777777778, classBeingTested.apply("elephant", "hippo"));
 
 Review comment:
   I understand where you are coming from. However, the test currently passes
   because the underlying math is `intersection / union` with integers cast to
   double. So the value can be exact, and there is not really a way to get it
   with less precision unless the algorithm is wrong.
   
   I can update the test to use an epsilon of 1e-10 for all values that are not
   expected to be 0.0 or 1.0. However, that could lead someone reading the test
   to think the output value is subject to imprecision when it is not.
   
   How about a comment in the test explaining where each value comes from, or
   even the actual computation:
   
   ```
   assertEquals(0.7777777777777778, classBeingTested.apply("elephant", "hippo"));
   // becomes
   assertEquals(7.0 / 9.0, classBeingTested.apply("elephant", "hippo"));
   ```
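
To make the exactness argument concrete, here is a minimal sketch, assuming the distance is computed as one minus the ratio of two integer set sizes. `JaccardSketch` and its methods are hypothetical names for illustration, not the Commons Text classes, and the handling of empty input is an assumption chosen to match the expected value of 1.0 for `("", "")` in the diff above. For "elephant" and "hippo" the character sets share 2 of 9 distinct characters, so the distance is 7/9.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch for illustration; not the Commons Text implementation.
class JaccardSketch {

    /** Collects the distinct characters of a sequence. */
    static Set<Character> chars(CharSequence s) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < s.length(); i++) {
            set.add(s.charAt(i));
        }
        return set;
    }

    /** Jaccard similarity: |intersection| / |union| of the character sets. */
    static double similarity(CharSequence left, CharSequence right) {
        Set<Character> a = chars(left);
        Set<Character> b = chars(right);
        Set<Character> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            // Assumption: two empty inputs have similarity 0.0,
            // which gives the distance of 1.0 expected by the test.
            return 0.0;
        }
        Set<Character> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        // Integer counts cast to double, followed by a single division.
        return (double) intersection.size() / union.size();
    }

    /** Jaccard distance: 1 - similarity. */
    static double distance(CharSequence left, CharSequence right) {
        return 1.0 - similarity(left, right);
    }

    public static void main(String[] args) {
        // Compare the computed distance against the literal computation.
        System.out.println(distance("elephant", "hippo") == 7.0 / 9.0);
        System.out.println(distance("frog", "fog") == 0.25);
    }
}
```

Because the inputs to the division are small integers, the result is fully determined by the two counts, which is why an expected value written as `7.0 / 9.0` documents where the number comes from.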
   
 


Issue Time Tracking
-------------------

    Worklog Id:     (was: 210107)
    Time Spent: 40m  (was: 0.5h)

> Remove rounding from JaccardSimilarity and Distance to improve ranking
> ----------------------------------------------------------------------
>
>                 Key: TEXT-157
>                 URL: https://issues.apache.org/jira/browse/TEXT-157
>             Project: Commons Text
>          Issue Type: Improvement
>    Affects Versions: 1.6
>            Reporter: Alex D Herbert
>            Assignee: Alex D Herbert
>            Priority: Trivial
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> {{JaccardSimilarity}} rounds its result to 2 decimal places. This prevents 
> ranking of dissimilar sequences of even moderately short length.
> For example, using sequences that have 1 or 2 characters in common and whose 
> remaining characters are all different:
> {noformat}
>  2 0.500000 1.000000 : aa vs (ab or aa)
>  3 0.250000 0.330000 : aaD vs (abd or aaÀ)
>  4 0.170000 0.200000 : aaDE vs (abde or aaÀÁ)
>  5 0.130000 0.140000 : aaDEF vs (abdef or aaÀÁÂ)
>  6 0.100000 0.110000 : aaDEFG vs (abdefg or aaÀÁÂÃ)
>  7 0.080000 0.090000 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
>  8 0.070000 0.080000 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
>  9 0.060000 0.070000 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
> 10 0.060000 0.060000 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
> {noformat}
> Without rounding, the scores differ where rounding previously produced the 
> same score. This will improve ranking:
> {noformat}
>  2 0.500000 1.000000 : aa vs (ab or aa)
>  3 0.250000 0.333333 : aaD vs (abd or aaÀ)
>  4 0.166667 0.200000 : aaDE vs (abde or aaÀÁ)
>  5 0.125000 0.142857 : aaDEF vs (abdef or aaÀÁÂ)
>  6 0.100000 0.111111 : aaDEFG vs (abdefg or aaÀÁÂÃ)
>  7 0.083333 0.090909 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
>  8 0.071429 0.076923 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
>  9 0.062500 0.066667 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
> 10 0.055556 0.058824 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
> {noformat}
>  Generated using:
> {code:java}
> @Test
> public void roundingDemo() {
>     // First character of each dissimilar sequence.
>     // Chosen for a nice output where we already know the loop
>     // will exit before sequence overlap.
>     char ch1 = 'D';
>     char ch2 = 'd';
>     char ch3 = 0x00c0;
>     // 1 or 2 characters in common
>     StringBuilder sb1 = new StringBuilder("aa");
>     StringBuilder sb2 = new StringBuilder("ab"); // 1 in common
>     StringBuilder sb3 = new StringBuilder("aa"); // 2 in common
>     JaccardSimilarity similarity = new JaccardSimilarity();
>     // Extend the sequences until a single/double character 
>     // similarity cannot be detected
>     double j1, j2;
>     do  {
>         j1 = similarity.apply(sb1, sb2);
>         j2 = similarity.apply(sb1, sb3);
>         System.out.printf("%2d %f %f : %s vs (%s or %s)%n", 
>                           sb1.length(), j1, j2, sb1, sb2, sb3);
>         // Extend the sequence using unique characters for each
>         sb1.append(ch1++);
>         sb2.append(ch2++);
>         sb3.append(ch3++);
>         // Note: Check length since the sequences will overlap
>         // in case the rounding is not present
>     } while (j1 != j2 && sb1.length() < 26); 
> }
> {code}
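
The collapse in the last row of the tables above can be reproduced in isolation. The following is a minimal sketch, assuming the rounding is done with `Math.round` on the value scaled by 100 (the issue does not show the actual rounding code), and taking the two scores in that row to be 1/18 and 1/17, which match the printed 0.055556 and 0.058824. `RoundingSketch` and `roundTo2` are hypothetical names.

```java
// Hypothetical sketch for illustration; not the Commons Text implementation.
class RoundingSketch {

    /** Assumption: round to 2 decimal places by scaling and Math.round. */
    static double roundTo2(double value) {
        return Math.round(value * 100d) / 100d;
    }

    public static void main(String[] args) {
        // Scores taken from the last row of the tables (sequence length 10).
        double first = 1.0 / 18;
        double second = 1.0 / 17;

        // With rounding, the two candidates cannot be ranked.
        System.out.printf("rounded:   %f %f%n", roundTo2(first), roundTo2(second));
        // Without rounding, the second candidate scores higher.
        System.out.printf("unrounded: %f %f%n", first, second);
    }
}
```

Both rounded values land on the same two-decimal score, so a ranking based on them treats the candidates as ties even though the unrounded scores differ.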


