Alex D Herbert created TEXT-157:
-----------------------------------

             Summary: Remove rounding from JaccardSimilarity and Distance
                 Key: TEXT-157
                 URL: https://issues.apache.org/jira/browse/TEXT-157
             Project: Commons Text
          Issue Type: Improvement
    Affects Versions: 1.6
            Reporter: Alex D Herbert
            Assignee: Alex D Herbert


{{JaccardSimilarity}} rounds its result to 2 decimal places. As a result, 
sequence pairs with different similarities can receive the same score, even 
when the sequences are only moderately short.
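
To make the effect concrete, here is a minimal standalone sketch. It is not the Commons Text source: the class {{JaccardRoundingSketch}} and its methods {{jaccard}}, {{toSet}} and {{roundTo2}} are hypothetical names, and it assumes the similarity is the Jaccard index over the sets of distinct characters in each sequence, with the rounding done as {{Math.round(x * 100d) / 100d}}. Returning 0 for an empty input is a choice made for this sketch only.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the Commons Text implementation.
class JaccardRoundingSketch {

    /** Unrounded Jaccard index over the sets of distinct characters. */
    static double jaccard(CharSequence left, CharSequence right) {
        if (left.length() == 0 || right.length() == 0) {
            return 0.0; // assumption for this sketch only
        }
        Set<Character> leftSet = toSet(left);
        Set<Character> rightSet = toSet(right);
        Set<Character> union = new HashSet<>(leftSet);
        union.addAll(rightSet);
        Set<Character> intersection = new HashSet<>(leftSet);
        intersection.retainAll(rightSet);
        return (double) intersection.size() / union.size();
    }

    /** Collects the distinct characters of a sequence. */
    static Set<Character> toSet(CharSequence cs) {
        Set<Character> set = new HashSet<>();
        for (int i = 0; i < cs.length(); i++) {
            set.add(cs.charAt(i));
        }
        return set;
    }

    /** Rounds to 2 decimal places (assumed form of the rounding). */
    static double roundTo2(double value) {
        return Math.round(value * 100d) / 100d;
    }

    public static void main(String[] args) {
        // The length-10 sequences from the example output in this issue.
        String s1 = "aaDEFGHIJK";
        String s2 = "abdefghijk";
        String s3 = "aa\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5\u00c6\u00c7";
        double j1 = jaccard(s1, s2);
        double j2 = jaccard(s1, s3);
        System.out.printf("raw:     %f %f%n", j1, j2);
        System.out.printf("rounded: %f %f%n", roundTo2(j1), roundTo2(j2));
    }
}
```

Under these assumptions the two raw scores differ (1/18 and 1/17), but both round to 0.06, which is the collision this issue describes.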

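Using sequences that have 1 or 2 characters in common and whose remaining 
characters are all different (columns: sequence length, score with 1 character 
in common, score with 2 characters in common):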
{noformat}
 2 0.500000 1.000000 : aa vs (ab or aa)
 3 0.250000 0.330000 : aaD vs (abd or aaÀ)
 4 0.170000 0.200000 : aaDE vs (abde or aaÀÁ)
 5 0.130000 0.140000 : aaDEF vs (abdef or aaÀÁÂ)
 6 0.100000 0.110000 : aaDEFG vs (abdefg or aaÀÁÂÃ)
 7 0.080000 0.090000 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
 8 0.070000 0.080000 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
 9 0.060000 0.070000 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
10 0.060000 0.060000 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
{noformat}
Without rounding, the scores differ where rounding had previously produced 
the same score:
{noformat}
 2 0.500000 1.000000 : aa vs (ab or aa)
 3 0.250000 0.333333 : aaD vs (abd or aaÀ)
 4 0.166667 0.200000 : aaDE vs (abde or aaÀÁ)
 5 0.125000 0.142857 : aaDEF vs (abdef or aaÀÁÂ)
 6 0.100000 0.111111 : aaDEFG vs (abdefg or aaÀÁÂÃ)
 7 0.083333 0.090909 : aaDEFGH vs (abdefgh or aaÀÁÂÃÄ)
 8 0.071429 0.076923 : aaDEFGHI vs (abdefghi or aaÀÁÂÃÄÅ)
 9 0.062500 0.066667 : aaDEFGHIJ vs (abdefghij or aaÀÁÂÃÄÅÆ)
10 0.055556 0.058824 : aaDEFGHIJK vs (abdefghijk or aaÀÁÂÃÄÅÆÇ)
{noformat}
Generated using:
{code:java}
@Test
public void roundingDemo() {
    // First character of each dissimilar sequence.
    // Chosen for a nice output where we already know the loop
    // will exit before sequence overlap.
    char ch1 = 'D';
    char ch2 = 'd';
    char ch3 = 0x00c0;
    // 1 or 2 characters in common
    StringBuilder sb1 = new StringBuilder("aa");
    StringBuilder sb2 = new StringBuilder("ab"); // 1 in common
    StringBuilder sb3 = new StringBuilder("aa"); // 2 in common
    JaccardSimilarity similarity = new JaccardSimilarity();
    // Extend the sequences until a single/double character 
    // similarity cannot be detected
    double j1, j2;
    do  {
        j1 = similarity.apply(sb1, sb2);
        j2 = similarity.apply(sb1, sb3);
        System.out.printf("%2d %f %f : %s vs (%s or %s)%n", 
                          sb1.length(), j1, j2, sb1, sb2, sb3);
        // Extend the sequence using unique characters for each
        sb1.append(ch1++);
        sb2.append(ch2++);
        sb3.append(ch3++);

        // Note: Also bound the length. If the rounding is not present
        // the scores never become equal, and the appended characters
        // would eventually overlap between the sequences.
    } while (j1 != j2 && sb1.length() < 26); 
}

{code}
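
The scores in both tables can be reproduced by hand. The following is a sketch that assumes the similarity is computed over the sets of distinct characters in each sequence, which the unrounded scores above are consistent with. The symbols A, B, C (the character sets of the first, second and third sequence) and n (the sequence length) are introduced here for illustration only. Note that under this assumption the repeated 'a' counts once, so the "2 in common" pair shares a single distinct character.

```latex
% For sequence length n >= 2:
%   A = chars of "aa" + (n-2) unique upper-case letters,  |A| = n - 1
%   B = chars of "ab" + (n-2) unique lower-case letters,  |B| = n
%   C = chars of "aa" + (n-2) unique accented letters,    |C| = n - 1
\[
  |A \cap B| = |A \cap C| = 1
\]
\[
  J(A,B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} = \frac{1}{2n-2},
  \qquad
  J(A,C) = \frac{|A \cap C|}{|A| + |C| - |A \cap C|} = \frac{1}{2n-3}
\]
% At n = 10 the two scores differ but round to the same value:
\[
  J(A,B) = \tfrac{1}{18} \approx 0.0556 \to 0.06,
  \qquad
  J(A,C) = \tfrac{1}{17} \approx 0.0588 \to 0.06
\]
```

At n = 9 the values are 1/16 = 0.0625 and 1/15, approximately 0.0667, which still round to different scores (0.06 and 0.07), matching the last distinguishable row of the rounded table.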



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
