[jira] [Updated] (TEXT-131) JaroWinklerDistance: Calculation deviates from definition
     [ https://issues.apache.org/jira/browse/TEXT-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pascal Schumacher updated TEXT-131:
-----------------------------------
    Affects Version/s: 1.4

> JaroWinklerDistance: Calculation deviates from definition
> ---------------------------------------------------------
>
>                 Key: TEXT-131
>                 URL: https://issues.apache.org/jira/browse/TEXT-131
>             Project: Commons Text
>          Issue Type: Bug
>    Affects Versions: 1.4
>            Reporter: Jan Martin Keil
>            Assignee: Rob Tompkins
>            Priority: Major
>
> The calculation in {{JaroWinklerDistance}} deviates from the definition of
> the Jaro-Winkler Similarity. By definition, the common prefix length is only
> determined for the first 4 characters. Further, the Jaro-Winkler Similarity
> is defined as
> {{JaroSimilarity + ScalingFactor * CommonPrefixLength * (1 - JaroSimilarity)}}.
> Therefore, I recommend the following changes:
> # Update Jaro-Winkler Similarity calculation
> {code:java}
> final double jw = j < 0.7D ? j : j + Math.min(defaultScalingFactor, 1D / mtp[3]) * mtp[2] * (1D - j);
> {code}
> to
> {code:java}
> final double jw = j < 0.7D ? j : j + defaultScalingFactor * mtp[2] * (1D - j);
> {code}
> # Update calculation of Common Prefix Length
> {code:java}
> for (int mi = 0; mi < min.length(); mi++) {
> {code}
> to
> {code:java}
> for (int mi = 0; mi < Math.min(4, min.length()); mi++) {
> {code}
> # Remove unnecessary return value
> {code:java}
> return new int[] {matches, transpositions, prefix, max.length()};
> {code}
> to
> {code:java}
> return new int[] {matches, transpositions, prefix};
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
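To make the definition quoted in the issue concrete, here is a minimal, self-contained sketch of the calculation with the common prefix capped at 4 characters. It is not the Commons Text implementation: the class name `JaroWinklerSketch`, its method names, and the scaling factor of 0.1 are assumptions made for illustration (the issue refers to `defaultScalingFactor` without stating its value). The 0.7 threshold is taken from the code quoted above.

```java
/** Illustrative sketch only; class and method names are hypothetical. */
final class JaroWinklerSketch {

    // Assumed value; the issue only names defaultScalingFactor.
    private static final double SCALING_FACTOR = 0.1;
    // Per the definition cited in the issue, only the first 4 characters count.
    private static final int MAX_PREFIX_LENGTH = 4;
    // Threshold below which no prefix bonus is applied (from the quoted code).
    private static final double BOOST_THRESHOLD = 0.7;

    private JaroWinklerSketch() { }

    /** Plain Jaro similarity in [0, 1]. */
    static double jaro(CharSequence a, CharSequence b) {
        int lenA = a.length();
        int lenB = b.length();
        if (lenA == 0 && lenB == 0) {
            return 1.0;
        }
        if (lenA == 0 || lenB == 0) {
            return 0.0;
        }
        // Characters match only if they are equal and at most `window` positions apart.
        int window = Math.max(0, Math.max(lenA, lenB) / 2 - 1);
        boolean[] matchedA = new boolean[lenA];
        boolean[] matchedB = new boolean[lenB];
        int matches = 0;
        for (int i = 0; i < lenA; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(lenB - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matchedB[j] && a.charAt(i) == b.charAt(j)) {
                    matchedA[i] = true;
                    matchedB[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order in the two strings.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < lenA; i++) {
            if (!matchedA[i]) {
                continue;
            }
            while (!matchedB[j]) {
                j++;
            }
            if (a.charAt(i) != b.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        // One common convention: transpositions = out-of-order matches / 2 (integer division).
        int transpositions = outOfOrder / 2;
        double m = matches;
        return (m / lenA + m / lenB + (m - transpositions) / m) / 3.0;
    }

    /** Length of the common prefix, capped at MAX_PREFIX_LENGTH. */
    static int commonPrefixLength(CharSequence a, CharSequence b) {
        int limit = Math.min(MAX_PREFIX_LENGTH, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < limit && a.charAt(prefix) == b.charAt(prefix)) {
            prefix++;
        }
        return prefix;
    }

    /** JaroSimilarity + ScalingFactor * CommonPrefixLength * (1 - JaroSimilarity). */
    static double similarity(CharSequence a, CharSequence b) {
        double j = jaro(a, b);
        if (j < BOOST_THRESHOLD) {
            return j;
        }
        return j + SCALING_FACTOR * commonPrefixLength(a, b) * (1.0 - j);
    }
}
```

With these assumptions, `JaroWinklerSketch.similarity("MARTHA", "MARHTA")` evaluates to roughly 0.961 (Jaro similarity 0.944 plus a prefix bonus for "MAR"). Because the prefix is capped at 4 and the scaling factor is 0.1, the bonus term is at most `0.4 * (1 - j)`, so the result cannot exceed 1.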
[jira] [Updated] (TEXT-131) JaroWinklerDistance: Calculation deviates from definition
     [ https://issues.apache.org/jira/browse/TEXT-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rob Tompkins updated TEXT-131:
------------------------------
    Assignee: Rob Tompkins