[
https://issues.apache.org/jira/browse/CODEC-84?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Niall Pemberton updated CODEC-84:
---------------------------------
Description:
The new test case (CODEC-83) has highlighted a number of issues with the
"alternative" encoding in the Double Metaphone implementation
1) Bug in the handleG method when "G" is followed by "IER"
* The alternative encoding of "Angier" results in "ANKR" rather than "ANJR"
* The alternative encoding of "rogier" results in "RKR" rather than "RJR"
The problem is in the handleG() method and is caused by the wrong length (4
instead of 3) being used in the contains() method:
{code}
} else if (contains(value, index + 1, 4, "IER")) {
{code}
...this should be
{code}
} else if (contains(value, index + 1, 3, "IER")) {
{code}
2) Bug in the handleL method
* The alternative encoding of "cabrillo" results in "KPRL " rather than "KPR"
The problem is that the first thing this method does is append an "L" to both
primary & alternative encoding. When the conditionL0() method returns true then
the "L" should not be appended for the alternative encoding
{code}
result.append('L');
if (charAt(value, index + 1) == 'L') {
if (conditionL0(value, index)) {
result.appendAlternate(' ');
}
index += 2;
} else {
index++;
}
return index;
{code}
Suggest refeactoring this to
{code}
if (charAt(value, index + 1) == 'L') {
if (conditionL0(value, index)) {
result.appendPrimary('L');
} else {
result.append('L');
}
index += 2;
} else {
result.append('L');
index++;
}
return index;
{code}
3) Bug in the conditionL0() method for words ending in "AS" and "OS"
* The alternative encoding of "gallegos" results in "KLKS" rather than "KKS"
The problem is caused by the wrong start position being used in the contains()
method, which means its not checking the last two characters of the word but
checks the previous & current position instead:
{code}
} else if ((contains(value, index - 1, 2, "AS", "OS") ||
{code}
...this should be
{code}
} else if ((contains(value, value.length() - 2, 2, "AS", "OS") ||
{code}
I'll attach a patch for review
was:
The new test case (CODEC-83) has highlighted a number of issues with the
"alternative" encoding in the Double Metaphone implementation
1) Bug in the handleG method when "G" is followed by "IER"
* The alternative encoding of "Angier" results in "ANKR" rather than "ANJR"
* The alternative encoding of "rogier" results in "RKR" rather than "RJR"
The problem is in the handleG() method and is caused by the wrong length (4
instead of 3) being used in the contains() method:
{code}
} else if (contains(value, index + 1, 4, "IER")) {
{code}
...this should be
{code}
} else if (contains(value, index + 1, 3, "IER")) {
{code}
2) Bug in the handleL method
* The alternative encoding of "cabrillo" results in "KPRL " rather than "KPR"
The problem is that the first thing this method does is append an "L" to both
primary & alternative encoding. When the conditionL0() method returns true then
the "L" should not be appended for the alternative encoding
{code}
result.append('L');
if (charAt(value, index + 1) == 'L') {
if (conditionL0(value, index)) {
result.appendAlternate(' ');
}
index += 2;
} else {
index++;
}
return index;
{code}
Suggest refeactoring this to
{code}
if (charAt(value, index + 1) == 'L') {
if (conditionL0(value, index)) {
result.appendPrimary('L');
} else {
result.append('L');
}
index += 2;
} else {
result.append('L');
index++;
}
return index;
{code}
3) Bug in the conditionL0() method for words ending in "AS" and "OS"
* The alternative encoding of "gallegos" results in "KLKS" rather than "KKS"
The problem is caused by the wrong start position being used in the contains()
method, which means its not checking the last two characters of the word but
checks the previous & current position instead:
{code}
} else if ((contains(value, index - 1, 2, "AS", "OS") ||
{code}
...this should be
{code}
} else if ((contains(value, value.length() - 2, 2, "AS", "OS") ||
{code}
I'll attach a patch for review
> Double Metaphone bugs in alternative encoding
> ---------------------------------------------
>
> Key: CODEC-84
> URL: https://issues.apache.org/jira/browse/CODEC-84
> Project: Commons Codec
> Issue Type: Bug
> Affects Versions: 1.3
> Reporter: Niall Pemberton
> Priority: Minor
> Fix For: 1.4
>
>
> The new test case (CODEC-83) has highlighted a number of issues with the
> "alternative" encoding in the Double Metaphone implementation
> 1) Bug in the handleG method when "G" is followed by "IER"
> * The alternative encoding of "Angier" results in "ANKR" rather than "ANJR"
> * The alternative encoding of "rogier" results in "RKR" rather than "RJR"
> The problem is in the handleG() method and is caused by the wrong length (4
> instead of 3) being used in the contains() method:
> {code}
> } else if (contains(value, index + 1, 4, "IER")) {
> {code}
> ...this should be
> {code}
> } else if (contains(value, index + 1, 3, "IER")) {
> {code}
> 2) Bug in the handleL method
> * The alternative encoding of "cabrillo" results in "KPRL " rather than "KPR"
> The problem is that the first thing this method does is append an "L" to both
> primary & alternative encoding. When the conditionL0() method returns true
> then the "L" should not be appended for the alternative encoding
> {code}
> result.append('L');
> if (charAt(value, index + 1) == 'L') {
> if (conditionL0(value, index)) {
> result.appendAlternate(' ');
> }
> index += 2;
> } else {
> index++;
> }
> return index;
> {code}
> Suggest refeactoring this to
> {code}
> if (charAt(value, index + 1) == 'L') {
> if (conditionL0(value, index)) {
> result.appendPrimary('L');
> } else {
> result.append('L');
> }
> index += 2;
> } else {
> result.append('L');
> index++;
> }
> return index;
> {code}
> 3) Bug in the conditionL0() method for words ending in "AS" and "OS"
> * The alternative encoding of "gallegos" results in "KLKS" rather than "KKS"
> The problem is caused by the wrong start position being used in the
> contains() method, which means its not checking the last two characters of
> the word but checks the previous & current position instead:
> {code}
> } else if ((contains(value, index - 1, 2, "AS", "OS") ||
> {code}
> ...this should be
> {code}
> } else if ((contains(value, value.length() - 2, 2, "AS", "OS") ||
> {code}
> I'll attach a patch for review
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.