kinow commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1244213822
##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
new String[] { "JILES", "JAL" },
// violates 6: if the last two characters are AY, remove A
new String[] { "CARRAWAY", "CARY" }, // Original: CARAY
- new String[] { "YAMADA", "YANAD" });
+ new String[] { "YAMADA", "YANAD" },
+ new String[] { "ASH", "A"});
Review Comment:
I couldn't find a C++ implementation or a good source on Wikipedia, so I
checked another implementation, `phonics` in R. Its NYSIIS implementation
supports a "modified" key. For "ASH" it gives "" too, but the
modified version gives "AS".
```R
> install.packages('phonics')
> library('phonics')
> nysiis(c("ASH"))
[1] ""
> nysiis(c("ASH"), modified=TRUE)
[1] "AS"
> nysiis(c("ASHBURTON"))
[1] "ASBART"
```
Their
[vignette](https://cloud.r-project.org/web/packages/phonics/phonics.pdf) (PDF)
(a vignette is like Javadoc for an R package) documents the basic algorithm,
but links to this paper, which explains the whole package a lot better: [James P.
Howard, II, Phonetic Spelling Algorithm Implementations for
R](https://www.jstatsoft.org/article/view/v095i08)

Searching more about those papers after writing the text above, I found this
issue, which describes the same thing I just said :grimacing: :
[CODEC-235](https://issues.apache.org/jira/browse/CODEC-235)

So I think we should document that the current version in Commons Codec is
the one from the first paper, and then maybe add the other implementation as a
separate class/method and let users pick which one they want to use.
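
If we go that way, the R results above could be kept as per-variant expectations so each implementation is tested against the right baseline. This is only a rough sketch: the class name `NysiisVariantExpectations`, the `Variant` enum, and the `mismatches` helper are hypothetical and not part of Commons Codec, and the expected values are just the three results from the `phonics` session quoted above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical test fixture; nothing here exists in Commons Codec. */
final class NysiisVariantExpectations {

    /** Hypothetical names for the two behaviours seen in the R session. */
    enum Variant { ORIGINAL, MODIFIED }

    /** Input name -> expected key, copied from the phonics output quoted above. */
    static Map<String, String> expected(Variant variant) {
        Map<String, String> m = new LinkedHashMap<>();
        if (variant == Variant.ORIGINAL) {
            m.put("ASH", "");             // nysiis(c("ASH"))
            m.put("ASHBURTON", "ASBART"); // nysiis(c("ASHBURTON"))
        } else {
            m.put("ASH", "AS");           // nysiis(c("ASH"), modified=TRUE)
        }
        return m;
    }

    /** Runs any encoder against one variant's expectations and lists the disagreements. */
    static List<String> mismatches(Variant variant, UnaryOperator<String> encoder) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : expected(variant).entrySet()) {
            String actual = encoder.apply(e.getKey());
            if (!e.getValue().equals(actual)) {
                out.add(e.getKey() + ": expected \"" + e.getValue()
                        + "\" but got \"" + actual + "\"");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in encoder (an assumption for the demo, not a real NYSIIS):
        // it always returns "A", the value the new test case expects for "ASH".
        UnaryOperator<String> standIn = name -> "A";
        for (Variant v : Variant.values()) {
            System.out.println(v + " -> " + mismatches(v, standIn));
        }
    }
}
```

A real test would pass the actual encoder instead of the stand-in, for instance a lambda wrapping `Nysiis#encode(String)`, and assert that the returned list is empty for the variant that encoder is supposed to implement.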