[GitHub] [commons-codec] kinow commented on a diff in pull request #189: CODEC-308: change NYSIIS encoding to not remove the first character i…

via GitHub Tue, 27 Jun 2023 11:58:12 -0700


kinow commented on code in PR #189:
URL: https://github.com/apache/commons-codec/pull/189#discussion_r1244213822



##########
src/test/java/org/apache/commons/codec/language/NysiisTest.java:
##########
@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                 new String[] { "JILES", "JAL" },
                 // violates 6: if the last two characters are AY, remove A
                 new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});

Review Comment:
   I couldn't find a C++ implementation or a good source on Wikipedia, so I 
checked another implementation, the `phonics` package for R. Its `nysiis` 
function accepts a `modified` flag. For "ASH" it returns "" too, but the 
modified version returns "AS".
   
   ```R
   > install.packages('phonics')
   > library('phonics')
   > nysiis(c("ASH"))
   [1] ""
   > nysiis(c("ASH"), modified=TRUE)
   [1] "AS"
   > nysiis(c("ASHBURTON"))
   [1] "ASBART"
   ```
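
To make the "ASH" case concrete, here is a minimal Java sketch of only the trailing-character cleanup (drop a final S, turn a final AY into Y, drop a final A). It is not the Commons Codec code: the class `NysiisTailSketch`, the method `cleanupTail`, and the `keepFirstChar` flag are hypothetical names, and it assumes the key for "ASH" is already "AS" when the cleanup starts.

```java
// Illustrative sketch only; names and rule order are assumptions, not the library's code.
class NysiisTailSketch {

    /**
     * Applies the trailing-character rules to an already transcoded key.
     * When keepFirstChar is true, the first character is never removed.
     */
    static String cleanupTail(String key, boolean keepFirstChar) {
        StringBuilder sb = new StringBuilder(key);
        // Smallest length the key may shrink to.
        int min = keepFirstChar ? 1 : 0;

        // If the last character is S, remove it.
        if (sb.length() > min && sb.charAt(sb.length() - 1) == 'S') {
            sb.setLength(sb.length() - 1);
        }
        // If the last two characters are AY, remove the A.
        if (sb.length() - 2 >= min
                && sb.charAt(sb.length() - 2) == 'A'
                && sb.charAt(sb.length() - 1) == 'Y') {
            sb.deleteCharAt(sb.length() - 2);
        }
        // If the last character is A, remove it.
        if (sb.length() > min && sb.charAt(sb.length() - 1) == 'A') {
            sb.setLength(sb.length() - 1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Brackets make an empty key visible in the output.
        System.out.println("[" + cleanupTail("AS", false) + "]");   // prints []
        System.out.println("[" + cleanupTail("AS", true) + "]");    // prints [A]
        System.out.println("[" + cleanupTail("CARAY", true) + "]"); // prints [CARY]
    }
}
```

Without the guard, "AS" loses the S and then the remaining A, which would explain the empty key; with the guard the key stops at "A", the value the new test case expects.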
   
   Their 
[vignette](https://cloud.r-project.org/web/packages/phonics/phonics.pdf) (PDF; 
a vignette is roughly the Javadoc of an R package) documents the basic 
algorithm, and links to this paper, which explains the whole package much 
better: [James P. Howard, II, Phonetic Spelling Algorithm Implementations for 
R](https://www.jstatsoft.org/article/view/v095i08)
   
   
![image](https://github.com/apache/commons-codec/assets/304786/44bf589f-4e98-4cf2-8c24-fc099c665933)
   
   While searching for more on those papers after writing the text above, I 
found an issue that already describes the same thing :grimacing: : 
[CODEC-235](https://issues.apache.org/jira/browse/CODEC-235)
   
   So I think we should document that the current version in Commons Codec is 
the one from the first paper, and then maybe add the other implementation as a 
separate class/method and let users pick which one they want to use.



