[
https://issues.apache.org/jira/browse/CODEC-248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570937#comment-16570937
]
Ben Kazez commented on CODEC-248:
---------------------------------
================ Ben
Right now GIERSZLIK is matching GOTSALK. This is because GIERSZLIK
codes to either 548500 or 594850, and GOTSALK codes to 548500. According
to Morse's site, GIERSZLIK should code to just 594850. I believe the
confusion is about this:
> When adjacent sounds can combine to form a larger sound, they are given the
> code number of the larger sound
SZ and RS are both "larger sounds." One listing of rules I found
online says that the tokens must be matched in order, which means that
"RSZ" would be interpreted as "R SZ" instead of "RS Z". That makes
sense to me, but I didn't find any mention of that on Avotaynu after
some brief searching.
Is there some official standard for D-M rules? What does it say about
when two "larger sound" interpretations are possible?
Many thanks!
Ben
================ Gary
Ben:
I would drop RS from the table. Randy Daitch created the table and I cannot
think of any language where RS is pronounced "S" (4).
Gary
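
To make the effect of dropping RS concrete, here is a minimal sketch, not
the Commons Codec implementation. The class DmRsSketch and its methods
rules and encode are hypothetical names. The rule subset is an assumption:
it covers only the letters in these two names, with values chosen to
reproduce the codes quoted in this thread. The longest-match tokenization,
the duplicate-collapsing rule, and the treatment of non-initial vowels as
uncoded are likewise simplifications.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DmRsSketch {

    /** Illustrative rule subset (an assumption); the RS entry is optional. */
    static Map<String, String[]> rules(boolean includeRs) {
        Map<String, String[]> r = new LinkedHashMap<>();
        if (includeRs) {
            r.put("RS", new String[] {"94", "4"}); // the branching rule under discussion
        }
        r.put("SZ", new String[] {"4"});
        r.put("TS", new String[] {"4"});
        r.put("G", new String[] {"5"});
        r.put("K", new String[] {"5"});
        r.put("L", new String[] {"8"});
        r.put("R", new String[] {"9"});
        r.put("S", new String[] {"4"});
        r.put("Z", new String[] {"4"});
        for (String vowel : new String[] {"A", "E", "I", "O"}) {
            r.put(vowel, new String[] {""}); // non-initial vowels are left uncoded here
        }
        return r;
    }

    /** Encodes an upper-case name; returns every six-digit code the rules allow. */
    static Set<String> encode(String name, Map<String, String[]> rules) {
        // Each branch is {codeSoFar, lastReplacement}.
        List<String[]> branches = new ArrayList<>();
        branches.add(new String[] {"", null});
        int i = 0;
        while (i < name.length()) {
            // Pick the longest rule token that matches at position i.
            String token = null;
            for (String t : rules.keySet()) {
                if (name.startsWith(t, i) && (token == null || t.length() > token.length())) {
                    token = t;
                }
            }
            if (token == null) {
                throw new IllegalArgumentException("No rule for: " + name.charAt(i));
            }
            List<String[]> next = new ArrayList<>();
            for (String[] b : branches) {
                for (String code : rules.get(token)) {
                    // Skip a code that repeats the end of the previous replacement.
                    boolean dup = b[1] != null && b[1].endsWith(code);
                    next.add(new String[] {dup ? b[0] : b[0] + code, code});
                }
            }
            branches = next;
            i += token.length();
        }
        Set<String> out = new LinkedHashSet<>();
        for (String[] b : branches) {
            out.add((b[0] + "000000").substring(0, 6)); // pad or truncate to six digits
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode("GIERSZLIK", rules(true)));  // [594850, 548500]
        System.out.println(encode("GOTSALK", rules(true)));    // [548500]
        System.out.println(encode("GIERSZLIK", rules(false))); // [594850]
    }
}
```

With the RS entry present, GIERSZLIK and GOTSALK share 548500. Without it,
"RSZ" falls back to R followed by SZ, GIERSZLIK codes only to 594850, and
the two names no longer collide.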
> language.DaitchMokotoffSoundex gives overly broad results for tokens
> containing RS
> ----------------------------------------------------------------------------------
>
> Key: CODEC-248
> URL: https://issues.apache.org/jira/browse/CODEC-248
> Project: Commons Codec
> Issue Type: Bug
> Reporter: Ben Kazez
> Priority: Minor
>
> I am using Apache commons codec in Elasticsearch (via Lucene).
> # GIERSZLIK codes to 548500 or 594850
> # GOTSALK codes to 548500
> # These names don't sound alike, but the matching codes mean a search for
> one returns the other.
> Solution: I exchanged emails with Gary Mokotoff, co-creator of the algorithm,
> who said:
> {quote}I would drop RS from the table. ... I cannot think of any language
> where RS is pronounced "S" (4).{quote}
>
>