[
https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Neidhart updated CODEC-187:
----------------------------------
Attachment: CODEC-187.patch
Attached a patch that makes the following changes:
* adds an entry to NOTICE.txt about the origin of the code
* fixes rule files that still contained some PHP artifacts (namely $)
* adds a unit test for the case found by Michael
* adds @see tags to the BeiderMorseEncoder with references to the homepage of
the original authors
* adapts unit tests, as more tokens are now returned in some cases
There is still one difference from the original code: the output of our
encoder is sorted, while the original code does not sort its output.
Example: angelo
* original
{noformat}
angilo angYlo agilo ongilo ongYlo ogilo Yngilo YngYlo anxilo onxilo anilo onilo
aniilo oniilo anzilo onzilo
{noformat}
* ours
{noformat}
YngYlo Yngilo agilo angYlo angilo aniilo anilo anxilo anzilo ogilo ongYlo
ongilo oniilo onilo onxilo onzilo
{noformat}
The relevant code line is in PhoneticEngine.java line 339:
{noformat}
final Set<Rule.Phoneme> phonemes = new TreeSet<Rule.Phoneme>(Rule.Phoneme.COMPARATOR);
{noformat}
Changing this to a LinkedHashSet restores the original behavior, although I am
not sure we should do this, as it might introduce a regression in Solr.
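
To illustrate the ordering difference, here is a minimal sketch that uses plain
Strings as a stand-in for Rule.Phoneme and natural String ordering as a
stand-in for Rule.Phoneme.COMPARATOR (both are assumptions for illustration;
the class name PhonemeOrderDemo is hypothetical and not part of the patch):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class PhonemeOrderDemo {

    // Tokens for "angelo" in the order produced by the original code
    // (copied from the "original" listing above).
    static final List<String> ORIGINAL_ORDER = Arrays.asList(
            "angilo", "angYlo", "agilo", "ongilo", "ongYlo", "ogilo",
            "Yngilo", "YngYlo", "anxilo", "onxilo", "anilo", "onilo",
            "aniilo", "oniilo", "anzilo", "onzilo");

    // TreeSet iterates in sorted order. Natural String ordering is used here
    // as an assumed stand-in for Rule.Phoneme.COMPARATOR.
    static List<String> sortedOrder(List<String> tokens) {
        Set<String> set = new TreeSet<String>(tokens);
        return new ArrayList<String>(set);
    }

    // LinkedHashSet iterates in insertion order, so the sequence in which
    // tokens were generated is preserved.
    static List<String> insertionOrder(List<String> tokens) {
        Set<String> set = new LinkedHashSet<String>(tokens);
        return new ArrayList<String>(set);
    }

    public static void main(String[] args) {
        System.out.println("TreeSet:       " + String.join(" ", sortedOrder(ORIGINAL_ORDER)));
        System.out.println("LinkedHashSet: " + String.join(" ", insertionOrder(ORIGINAL_ORDER)));
    }
}
```

Both sets drop duplicates, so only the iteration order differs. Any caller
that relies on the sorted order would see a change in behavior if the set
type were switched.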
> Beider Morse Phonetic Matching producing incorrect tokens
> ---------------------------------------------------------
>
> Key: CODEC-187
> URL: https://issues.apache.org/jira/browse/CODEC-187
> Project: Commons Codec
> Issue Type: Bug
> Affects Versions: 1.9
> Reporter: michael tobias
> Priority: Minor
> Attachments: CODEC-187.patch
>
>
> I believe the Beider Morse Phonetic Matching algorithm was added in Commons
> Codec 1.6.
> The BMPM algorithm is an EVOLVING algorithm that is currently on version 3.02,
> though it had been static since version 3.01 dated 19 Dec 2011 (it was first
> available as open source as version 1.00 on 6 May 2009).
> I can see nothing in the Commons Codec Docs to say which version of BMPM was
> implemented so I am not sure if the problem with the algorithm as coded in
> the Codec is simply an old version or whether there are more basic problems
> with the implementation.
> How do I determine the version of the algorithm that was implemented in the
> Commons Codec?
> How do we ensure that the algorithm is updated if/when the BMPM algorithm
> changes?
> How do we ensure that the algorithm as coded in the Commons Codec is accurate
> and working as expected?