[jira] [Commented] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens

michael tobias (JIRA) Thu, 19 Jun 2014 23:54:06 -0700

    [ 
https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038540#comment-14038540
 ]


michael tobias commented on CODEC-187:
--------------------------------------

I think Steve's page is misleading. I am discussing it with him.

Hebrew should only be encoded when the Hebrew alphabet is used.  Also, there is 
no such concept as APPROX in Hebrew; there is only EXACT. So basically ignore 
the APPROX/EXACT setting and use EXACT.  I am not sure whether GENERIC / 
ASHKENAZI / SEPHARDIC is ignored too; I am checking. 
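
To illustrate the rule described above (force EXACT whenever the input is written in the Hebrew alphabet), here is a minimal sketch using only the JDK. The class name HebrewScriptCheck, the helper effectiveRuleType and the plain strings "EXACT" / "APPROX" are hypothetical stand-ins chosen for this example; they are not the Commons Codec or Solr API, and the real implementation may decide this differently.

```java
/**
 * Sketch only: detects Hebrew-script input and, if found, overrides the
 * requested match type with EXACT. All names here are hypothetical.
 */
final class HebrewScriptCheck {

    private HebrewScriptCheck() {
    }

    /** Returns true if any code point of the input belongs to the Hebrew Unicode script. */
    static boolean containsHebrew(String input) {
        return input.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HEBREW);
    }

    /**
     * Returns "EXACT" for Hebrew-script input, otherwise the requested setting unchanged.
     * The string values are illustrative labels, not library constants.
     */
    static String effectiveRuleType(String input, String requested) {
        return containsHebrew(input) ? "EXACT" : requested;
    }
}
```

With this sketch, effectiveRuleType("\u05D0\u05D1\u05E8\u05DD", "APPROX") yields "EXACT" (the escapes spell the name אברם), while effectiveRuleType("abram", "APPROX") leaves the setting as "APPROX".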

In the meantime, for the Hebrew name  (abram) אברם

BMPM should give 2 tokens, 1brm and 1vrm.  Solr is currently giving 2 identical 
tokens, both Lbrm.  So there is still a problem, but possibly not as large as 
first thought.

I thought the Java implementation always ensured no duplicate tokens were 
returned?
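
For reference, the duplicate-free behaviour I would expect can be sketched as below. It assumes the alternatives for a single word arrive as one pipe-separated string (e.g. "Lbrm|Lbrm"); the class TokenDedup and its method are hypothetical and are not how the Codec or the Solr filter is actually written.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Sketch only: removes repeated phonetic tokens from a pipe-separated
 * string while keeping the first-seen order. Names are hypothetical.
 */
final class TokenDedup {

    private TokenDedup() {
    }

    /** Splits on '|', drops repeats via an insertion-ordered set, and rejoins with '|'. */
    static String dedupe(String encoded) {
        Set<String> unique = new LinkedHashSet<>(Arrays.asList(encoded.split("\\|")));
        return String.join("|", unique);
    }
}
```

Under that assumption, dedupe("Lbrm|Lbrm") returns "Lbrm", and dedupe("1brm|1vrm") is returned unchanged because the two tokens differ.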

Very few sites use Hebrew BMPM with Solr; I know of only 1.  I am going to 
contact them to see exactly how they are currently set up.

M


> Beider Morse Phonetic Matching producing incorrect tokens
> ---------------------------------------------------------
>
>                 Key: CODEC-187
>                 URL: https://issues.apache.org/jira/browse/CODEC-187
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.9
>            Reporter: michael tobias
>            Priority: Minor
>             Fix For: 1.10
>
>         Attachments: CODEC-187.patch, CODEC-187_ashkenazi_approx_any.patch, 
> CODEC-187_ashkenazi_approx_any_v2.patch
>
>
> I believe the Beider Morse Phonetic Matching algorithm was added in Commons 
> Codec 1.6.
> The BMPM algorithm is an EVOLVING algorithm that is currently on version 3.02, 
> though it had been static since version 3.01 dated 19 Dec 2011 (it was first 
> available as open source as version 1.00 on 6 May 2009).
> I can see nothing in the Commons Codec Docs to say which version of BMPM was 
> implemented so I am not sure if the problem with the algorithm as coded in 
> the Codec is simply an old version or whether there are more basic problems 
> with the implementation.
> How do I determine the version of the algorithm that was implemented in the 
> Commons Codec?
> How do we ensure that the algorithm is updated if/when the BMPM algorithm 
> changes?
> How do we ensure that the algorithm as coded in the Commons Codec is accurate 
> and working as expected?



--
This message was sent by Atlassian JIRA
(v6.2#6252)
