[jira] [Commented] (CODEC-187) Beider Morse Phonetic Matching producing incorrect tokens

michael tobias (JIRA) Tue, 10 Jun 2014 18:54:25 -0700

    [ 
https://issues.apache.org/jira/browse/CODEC-187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027331#comment-14027331
 ]


michael tobias commented on CODEC-187:
--------------------------------------

In case this helps with debugging:

Here is a debug trace from Steve Morse's BMPM implementation of the algorithm, 
run with GENERIC, APPROX, and automatic language detection on the name "abram":


---------------------------------------------------------------------------------------------------------

applying language rules from (rulesany) to abram using languages 239840

char codes = [#61]a [#62]b [#72]r [#61]a [#6d]m

applying rule #248
   pattern=a
   lcontext=
   rcontext=
   subst=A
   result=A

applying rule #249
   pattern=b
   lcontext=
   rcontext=
   subst=B
   result=AB

applying rule #265
   pattern=r
   lcontext=
   rcontext=
   subst=r
   result=ABr

applying rule #248
   pattern=a
   lcontext=
   rcontext=
   subst=A
   result=ABrA

applying rule #27
   pattern=m
   lcontext=[aeiouy]
   rcontext=
   subst=(m|n[16448])
   result=(ABrAm[239840]|ABrAn[64])

after language rules: (ABrAm[239840]|ABrAn[64])
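
The bracketed numbers in the trace read naturally as language bitmasks: the
name starts with languages 239840, the "n" alternative of rule #27 is tagged
[16448], and the result is ABrAn[64]. A minimal Python sketch of that
intersection, assuming the bitmask interpretation holds (the helper name
restrict is made up for illustration and is not taken from Morse's code or
from Commons Codec):

```python
def restrict(current_mask: int, rule_mask: int) -> int:
    """Keep only the languages allowed by both the name and the rule alternative."""
    return current_mask & rule_mask

ALL_DETECTED = 239840  # "using languages 239840" in the trace
N_VARIANT = 16448      # the [16448] on the "n" alternative of rule #27

n_mask = restrict(ALL_DETECTED, N_VARIANT)
print(n_mask)  # 64, matching ABrAn[64]
```

The "m" alternative carries no bracket, so under this reading it keeps the
full mask, which is why the trace shows ABrAm[239840].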


applying final rules from (exactapproxcommon plus approxcommon) to ABrAm[239840]
no rules match for phonetic item 0 at position 0: A
no rules match for phonetic item 0 at position 1: AB
no rules match for phonetic item 0 at position 2: ABr
no rules match for phonetic item 0 at position 3: ABrA
no rules match for phonetic item 0 at position 4: ABrAm

applying final rules from (exactapproxcommon plus approxcommon) to ABrAn[64]
no rules match for phonetic item 1 at position 0: A
no rules match for phonetic item 1 at position 1: AB
no rules match for phonetic item 1 at position 2: ABr
no rules match for phonetic item 1 at position 3: ABrA
no rules match for phonetic item 1 at position 4: ABrAn

applying final rules from (approxany) to ABrAm[239840]
after applying final rule #60 to phonetic item #0 at position 0: 
(a[239840]|o[239840]|Y[128]) pattern=A lcontext= rcontext= subst=(a|o|Y[128])
after applying final rule #3 to phonetic item #0 at position 1: 
(ab[239840]|av[131072]|ob[239840]|ov[131072]|Yb[128])
 pattern=B lcontext= rcontext= subst=(b|v[131072])
no rules match for phonetic item 0 at position 2: 
(ab[239840]|av[131072]|ob[239840]|ov[131072]|Yb[128])r
after applying final rule #56 to phonetic item #0 at position 3: 
(abra[239840]|abro[239840]|avra[131072]|avro[131072]|obra[239840]|obro[239840]|ovra[131072]|ovro[131072]|Ybra[128]|Ybro[128])
 pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
no rules match for phonetic item 0 at position 4: 
(abra[239840]|abro[239840]|avra[131072]|avro[131072]|obra[239840]|obro[239840]|ovra[131072]|ovro[131072]|Ybra[128]|Ybro[128])m

applying final rules from (approxany) to ABrAn[64]
after applying final rule #60 to phonetic item #1 at position 0: 
(a[239840]|o[239840]|Y[128]) pattern=A lcontext= rcontext= subst=(a|o|Y[128])
after applying final rule #3 to phonetic item #1 at position 1: 
(ab[239840]|av[131072]|ob[239840]|ov[131072]|Yb[128])
 pattern=B lcontext= rcontext= subst=(b|v[131072])
no rules match for phonetic item 1 at position 2: 
(ab[239840]|av[131072]|ob[239840]|ov[131072]|Yb[128])r
after applying final rule #56 to phonetic item #1 at position 3: 
(abra[239840]|abro[239840]|avra[131072]|avro[131072]|obra[239840]|obro[239840]|ovra[131072]|ovro[131072]|Ybra[128]|Ybro[128])
 pattern=A lcontext= rcontext=[fklmnprst]$ subst=(a|o)
no rules match for phonetic item 1 at position 4: 
(abra[239840]|abro[239840]|avra[131072]|avro[131072]|obra[239840]|obro[239840]|ovra[131072]|ovro[131072]|Ybra[128]|Ybro[128])n


(abram|abrom|avram|avrom|obram|obrom|ovram|ovrom|Ybram|Ybrom|abran|abron|obran|obron)
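
The final list can be reproduced from the substitutions shown in the trace by
expanding every alternative and dropping any combination whose language masks
share no bit. The following Python sketch does only that; it is not Morse's
implementation or the Commons Codec one. The names ANY, parse_subst, extend
and run are invented here, and the second item is started directly from mask
64, whereas the trace appears to apply that restriction at a later point.

```python
import re

ANY = -1  # hypothetical marker: the alternative has no language restriction

def parse_subst(subst):
    """Split '(a|o|Y[128])' into [('a', ANY), ('o', ANY), ('Y', 128)]."""
    body = subst[1:-1] if subst.startswith("(") else subst
    alts = []
    for alt in body.split("|"):
        m = re.fullmatch(r"([^\[\]]*)(?:\[(\d+)\])?", alt)
        text, mask = m.group(1), m.group(2)
        alts.append((text, int(mask) if mask else ANY))
    return alts

def extend(items, subst):
    """Append each alternative to each partial result, intersecting the masks."""
    out = []
    for text, mask in items:
        for alt_text, alt_mask in parse_subst(subst):
            merged = mask & alt_mask  # ANY (-1) leaves the mask unchanged
            if merged:                # no common language: drop the combination
                out.append((text + alt_text, merged))
    return out

# Substitutions the trace reports for A, B, r, A (rules #60, #3, none, #56).
STEPS = ["(a|o|Y[128])", "(b|v[131072])", "r", "(a|o)"]

def run(start_mask, last_letter):
    items = [("", start_mask)]
    for subst in STEPS + [last_letter]:
        items = extend(items, subst)
    return [text for text, _ in items]

tokens = run(239840, "m") + run(64, "n")
print("(" + "|".join(tokens) + ")")
```

Under these assumptions the printed list matches the trace's final output,
including the absence of avran, avron, Ybran and Ybron, whose masks (131072
and 128) do not intersect 64.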
 


> Beider Morse Phonetic Matching producing incorrect tokens
> ---------------------------------------------------------
>
>                 Key: CODEC-187
>                 URL: https://issues.apache.org/jira/browse/CODEC-187
>             Project: Commons Codec
>          Issue Type: Bug
>    Affects Versions: 1.9
>            Reporter: michael tobias
>            Priority: Minor
>
> I believe the Beider Morse Phonetic Matching algorithm was added in Commons 
> Codec 1.6.
> The BMPM algorithm is an EVOLVING algorithm that is currently at version 3.02, 
> though it had been static since version 3.01, dated 19 Dec 2011 (it was first 
> available as open source as version 1.00 on 6 May 2009).
> I can see nothing in the Commons Codec docs to say which version of BMPM was 
> implemented, so I am not sure whether the problem with the algorithm as coded 
> in Codec is simply that it is based on an old version or whether there are 
> more basic problems with the implementation.
> How do I determine the version of the algorithm that was implemented in the 
> Commons Codec?
> How do we ensure that the algorithm is updated if/when the BMPM algorithm 
> changes?
> How do we ensure that the algorithm as coded in the Commons Codec is accurate 
> and working as expected?
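
One way to approach the accuracy question is to compare token sets: take the
reference output for a name and report which tokens another implementation
misses or adds. A rough Python sketch follows. The reference string is the
final line of the trace above; the candidate string is invented purely to
show the comparison and is NOT real Commons Codec output, and the function
names are hypothetical.

```python
def parse_tokens(encoded):
    """Turn '(abram|abrom)' or 'abram|abrom' into a set of tokens."""
    return {t for t in encoded.strip().strip("()").split("|") if t}

def diff_tokens(reference, candidate):
    """Return (tokens missing from candidate, tokens only in candidate)."""
    ref, cand = parse_tokens(reference), parse_tokens(candidate)
    return sorted(ref - cand), sorted(cand - ref)

# Final output of the reference trace for "abram".
reference = ("(abram|abrom|avram|avrom|obram|obrom|ovram|ovrom|"
             "Ybram|Ybrom|abran|abron|obran|obron)")

# Invented example input, for illustration only.
candidate = ("(abram|abrom|avram|avrom|obram|obrom|ovram|ovrom|"
             "abran|abron|obran|obron|ybram)")

missing, extra = diff_tokens(reference, candidate)
print("missing:", missing)  # ['Ybram', 'Ybrom']
print("extra:", extra)      # ['ybram']
```

Comparing sets rather than strings keeps the check independent of token
order, which may differ between implementations.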



