[jira] [Commented] (CODEC-125) Implement a Beider-Morse phonetic matching codec

Matthew Pocock (JIRA) Fri, 01 Jul 2011 07:58:53 -0700

    [ 
https://issues.apache.org/jira/browse/CODEC-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058586#comment-13058586
 ]


Matthew Pocock commented on CODEC-125:
--------------------------------------

I have renamed the bmpm package to bm. Do you want me to move 
BeiderMorseEncoder into the bm package? I put it in the language package 
because that is where all the other encoders are, and I presume that having 
them in that package lets them be picked up automatically by things like the 
Lucene configuration files. However, I put everything else in bm because it is 
specific to the bmpm method and is worth keeping publicly visible: you can do 
some custom things with it that are not reasonable to expose through the codec. 
It also has no relevance to the other codecs, so I didn't want to clutter up 
the primary package.

I've applied the patch on Ubuntu to a clean checkout of commons-codec. At first 
this failed to pass all the tests, because patch did not create the empty files 
listed in the patch in the source tree. I did not know that patch behaved like 
this. I've now put a comment in every otherwise empty file, and on Ubuntu the 
patch applies cleanly to commons-codec and results in a project that builds 
without errors.

I then made a clean checkout of commons-codec on Windows 7 and applied the 
revised patch using TortoiseSVN. When I build this, I get errors. It looks like 
Windows is mangling the Unicode text files while the patch is applied. You 
said that you were seeing '?' characters in the text files. There are no such 
characters in the original text or in the patch file, so I think this 
indicates that the text got mangled during patch application. After applying 
the patch on Windows using TortoiseSVN, in lang.txt I see '?' in place of each 
Cyrillic, Greek, Hebrew and Arabic character. In the original file on Windows 
I see various symbols. When I look at the patch file directly in Windows, I see 
symbols. I've looked at lang.txt in the TortoiseMerge tool, and regardless of 
what I set the default encoding to, the interesting Unicode characters are 
mangled to '?'.
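
For illustration, the same '?' substitution can be reproduced in plain Java by 
encoding text with a charset that cannot represent it. This is only a sketch of 
the kind of conversion that may be happening during patch application, not a 
diagnosis of what TortoiseSVN actually does. The class and method names are 
hypothetical and are not part of the patch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of commons-codec or the patch.
final class CharsetRoundTrip {

    // Encode the text with the given charset and decode it again.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for US-ASCII.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Three Cyrillic letters, written as escapes so that the source
        // file itself stays pure ASCII.
        String cyrillic = "\u0430\u0431\u0432";

        // UTF-8 can represent the letters, so the round trip is lossless.
        System.out.println(cyrillic.equals(roundTrip(cyrillic, StandardCharsets.UTF_8))); // true

        // US-ASCII cannot, so each letter comes back as '?'.
        System.out.println(roundTrip(cyrillic, StandardCharsets.US_ASCII)); // ???
    }
}
```

If some step in the toolchain does a conversion like the US-ASCII case, the 
original characters are lost for good, which would match what I see in 
lang.txt.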

I've run out of ideas about how to apply the patch on Windows. What tool were 
you using to apply the patch? Can you tell it that the patch file is UTF-8?



> Implement a Beider-Morse phonetic matching codec
> ------------------------------------------------
>
>                 Key: CODEC-125
>                 URL: https://issues.apache.org/jira/browse/CODEC-125
>             Project: Commons Codec
>          Issue Type: New Feature
>            Reporter: Matthew Pocock
>            Priority: Minor
>         Attachments: bm-gg.diff, bmpm.patch, bmpm.patch
>
>
> I have implemented Beider Morse Phonetic Matching as a codec against the 
> commons-codec svn trunk. I would like to contribute this to commons-codec.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
