[ 
https://issues.apache.org/jira/browse/CODEC-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083179#comment-13083179
 ] 

Matthew Pocock commented on CODEC-125:
--------------------------------------

The Lucene/Solr source does not seem to reference StringEncoder; all references 
go via Encoder. org.apache.lucene.analysis.phonetic.PhoneticFilter contains 
this code:

String value = termAtt.toString();
String phonetic = null;
try {
 String v = encoder.encode(value).toString();
 if (v.length() > 0 && !value.equals(v)) phonetic = v;
} catch (Exception ignored) {} // just use the direct text

In org.apache.solr.analysis.PhoneticFilterFactory, there's a registry map of 
encoders listing the (known) StringEncoder instances but typed to Encoder.

I agree that encoding sets of results as single strings is not ideal. In an 
ideal world I would prefer to see CharSequence -> Set<CharSequence> as the 
interface. This would not only be more correct, but would also lower the 
computational overhead, as we could use HashSet internally throughout, which is 
generally more efficient for string-keyed sets than TreeSet (expected 
constant-time rather than logarithmic-time lookups and inserts). This encoder 
is not unique in producing multiple outputs: it's a feature shared by most 
sounds-like phonetic encodings, and, as the Lucene code above shows, none of 
these encoding strategies will be handled gracefully by Lucene.
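
To make the set-valued interface concrete, here is a minimal sketch of what it 
might look like. MultiEncoder, encodeAll and the toy encoder are hypothetical 
names invented for this illustration; none of them exist in commons-codec or 
Lucene. The sketch returns Set<String> rather than Set<CharSequence> because 
CharSequence does not refine the equals/hashCode contract, so arbitrary 
CharSequence implementations are not reliable HashSet members.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical interface for this sketch; not part of commons-codec.
interface MultiEncoder {
    // Returns every encoding of the input as a set, not one packed string.
    Set<String> encodeAll(CharSequence input);
}

class MultiEncoderDemo {
    // Toy stand-in for a real phonetic algorithm (an assumption for the demo):
    // it emits the lower-cased input plus a variant with "ph" written as "f".
    static final MultiEncoder TOY = new MultiEncoder() {
        public Set<String> encodeAll(CharSequence input) {
            Set<String> out = new HashSet<String>();
            String s = input.toString().toLowerCase(Locale.ROOT);
            out.add(s);
            out.add(s.replace("ph", "f"));
            return out;
        }
    };

    public static void main(String[] args) {
        // A caller can treat each encoding as its own term.
        for (String code : TOY.encodeAll("Philip")) {
            System.out.println(code);
        }
    }
}
```

With an interface along these lines, a consumer such as a token filter could 
emit one token per set element instead of parsing a delimited string.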

The output strings right now are alphabetised, so the resulting string is 
always in canonical form. It is not possible for it to produce different 
strings for the same set.
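
The canonical-form property can be illustrated with a short sketch. 
CanonicalJoin and its join method are hypothetical helpers, and the '|' 
separator is an assumption made for the example; this is not the bmpm code 
itself. The point is only that sorting before joining makes the string depend 
on the set's contents and not on insertion or iteration order.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of commons-codec.
class CanonicalJoin {
    // Joins the encodings in sorted order so that equal sets always
    // produce equal strings. The '|' separator is assumed for this sketch.
    static String join(Set<String> encodings) {
        StringBuilder sb = new StringBuilder();
        for (String s : new TreeSet<String>(encodings)) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same elements, different insertion order.
        Set<String> a = new HashSet<String>(Arrays.asList("zats", "sats", "tsats"));
        Set<String> b = new HashSet<String>(Arrays.asList("tsats", "zats", "sats"));
        System.out.println(join(a));
        System.out.println(join(a).equals(join(b)));
    }
}
```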

The timeline I'd prefer to see is:

a) a release of commons-codec with the bmpm code, and work with lucene to 
provide a follow-on lucene/solr release using this, so as to make the codec 
available to people at the earliest opportunity

b) bump a big version number for codec and make it fully Java 5 compliant, with 
generics throughout both the internals and public interfaces, together with any 
refactoring that may entail

c) encourage lucene/solr to put out a further release against this, handling 
multiple encodings, and if needed helping them to adapt to the new interfaces

> Implement a Beider-Morse phonetic matching codec
> ------------------------------------------------
>
>                 Key: CODEC-125
>                 URL: https://issues.apache.org/jira/browse/CODEC-125
>             Project: Commons Codec
>          Issue Type: New Feature
>            Reporter: Matthew Pocock
>            Priority: Minor
>         Attachments: Rule$4$1-All_Objects.html, acz.patch, bm-gg.diff, 
> bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, 
> bmpm.patch, bmpm.patch, comparator.patch, fightingMemoryChurn.patch, 
> fightingMemoryChurn.patch, fixmeInvariant.patch, handleH.patch, 
> majorFix.patch, performanceAndBugs.patch, testAllChars-mem-profile.html, 
> testEncodeGna.patch
>
>
> I have implemented Beider Morse Phonetic Matching as a codec against the 
> commons-codec svn trunk. I would like to contribute this to commons-codec.


        
