[
https://issues.apache.org/jira/browse/CODEC-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083179#comment-13083179
]
Matthew Pocock commented on CODEC-125:
--------------------------------------
The Lucene/Solr source does not seem to reference StringEncoder; all references
go via Encoder. org.apache.lucene.analysis.phonetic.PhoneticFilter contains
this code:
    String value = termAtt.toString();
    String phonetic = null;
    try {
        String v = encoder.encode(value).toString();
        if (v.length() > 0 && !value.equals(v)) phonetic = v;
    } catch (Exception ignored) {} // just use the direct text
In org.apache.solr.analysis.PhoneticFilterFactory, there's a registry map of
encoders listing the (known) StringEncoder instances but typed to Encoder.
I agree that encoding sets of results as single strings is not ideal, and in an
ideal world I would prefer to see CharSequence -> Set<CharSequence> as the
interface. This would not only be more correct, but would also lower the
computational overhead, as we could use HashSet internally throughout, which is
much more efficient for working with string-keyed sets than TreeSet is. This
encoder is not unique in producing multiple outputs - it's a feature shared by
most sounds-like phonetic encodings, and if you look at the Lucene code above,
none of these encoding strategies will be handled gracefully by Lucene.
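
As a rough illustration of the set-returning interface described above, here is
a minimal sketch. Everything in it is hypothetical: MultiValueEncoder and
ToyEncoder are invented names and are not part of commons-codec, the toy rule
(treating "ph" as "f") is not a real phonetic algorithm, and "two names match
when their encoding sets overlap" is an assumption about how a caller might use
the result. The sketch returns Set<String> rather than Set<CharSequence>
because CharSequence does not define an equals/hashCode contract, which a
HashSet relies on.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class MultiValueEncoderSketch {

    /** Hypothetical set-returning encoder; not an existing commons-codec interface. */
    interface MultiValueEncoder {
        Set<String> encode(CharSequence input);
    }

    /**
     * Toy stand-in for a real phonetic algorithm: emits the lower-cased input
     * and, if it contains "ph", a second variant with "ph" replaced by "f".
     */
    static final class ToyEncoder implements MultiValueEncoder {
        public Set<String> encode(CharSequence input) {
            Set<String> out = new HashSet<String>();
            String s = input.toString().toLowerCase(Locale.ROOT);
            out.add(s);
            if (s.contains("ph")) {
                out.add(s.replace("ph", "f"));
            }
            return Collections.unmodifiableSet(out);
        }
    }

    /** Assumed matching rule: two inputs match when their encoding sets overlap. */
    static boolean soundsAlike(MultiValueEncoder enc, CharSequence a, CharSequence b) {
        return !Collections.disjoint(enc.encode(a), enc.encode(b));
    }

    public static void main(String[] args) {
        MultiValueEncoder enc = new ToyEncoder();
        // "Stephan" yields two encodings, one of which equals the encoding of "stefan".
        System.out.println(soundsAlike(enc, "Stephan", "stefan")); // prints "true"
    }
}
```

With a set as the return type, the caller can decide how to index or compare
each encoding, instead of having to parse them back out of one string.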
The output strings right now are alphabetised, so the resulting string is
always in canonical form. It is not possible for it to produce different
strings for the same set.
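
The canonical-form property can be sketched as follows. This is not the bmpm
implementation: canonicalJoin is a hypothetical helper, and the "|" separator
is an assumption made for illustration. The point is only that sorting before
joining makes the output independent of the order in which the encodings were
produced.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

public class CanonicalJoinSketch {

    /**
     * Hypothetical helper: sorts the encodings into natural (alphabetical)
     * order and joins them with '|' (separator assumed for illustration).
     */
    static String canonicalJoin(Set<String> encodings) {
        return String.join("|", new TreeSet<String>(encodings));
    }

    public static void main(String[] args) {
        // Same two encodings, inserted in opposite orders.
        Set<String> first = new LinkedHashSet<String>(Arrays.asList("stephan", "stefan"));
        Set<String> second = new LinkedHashSet<String>(Arrays.asList("stefan", "stephan"));

        System.out.println(canonicalJoin(first));  // prints "stefan|stephan"
        System.out.println(canonicalJoin(second)); // prints "stefan|stephan"
    }
}
```

Because equal sets always produce the same string, comparing two joined
strings for equality is equivalent to comparing the underlying sets.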
The timeline I'd prefer to see is:
a) a release of commons-codec with the bmpm code, and work with lucene to
provide a follow-on lucene/solr release using this, so as to make the codec
available to people at the earliest opportunity
b) bump a big version number for codec and make it fully Java 5 compliant, with
generics throughout both the internals and public interfaces, together with any
refactoring that may entail
c) encourage lucene/solr to put out a further release against this, handling
multiple encodings, and if needed helping them to adapt to the new interfaces
> Implement a Beider-Morse phonetic matching codec
> ------------------------------------------------
>
> Key: CODEC-125
> URL: https://issues.apache.org/jira/browse/CODEC-125
> Project: Commons Codec
> Issue Type: New Feature
> Reporter: Matthew Pocock
> Priority: Minor
> Attachments: Rule$4$1-All_Objects.html, acz.patch, bm-gg.diff,
> bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch,
> bmpm.patch, bmpm.patch, comparator.patch, fightingMemoryChurn.patch,
> fightingMemoryChurn.patch, fixmeInvariant.patch, handleH.patch,
> majorFix.patch, performanceAndBugs.patch, testAllChars-mem-profile.html,
> testEncodeGna.patch
>
>
> I have implemented Beider Morse Phonetic Matching as a codec against the
> commons-codec svn trunk. I would like to contribute this to commons-codec.