[ https://issues.apache.org/jira/browse/CODEC-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080943#comment-13080943 ]
Matthew Pocock commented on CODEC-125:
--------------------------------------
Hi,
* Wouldn't it make sense to add surnames with accented characters to the
PhoneticEngineTest class, such as Schäffer (German), Győrössy (Hungarian),
and Mészáros (Hungarian)?
Yes. I'd love to see more names. I'm not a linguist of any kind, so I can only
work with the names people suggest.
* I know it won't increase the code coverage, but it would probably increase
the "resource coverage", if you know what I mean.
I know exactly what you're getting at. However, there are a great many rules.
It will be significant work to test each one of them.
* Something is still wrong with the performance.
* An interesting issue I see is that the current speed test uses almost 30MB
of memory creating 1.9m instances of a Rule anonymous inner class (see
attached). GC'ing these objects might explain the wild swings in performance.
* Wow. This must be due to lots of objects being generated. The #1 object
generated is String and #2 is our AppendableCharSequence.
My performance rewrite traded a lot of string creation for
AppendableCharSequence. This is because at each step, a processed prefix may
get applied to a rule that 'forks' it into a number of new alternatives. These
alternatives themselves may be 'forked' and so on. I can't think of a way to
reduce the number of these AppendableCharSequence objects. However, it may be
possible to reduce the per-instance cost and also to look at where all the
strings are coming from. Most of these things should be very short-lived, and
I'd hope that on Java 7, some of them would get stack-inlined away.
I'm firing up my profiler in 'memory' mode and will get back to you if I make
progress.
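To make the forking described above concrete, here is a minimal sketch, not
the actual codec code. It assumes StringBuilder as a stand-in for
AppendableCharSequence, and the class name ForkingSketch, the helper fork, and
the alternatives in main are all hypothetical, chosen only for illustration
(they are not real Beider-Morse rules). It shows why the number of
short-lived objects grows: every step copies each prefix once per alternative.

```java
import java.util.ArrayList;
import java.util.List;

public class ForkingSketch {

    // Hypothetical helper: extend every partial encoding with every
    // alternative offered by one rule. Allocates prefixes * alternatives
    // new buffers per step, which is where the object churn comes from.
    static List<StringBuilder> fork(List<StringBuilder> prefixes,
                                    String[] alternatives) {
        List<StringBuilder> next =
            new ArrayList<>(prefixes.size() * alternatives.length);
        for (StringBuilder prefix : prefixes) {
            for (String alternative : alternatives) {
                // Copy the prefix so sibling alternatives do not share state.
                next.add(new StringBuilder(prefix).append(alternative));
            }
        }
        return next;
    }

    public static void main(String[] args) {
        List<StringBuilder> partial = new ArrayList<>();
        partial.add(new StringBuilder()); // start with one empty prefix

        // Made-up alternatives per step, for illustration only.
        String[][] steps = { {"s", "S"}, {"e"}, {"f", "v"} };
        for (String[] alternatives : steps) {
            partial = fork(partial, alternatives);
        }
        System.out.println(partial); // prints [sef, sev, Sef, Sev]
    }
}
```

In this sketch the intermediate lists become garbage as soon as the next step
completes, which matches the expectation that most of these objects are very
short-lived.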
> Implement a Beider-Morse phonetic matching codec
> ------------------------------------------------
>
> Key: CODEC-125
> URL: https://issues.apache.org/jira/browse/CODEC-125
> Project: Commons Codec
> Issue Type: New Feature
> Reporter: Matthew Pocock
> Priority: Minor
> Attachments: Rule$4$1-All_Objects.html, acz.patch, bm-gg.diff,
> bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch,
> bmpm.patch, bmpm.patch, fixmeInvariant.patch, handleH.patch, majorFix.patch,
> performanceAndBugs.patch, testEncodeGna.patch
>
>
> I have implemented Beider Morse Phonetic Matching as a codec against the
> commons-codec svn trunk. I would like to contribute this to commons-codec.