[
https://issues.apache.org/jira/browse/CODEC-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081106#comment-13081106
]
Matthew Pocock commented on CODEC-125:
--------------------------------------
After turning off all the 'helpful' filters, I've nailed the allocation
hotspot. It's allocating literally millions of String instances, with the vast
majority of them coming via:
String.subSequence
Rule$AppendableCharSequence.subSequence
CharSequence.subSequence
patternAndContextMatches (lines 742, 743, 744)
Looking at these lines, we have:
boolean patternMatches = input.subSequence(i, ipl).equals(this.pattern);
boolean rContextMatches = this.rContext.matcher(input.subSequence(ipl, input.length())).find();
boolean lContextMatches = this.lContext.matcher(input.subSequence(0, i)).find();
This is being called for each and every attempt at matching a rule to an offset
within the CharSequence. I have a hunch that many, if not most, of these calls
split the string into the same pieces, so we may be able to trade some memory
churn for memoisation/caching.
In PhoneticEngine, I've added cacheSubSequence(), which returns a CharSequence
that caches subSequence() calls. This has greatly reduced the number of String
instances in play, pushing String well down the list. The dominant allocations
are now the RMatcher instances generated in RPattern.pattern(Rule #508), used
to match the left/right context patterns against the input.
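For reference, the caching wrapper can be sketched roughly like this (the class
and field names here are my own illustration, not necessarily what the attached
patch does):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a subSequence-caching CharSequence. Repeated calls
// with the same (start, end) pair reuse one object instead of allocating a
// fresh String each time.
final class CachingCharSequence implements CharSequence {
    private final CharSequence delegate;
    // Keyed on (start, end) packed into one long.
    private final Map<Long, CharSequence> cache = new HashMap<>();

    CachingCharSequence(CharSequence delegate) {
        this.delegate = delegate;
    }

    @Override public int length() { return delegate.length(); }
    @Override public char charAt(int index) { return delegate.charAt(index); }
    @Override public String toString() { return delegate.toString(); }

    @Override
    public CharSequence subSequence(int start, int end) {
        Long key = ((long) start << 32) | end;
        CharSequence cached = cache.get(key);
        if (cached == null) {
            cached = delegate.subSequence(start, end);
            cache.put(key, cached);
        }
        return cached;
    }
}
```

With this in place, the three subSequence() calls in patternAndContextMatches
allocate once per distinct range rather than once per rule application.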
The RPattern instances are made once, when the rule is parsed from the resource.
The nested RMatcher instances are made once per rule application. However, the
RMatcher instances are effectively stateless once the input string to match
against is known. So I've added a shared FalseRMatcher instance and an
RMatcher matcherFor(boolean b) method, entirely eliminating allocations here.
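The idea is roughly the following (a minimal sketch with hypothetical shapes;
the patch's actual RMatcher/FalseRMatcher declarations may differ):

```java
// Once the match outcome against a given input is known, the matcher carries
// no state, so one shared instance per outcome suffices.
interface RMatcher {
    boolean find();
}

final class Matchers {
    private static final RMatcher TRUE_MATCHER = () -> true;
    private static final RMatcher FALSE_MATCHER = () -> false;

    // Returns a shared, pre-built matcher instead of allocating a new one
    // per rule application.
    static RMatcher matcherFor(boolean b) {
        return b ? TRUE_MATCHER : FALSE_MATCHER;
    }
}
```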
Now there is no appreciable memory churn. For shorter inputs, HashMap$Entry
dominates, because LanguageSet.restrictTo is overly eager to build a new set
just to call retainAll(). I've optimised this out as well.
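The restrictTo change amounts to something like this (a sketch using a plain
Set<String> stand-in for LanguageSet; the helper name and shape are my own):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the restrictTo optimisation: only allocate a new
// set (and its HashMap$Entry nodes) when the intersection actually shrinks.
final class Langs {
    static Set<String> restrictTo(Set<String> self, Set<String> other) {
        if (other.containsAll(self)) {
            return self;          // already within the restriction: no copy
        }
        Set<String> result = new HashSet<>(self);
        result.retainAll(other);  // copy only when it really changes
        return result;
    }
}
```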
All the tests now run faster for me, with no noticeable memory churn. The new
deterministic testSpeedCheck appears to be a pathological case that causes the
algorithmic complexity to go 'boom'. However, I've added versions 2 and 3 of
this test, one for the alphabet and one for an English phrase, and both appear
to process quickly.
> Implement a Beider-Morse phonetic matching codec
> ------------------------------------------------
>
> Key: CODEC-125
> URL: https://issues.apache.org/jira/browse/CODEC-125
> Project: Commons Codec
> Issue Type: New Feature
> Reporter: Matthew Pocock
> Priority: Minor
> Attachments: Rule$4$1-All_Objects.html, acz.patch, bm-gg.diff,
> bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch,
> bmpm.patch, bmpm.patch, fightingMemoryChurn.patch, fixmeInvariant.patch,
> handleH.patch, majorFix.patch, performanceAndBugs.patch,
> testAllChars-mem-profile.html, testEncodeGna.patch
>
>
> I have implemented Beider Morse Phonetic Matching as a codec against the
> commons-codec svn trunk. I would like to contribute this to commons-codec.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira