[
https://issues.apache.org/jira/browse/CODEC-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081106#comment-13081106
]
Matthew Pocock commented on CODEC-125:
--------------------------------------
After turning off all the 'helpful' filters, I've nailed the allocation
hotspot. It's allocating literally millions of String instances, with the vast
majority of them coming via:
String.subSequence
Rule$AppendableCharSequence.subSequence
CharSequence.subSequence
patternAndContextMatches (lines 742, 743, 744)
Looking at these lines, we have:
boolean patternMatches = input.subSequence(i, ipl).equals(this.pattern);
boolean rContextMatches = this.rContext.matcher(input.subSequence(ipl, input.length())).find();
boolean lContextMatches = this.lContext.matcher(input.subSequence(0, i)).find();
This is being called for each and every attempt at matching a rule to an offset
within the CharSequence. I have a hunch that many, if not most, of these calls
split the string into the same pieces, so we may be able to trade some memory
churn for memoisation/caching.
In PhoneticEngine, I've added cacheSubSequence(), which returns a CharSequence
that caches subSequence() calls. This has greatly reduced the number of String
instances in play, pushing String well down the list. The dominant allocations
are now the RMatcher instances generated in RPattern.pattern(Rule #508), used
to match the left/right context patterns against the input.
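For reference, the caching wrapper can be sketched roughly like this (the class
and field names here are my own illustration, not necessarily what the attached
patch does):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a subSequence-caching CharSequence. Repeated calls
// with the same (start, end) pair reuse one object instead of allocating a
// fresh String each time.
final class CachingCharSequence implements CharSequence {
    private final CharSequence delegate;
    // Keyed on (start, end) packed into one long.
    private final Map<Long, CharSequence> cache = new HashMap<>();

    CachingCharSequence(CharSequence delegate) {
        this.delegate = delegate;
    }

    @Override public int length() { return delegate.length(); }
    @Override public char charAt(int index) { return delegate.charAt(index); }
    @Override public String toString() { return delegate.toString(); }

    @Override
    public CharSequence subSequence(int start, int end) {
        Long key = ((long) start << 32) | end;
        CharSequence cached = cache.get(key);
        if (cached == null) {
            cached = delegate.subSequence(start, end);
            cache.put(key, cached);
        }
        return cached;
    }
}
```

With this in place, the three subSequence() calls in patternAndContextMatches
allocate once per distinct range rather than once per rule application.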
The RPattern instances are made once, when the rule is parsed from the resource.
The nested RMatcher instances are made once per rule application. However, the
RMatcher instances are effectively stateless once the input string to match
against is known. So I've added a shared FalseRMatcher instance and an
RMatcher matcherFor(boolean b) method, entirely eliminating allocations here.
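The idea is roughly the following (a minimal sketch with hypothetical shapes;
the patch's actual RMatcher/FalseRMatcher declarations may differ):

```java
// Once the match outcome against a given input is known, the matcher carries
// no state, so one shared instance per outcome suffices.
interface RMatcher {
    boolean find();
}

final class Matchers {
    private static final RMatcher TRUE_MATCHER = () -> true;
    private static final RMatcher FALSE_MATCHER = () -> false;

    // Returns a shared, pre-built matcher instead of allocating a new one
    // per rule application.
    static RMatcher matcherFor(boolean b) {
        return b ? TRUE_MATCHER : FALSE_MATCHER;
    }
}
```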
Now there is no appreciable memory churn. For shorter inputs, HashMap$Entry
dominates, because LanguageSet.restrictTo is overly eager to build a new set
just to call retainAll(). I've optimised this out as well.
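The restrictTo change amounts to something like this (a sketch using a plain
Set<String> stand-in for LanguageSet; the helper name and shape are my own):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the restrictTo optimisation: only allocate a new
// set (and its HashMap$Entry nodes) when the intersection actually shrinks.
final class Langs {
    static Set<String> restrictTo(Set<String> self, Set<String> other) {
        if (other.containsAll(self)) {
            return self;          // already within the restriction: no copy
        }
        Set<String> result = new HashSet<>(self);
        result.retainAll(other);  // copy only when it really changes
        return result;
    }
}
```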
All the tests now run faster for me, with no noticeable memory churn. The new
deterministic testSpeedCheck appears to be a pathological case that causes the
algorithmic complexity to go 'boom'. However, I've added versions 2 and 3 of
this test, one for the alphabet and one for an English phrase, and both appear
to process quickly.
> Implement a Beider-Morse phonetic matching codec
> ------------------------------------------------
>
> Key: CODEC-125
> URL: https://issues.apache.org/jira/browse/CODEC-125
> Project: Commons Codec
> Issue Type: New Feature
> Reporter: Matthew Pocock
> Priority: Minor
> Attachments: Rule$4$1-All_Objects.html, acz.patch, bm-gg.diff,
> bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch, bmpm.patch,
> bmpm.patch, bmpm.patch, fightingMemoryChurn.patch, fixmeInvariant.patch,
> handleH.patch, majorFix.patch, performanceAndBugs.patch,
> testAllChars-mem-profile.html, testEncodeGna.patch
>
>
> I have implemented Beider Morse Phonetic Matching as a codec against the
> commons-codec svn trunk. I would like to contribute this to commons-codec.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira