[ https://issues.apache.org/jira/browse/CODEC-132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221933#comment-13221933 ]
Thomas Neidhart commented on CODEC-132:
---------------------------------------
I dug into this problem, and it is not related to punctuation or other
special characters.
There are some generic rules defined that blow up the set of possible
phonemes, e.g.:
"a" "" "" "(e|o|a)" // hat | call | part
Since you provide random data as input, this single rule will most likely
match every single 'a' in the input and triple the set of phonemes on each
occurrence. This quickly leads to very large sets and, of course, to OOMs.
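To make the growth concrete, here is a minimal sketch in plain Java. It does not use the Commons Codec classes; the class name PhonemeExplosionDemo and its expand method are made up for illustration, and the only rule modelled is the "a" rule quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the rule engine, not the Commons Codec implementation.
class PhonemeExplosionDemo {

    // Expands the input the way the single rule "a" -> "(e|o|a)" would:
    // every 'a' forks each partial result three ways, other characters pass through.
    static List<String> expand(String input) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char c : input.toCharArray()) {
            String[] alternatives = (c == 'a')
                    ? new String[] {"e", "o", "a"}
                    : new String[] {String.valueOf(c)};
            List<String> next = new ArrayList<>(results.size() * alternatives.length);
            for (String prefix : results) {
                for (String alt : alternatives) {
                    next.add(prefix + alt);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(expand("banana").size());     // 3 'a's  -> 3^3  = 27
        System.out.println(expand("aaaaaaaaaa").size()); // 10 'a's -> 3^10 = 59049
    }
}
```

With n occurrences of 'a' the result set has 3^n entries, so a random 30-character input does not need many matches before the heap is exhausted.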
I would not touch the rules, but instead add a parameter to the
PhoneticEngine that defines the maximum number of different phonemes wanted in
the result, and limit the number of new phonemes in PhonemeBuilder.apply to
this maximum.
For normal text, the number of phonemes is usually small anyway, so a default
of 20 sounds reasonable, but it should be user-controllable.
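A rough sketch of what such a cap could look like, again in plain Java rather than as a patch against PhoneticEngine. The names CappedExpansionDemo and maxPhonemes are assumptions chosen for illustration, and the rule is the same simplified "a" rule as above.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration of capping the result set, not the actual PhonemeBuilder code.
class CappedExpansionDemo {

    // Same expansion as the "a" -> "(e|o|a)" rule, but no more than maxPhonemes
    // alternatives are kept after each character is applied.
    static Set<String> expand(String input, int maxPhonemes) {
        Set<String> results = new LinkedHashSet<>();
        results.add("");
        for (char c : input.toCharArray()) {
            String[] alternatives = (c == 'a')
                    ? new String[] {"e", "o", "a"}
                    : new String[] {String.valueOf(c)};
            Set<String> next = new LinkedHashSet<>();
            outer:
            for (String prefix : results) {
                for (String alt : alternatives) {
                    if (next.size() >= maxPhonemes) {
                        break outer; // cap reached, drop the remaining alternatives
                    }
                    next.add(prefix + alt);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(expand("data", 20).size());       // 2 'a's -> 9, below the cap
        System.out.println(expand("aaaaaaaaaa", 20).size()); // capped at 20 instead of 59049
    }
}
```

The cap trades completeness for bounded memory: alternatives past the limit are simply dropped, so which ones survive depends on iteration order.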
By the way, you could also consider setting the parameter concat to false. In
that case each word is treated separately, which should mitigate the problem a
bit, as single words are shorter and thus suffer less from the phoneme
explosion.
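As a back-of-the-envelope comparison, the following sketch counts alternatives for the simplified "a" rule when the whole input is treated as one string versus word by word. The class ConcatComparisonDemo and its methods are hypothetical, and the assumption that per-word results simply add up is a simplification of what the engine does with concat set to false.

```java
// Hypothetical counting model, not the Commons Codec implementation.
class ConcatComparisonDemo {

    // Number of alternatives the single rule "a" -> "(e|o|a)" yields for a piece of text.
    static long alternativesFor(String text) {
        long count = 1;
        for (char c : text.toCharArray()) {
            if (c == 'a') {
                count *= 3;
            }
        }
        return count;
    }

    // All words handled as one string: the factors multiply across words.
    static long concatenated(String input) {
        return alternativesFor(input);
    }

    // Each word handled on its own: the per-word counts only add up.
    static long perWord(String input) {
        long total = 0;
        for (String word : input.trim().split("\\s+")) {
            total += alternativesFor(word);
        }
        return total;
    }

    public static void main(String[] args) {
        String input = "banana papaya";
        System.out.println(concatenated(input)); // 3^6 = 729
        System.out.println(perWord(input));      // 27 + 27 = 54
    }
}
```

A single long word made of random characters still explodes in this model, which is why this only mitigates the problem.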
> BeiderMorseEncoder OOM issues
> -----------------------------
>
> Key: CODEC-132
> URL: https://issues.apache.org/jira/browse/CODEC-132
> Project: Commons Codec
> Issue Type: Bug
> Affects Versions: 1.6
> Reporter: Robert Muir
> Attachments: CODEC-132_test.patch
>
>
> In Lucene/Solr, we integrated this encoder into the latest release.
> Our tests use a variety of random strings, and we have recent jenkins failures
> from some input streams (of length <= 10), using huge amounts of memory
> (e.g. > 64MB), resulting in OOM.
> I've created a test case (length is 30 here) that will OOM with -Xmx256M.
> I haven't dug into this much as to what's causing it, but I suspect there
> might be a bug revolving around certain punctuation characters: we didn't see
> this happening until we beefed up our random string generation to start
> producing "html-like" strings.