Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
------------------------------------------------------------------------------
Key: OPENNLP-421
URL: https://issues.apache.org/jira/browse/OPENNLP-421
Project: OpenNLP
Issue Type: Bug
Components: Name Finder
Affects Versions: tools-1.5.2-incubating
Environment: RedHat 5, JDK 1.6.0_29
Reporter: Jay Hacker
Priority: Minor
The current implementation of
[StringList|https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup]
calls `intern()` on every String, presumably to reduce memory usage for duplicate tokens. Interned Strings are stored in the JVM's permanent generation, which has a small fixed size (it seems to be about 83 MB on modern 64-bit JVMs: [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]). Once a large dictionary fills the intern pool, the JVM fails with an `OutOfMemoryError: PermGen space`.
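The pattern in question looks roughly like this (a minimal sketch of the constructor, with simplified names; not the actual source):
{code}
// Minimal sketch of the interning pattern described above; field and
// parameter names are simplified, not copied from the real class.
public class StringList {

    private final String[] tokens;

    public StringList(String... tokens) {
        this.tokens = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            // Every token goes into the JVM-wide intern pool, which
            // lives in the fixed-size permanent generation.
            this.tokens[i] = tokens[i].intern();
        }
    }
}
{code}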
The size of the PermGen can be increased with the `-XX:MaxPermSize=` option to
the JVM. However, this option is non-standard and not well known, and it would
be nice if OpenNLP worked out of the box without deep JVM tuning.
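For illustration, a larger PermGen can be requested at launch like so (the application class name here is hypothetical, and 256m is an arbitrary value):
{code}
java -XX:MaxPermSize=256m -cp opennlp-tools-1.5.2-incubating.jar:. MyNameFinderApp
{code}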
This immediate problem could be fixed by simply not interning Strings. Looking
at the `Dictionary` and `DictionaryNameFinder` code as a whole, however, there
is a huge amount of room for performance improvement. Currently,
`DictionaryNameFinder.find` works something like this:
{code}
for every token in every tokenlist in the dictionary:
    copy it into a "meta dictionary" of single tokens

for every possible subsequence of tokens in the sentence:  // of which there are O(N^2)
    copy the subsequence into a new array
    if the last token is in the "meta dictionary":
        make a StringList from the tokens
        look it up in the dictionary
{code}
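In Java, that works out to roughly the following (a simplified sketch with condensed names, not the actual implementation, but the same shape and the same per-candidate allocations):
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import opennlp.tools.dictionary.Dictionary;
import opennlp.tools.util.Span;
import opennlp.tools.util.StringList;

class DictionaryFindSketch {

    // Simplified sketch of the find() logic described above.
    static Span[] find(String[] tokens, Dictionary dict) {
        // Build the "meta dictionary" of individual tokens.
        Set<String> metaDict = new HashSet<String>();
        for (StringList entry : dict) {
            for (int i = 0; i < entry.size(); i++) {
                metaDict.add(entry.getToken(i));
            }
        }

        List<Span> found = new ArrayList<Span>();
        // Enumerate all O(N^2) subsequences of the sentence.
        for (int start = 0; start < tokens.length; start++) {
            for (int end = start + 1; end <= tokens.length; end++) {
                if (metaDict.contains(tokens[end - 1])) {
                    // A fresh array and a fresh StringList per candidate.
                    String[] candidate = Arrays.copyOfRange(tokens, start, end);
                    if (dict.contains(new StringList(candidate))) {
                        found.add(new Span(start, end));
                    }
                }
            }
        }
        return found.toArray(new Span[found.size()]);
    }
}
{code}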
`Dictionary` itself is very heavyweight: it's a `Set<StringListWrapper>`, which wraps `StringList`, which wraps a `String[]`. Every entry in the dictionary therefore requires at least four allocated objects (in addition to the Strings themselves): the `String[]`, the `StringList`, the `StringListWrapper`, and the `HashMap.Entry`. Even `put` and `remove` allocate new objects!
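Concretely, the layering presumably looks something like this (a sketch only; the field name and method bodies are assumed from the behavior described above):
{code}
import java.util.HashSet;
import java.util.Set;

// Sketch of the layering described above; names are assumed where the
// source wasn't quoted. Note that both calls below allocate a wrapper
// object even when the entry is already present (or absent).
public class Dictionary {

    private final Set<StringListWrapper> entrySet =
            new HashSet<StringListWrapper>();

    public void put(StringList tokens) {
        entrySet.add(new StringListWrapper(tokens));    // new object per call
    }

    public void remove(StringList tokens) {
        entrySet.remove(new StringListWrapper(tokens)); // and here too
    }
}
{code}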
From this comment in `DictionaryNameFinder`:
{code}
// TODO: improve performance here
{code}
It seems like improvements would be welcome. :) Removing some of this object overhead would more than make up for no longer interning Strings. Should I create a new Jira ticket to propose a more efficient design?
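To make that concrete, here is one possible lighter-weight direction, purely as an illustration (the class, delimiter choice, and method names are all mine): flatten each entry into a single delimited String, so an entry costs one object and a lookup is one `HashSet` probe with no wrappers.
{code}
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only, not a settled proposal.
class FlatDictionary {

    // NUL is assumed not to occur inside tokens.
    private static final char SEP = '\u0000';

    private final Set<String> entries = new HashSet<String>();

    private static String join(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) sb.append(SEP);
            sb.append(tokens[i]);
        }
        return sb.toString();
    }

    void put(String[] tokens) {
        entries.add(join(tokens));      // one String per entry, no wrappers
    }

    boolean contains(String[] tokens) {
        return entries.contains(join(tokens));
    }
}
{code}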