[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jay Hacker updated OPENNLP-421:
-------------------------------

    Description: 
The current implementation of {{StringList}}:

https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
 calls {{intern()}} on every String.  Presumably this is an attempt to reduce 
memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
permanent generation, which has a small fixed size (roughly 83 MB on 
modern 64-bit JVMs: 
[http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
  Once this fills up, the JVM crashes with an {{OutOfMemoryError: PermGen space}}.

The size of the PermGen can be increased with the {{-XX:MaxPermSize=}} option to 
the JVM.  However, this option is non-standard and not well known; it would 
be nice if OpenNLP worked out of the box without deep JVM tuning.
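For illustration, here is a minimal sketch of what interning does (plain JDK code, not OpenNLP's): equal strings are collapsed into one canonical instance in the JVM-wide intern pool, which lives in PermGen on current JVMs, so a dictionary with millions of distinct tokens steadily fills that pool.

```java
// Minimal illustration of String.intern(). Distinct String objects with
// equal contents are collapsed into a single canonical pooled instance;
// every *unique* token interned adds a new entry to the pool.
public class InternDemo {
    public static void main(String[] args) {
        String a = new String("token");   // a fresh heap object
        String b = new String("token");   // another, distinct heap object
        System.out.println(a == b);                   // false: different objects
        System.out.println(a.intern() == b.intern()); // true: same pooled instance
    }
}
```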


This immediate problem could be fixed by simply not interning Strings.  Looking 
at the {{Dictionary}} and {{DictionaryNameFinder}} code as a whole, however, there is a 
huge amount of room for performance improvement.  Currently, 
{{DictionaryNameFinder.find}} works something like this:

{code}
for every token in every tokenlist in the dictionary:
    copy it into a "meta dictionary" of single tokens

for every possible subsequence of tokens in the sentence:   // of which there are O(N^2)
    copy the sequence into a new array
    if the last token is in the "meta dictionary":
        make a StringList from the tokens
        look it up in the dictionary
{code}
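A rough, self-contained Java sketch of that scan, using a plain HashSet of token lists as a stand-in for the real Dictionary (the names and types here are illustrative, not OpenNLP's actual API):

```java
import java.util.*;

public class NaiveScan {
    // Stand-in "dictionary" holding one multi-token entry.
    static Set<List<String>> dict = new HashSet<>(
            Collections.singleton(Arrays.asList("New", "York")));
    // "Meta dictionary" of every single token appearing in any entry.
    static Set<String> metaDict = new HashSet<>(Arrays.asList("New", "York"));

    // Returns all dictionary matches in the sentence. Note the O(N^2)
    // subsequence loop, each iteration copying the candidate span into
    // a freshly allocated list before the lookup.
    static List<List<String>> find(String[] sentence) {
        List<List<String>> hits = new ArrayList<>();
        for (int start = 0; start < sentence.length; start++) {
            for (int end = start + 1; end <= sentence.length; end++) {
                if (metaDict.contains(sentence[end - 1])) {
                    List<String> span = Arrays.asList(
                            Arrays.copyOfRange(sentence, start, end));
                    if (dict.contains(span)) {
                        hits.add(span);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(find(new String[] {"I", "love", "New", "York"}));
        // prints [[New, York]]
    }
}
```

Even with the meta-dictionary filter, every sentence pays a quadratic number of span checks and a fresh allocation per candidate.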

{{Dictionary}} itself is very heavyweight: it is a {{Set<StringListWrapper>}}, which 
wraps {{StringList}}, which wraps a {{String[]}}.  Every entry in the dictionary 
requires at least four allocated objects (in addition to the Strings): the {{String[]}}, 
{{StringList}}, {{StringListWrapper}}, and {{HashMap.Entry}}.  Even {{put}} and {{remove}} allocate 
new objects!
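To make the allocation count concrete, here is the layering described above in skeletal form (the class names match OpenNLP's, but the bodies are simplified stand-ins, not the real source):

```java
import java.util.*;

// Simplified stand-ins mirroring the layering described above.
class StringList {                   // object 2: the StringList itself...
    final String[] tokens;           // ...wrapping object 1, the String[]
    StringList(String... t) { tokens = t; }
}

class StringListWrapper {            // object 3: wraps the StringList
    final StringList list;
    StringListWrapper(StringList l) { list = l; }
}

public class DictLayering {
    public static void main(String[] args) {
        // The HashSet is backed by a HashMap, which adds object 4:
        // one HashMap.Entry per element. So a single two-token entry
        // costs at least four allocations beyond the token Strings.
        Set<StringListWrapper> entries = new HashSet<>();
        entries.add(new StringListWrapper(new StringList("New", "York")));
        System.out.println(entries.size());   // prints 1
    }
}
```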

From this comment in {{DictionaryNameFinder}}:

{code}
// TODO: improve performance here
{code}

It seems like improvements would be welcome.  :)  Removing some of the object 
overhead would more than make up for no longer interning strings.  Should I create a new 
Jira ticket to propose a more efficient design?

  was:
The current implementation of 
[StringList|https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup]
 calls `intern()` on every String.  Presumably this is an attempt to reduce 
memory usage for duplicate tokens.  Interned Strings are stored in the JVM's 
permanent generation, which has a small fixed size (seems to be about 83 MB on 
modern 64-bit JVMs: 
[http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
  Once this fills up, the JVM crashes with an `OutOfMemoryError: PermGen 
space`.  

The size of the PermGen can be increased with the `-XX:MaxPermSize=` option to 
the JVM.  However, this option is non-standard and not well known, and it would 
be nice if OpenNLP worked out of the box without deep JVM tuning.


This immediate problem could be fixed by simply not interning Strings.  Looking 
at the `Dictionary` and `DictionaryNameFinder` code as a whole, however, there 
is a huge amount of room for performance improvement.  Currently, 
`DictionaryNameFinder.find` works something like this:

{code}
for every token in every tokenlist in the dictionary:
    copy it into a "meta dictionary" of single tokens

for every possible subsequence of tokens in the sentence:        // of which 
there are O(N^2)
    copy the sequence into a new array
    if the last token is in the "meta dictionary":
        make a StringList from the tokens
        look it up in the dictionary
{code}

`Dictionary` itself is very heavyweight: it's a `Set<StringListWrapper>`, which 
wraps `StringList`, which wraps `Array<String>`.  Every entry in the dictionary 
requires at least four allocated objects (in addition to the Strings): `Array`, 
`StringList`, `StringListWrapper`, and `HashMap.Entry`.  Even `put` and 
`remove` allocate new objects!

From this comment in `DictionaryNameFinder`:

{code}
        // TODO: improve performance here
{code}

It seems like improvements would be welcome.  :)  Removing some of the object 
overhead would more than make up for interning strings.  Should I create a new 
Jira ticket to propose a more efficient design?


Apparently your JIRA doesn't like markup.
                
> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> ------------------------------------------------------------------------------
>
>                 Key: OPENNLP-421
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-421
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>    Affects Versions: tools-1.5.2-incubating
>         Environment: RedHat 5, JDK 1.6.0_29
>            Reporter: Jay Hacker
>            Priority: Minor
>              Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
