[ https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17796940#comment-17796940 ]

ASF GitHub Bot commented on OPENNLP-421:
----------------------------------------

kinow commented on PR #568:
URL: https://github.com/apache/opennlp/pull/568#issuecomment-1856892659

   Thanks for the GC numbers. The new code seems a bit heavier on GC and 
memory allocations, but I think that was expected.
   
   A bit late here, but I decided to give it a try. Excuse me if I say 
anything silly; I can check these results again tomorrow after having some 
:coffee: . I opened your branch in IntelliJ, then made the `jmh` profile 
active by default (how do you tell IntelliJ to include the deps of a 
profile? anywho), invalidated caches, etc.
   
   Executed the tests just to make sure they were running, since you had 
already provided the results. Then executed with the profiler to generate 
the `.jfr` file. Saved the `jmh` source folder, checked out `main`, patched 
the `pom.xml` files, and ran the same test again to get another `.jfr` for 
the `main` branch (so the first `.jfr` is for this branch, the second one 
for `main`).
   
   Opened both in IntelliJ and selected the `.jfr` file from this branch to 
compare against the one from `main`. Not sure whether they show a fair 
comparison, as the difference in throughput might exaggerate some numbers, 
but in case you find it useful, @rzo1 . 
   
   ## CPU samples
   
   
![flamegraph](https://github.com/apache/opennlp/assets/304786/5d715064-5ea0-4dbf-8d89-f8c277ad0a1f)
   
   I think it just confirms the change in calls to the intern functions.
   
   
![image](https://github.com/apache/opennlp/assets/304786/26235942-e711-4ed3-b242-18098b9ddca1)
   
   Also the increase in GC calls.
   
   
![image](https://github.com/apache/opennlp/assets/304786/759928d2-1674-426a-a80a-24a39bf90ff8)
   
   And I think the Random calls come from JMH, which takes more or fewer 
samples depending on throughput.
   
   
![image](https://github.com/apache/opennlp/assets/304786/a0eb0c07-8feb-421f-bd1d-01f56135479e)
   
   ## Memory allocations
   
   
![flamegraph](https://github.com/apache/opennlp/assets/304786/7e20ff7b-3b5d-45be-9984-67229c9441c6)
   
   There are a lot more Strings, as expected, as well as byte arrays from 
the array-copy calls.
   
   
![image](https://github.com/apache/opennlp/assets/304786/d3f203ff-1bbc-4d33-8c70-c2ed93861443)
   
   
![image](https://github.com/apache/opennlp/assets/304786/43ba112d-2402-4c68-966f-f2e335334f74)
   
   Unfortunately I don't have a better or more practical test for comparing 
memory (maybe a single run of a large example would be better for that than 
multiple JMH runs…). But I don't think this should really be a problem. I 
wonder whether the interning was premature optimization, or whether at some 
point someone actually had memory issues. Either way, it's a change that is 
easy to revert if needed, and this way users won't have to modify heap 
settings. So I am inclined to approve it! :tada: 
   
   Will wait until tomorrow so you, Martin, Jeff, and others have some time 
to review :sleeping_bed: 
   
   Thanks!
   
   Bruno




> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> ------------------------------------------------------------------------------
>
>                 Key: OPENNLP-421
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-421
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Name Finder
>    Affects Versions: tools-1.5.2-incubating
>         Environment: RedHat 5, JDK 1.6.0_29
>            Reporter: Jay Hacker
>            Assignee: Richard Zowalla
>            Priority: Minor
>              Labels: performance
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>  
> calls intern() on every String. Presumably this is an attempt to reduce 
> memory usage for duplicate tokens. Interned Strings are stored in the JVM's 
> permanent generation, which has a small fixed size (seems to be about 83 MB 
> on modern 64-bit JVMs: 
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
> Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen space.
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to 
> the JVM. However, this option is non-standard and not well known, and it 
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
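A hedged sketch of the interning behaviour behind this report: `intern()` returns the canonical pooled copy of a String, and on pre-Java-8 HotSpot JVMs that pool lived in PermGen, so interning millions of distinct dictionary tokens could exhaust its fixed size. The class name here is illustrative, not OpenNLP code.

```java
// Minimal sketch of String interning (illustrative, not OpenNLP source).
// intern() returns the canonical pooled copy; on pre-Java-8 JVMs the pool
// lived in the fixed-size PermGen, which large dictionaries could fill.
public class InternSketch {
    public static void main(String[] args) {
        String fresh = new String("token"); // a distinct heap object
        String pooled = "token";            // a compile-time constant, already pooled
        System.out.println(fresh == pooled);          // false: different references
        System.out.println(fresh.intern() == pooled); // true: intern() yields the pooled copy
    }
}
```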
> This immediate problem could be fixed by simply not interning Strings.  
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however, 
> there is a huge amount of room for performance improvement.  Currently, 
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
>     copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence:   // of which there are O(N^2)
>     copy the sequence into a new array
>     if the last token is in the "meta dictionary":
>         make a StringList from the tokens
>         look it up in the dictionary
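The loop above can be sketched in Java as follows; the names (`metaDictionary`, the span representation) are illustrative assumptions, not the actual OpenNLP API.

```java
import java.util.*;

// Hedged sketch of the lookup loop described above (not the real
// DictionaryNameFinder.find; names and types are illustrative).
public class FindSketch {
    static List<int[]> find(String[] sentence,
                            Set<List<String>> dictionary,
                            Set<String> metaDictionary) {
        List<int[]> spans = new ArrayList<>();
        // Every possible subsequence: O(N^2) candidate spans.
        for (int start = 0; start < sentence.length; start++) {
            for (int end = start + 1; end <= sentence.length; end++) {
                // Cheap pre-filter: skip unless the last token is known.
                if (!metaDictionary.contains(sentence[end - 1])) continue;
                // Copy the candidate span into a fresh list (one of the
                // allocations the report complains about) and look it up.
                List<String> candidate =
                        Arrays.asList(Arrays.copyOfRange(sentence, start, end));
                if (dictionary.contains(candidate)) {
                    spans.add(new int[] {start, end});
                }
            }
        }
        return spans;
    }
}
```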
> Dictionary itself is very heavyweight: it's a Set<StringListWrapper>, which 
> wraps StringList, which wraps Array<String>.  Every entry in the dictionary 
> requires at least four allocated objects (in addition to the Strings): Array, 
> StringList, StringListWrapper, and HashMap.Entry.  Even contains and remove 
> allocate new objects!
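The wrapping can be illustrated with simplified stand-ins (these are stubs for the sketch, not the real `StringList`/`StringListWrapper` classes): even a `contains` check forces two fresh allocations just to hash the tokens.

```java
import java.util.*;

// Simplified stand-ins for the wrapping described above; the actual
// OpenNLP StringList/StringListWrapper classes differ in detail.
class TokenList {
    final String[] tokens;                        // wraps the token array
    TokenList(String... tokens) { this.tokens = tokens; }
}

class TokenListWrapper {                          // wraps the TokenList again
    final TokenList list;
    TokenListWrapper(TokenList list) { this.list = list; }
    @Override public boolean equals(Object o) {
        return o instanceof TokenListWrapper
                && Arrays.equals(list.tokens, ((TokenListWrapper) o).list.tokens);
    }
    @Override public int hashCode() { return Arrays.hashCode(list.tokens); }
}

public class DictionarySketch {
    public static void main(String[] args) {
        Set<TokenListWrapper> entries = new HashSet<>();
        entries.add(new TokenListWrapper(new TokenList("New", "York")));
        // Even a lookup allocates two throwaway objects before hashing:
        boolean hit = entries.contains(
                new TokenListWrapper(new TokenList("New", "York")));
        System.out.println(hit);
    }
}
```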
> From this comment in DictionaryNameFinder:
>         // TODO: improve performance here
> It seems like improvements would be welcome. :) Removing some of the object 
> overhead would more than make up for no longer interning strings. Should I 
> create a new Jira ticket to propose a more efficient design?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
