[ 
https://issues.apache.org/jira/browse/JCR-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663010#action_12663010
 ] 

Jukka Zitting commented on JCR-1931:
------------------------------------

Seems OK to me too. Ard, can you commit this with Marcel's suggestions? I'll
then merge it to 1.5 for inclusion in 1.5.1.

> SharedFieldCache$StringIndex memory leak causing OOM's 
> -------------------------------------------------------
>
>                 Key: JCR-1931
>                 URL: https://issues.apache.org/jira/browse/JCR-1931
>             Project: Jackrabbit
>          Issue Type: Bug
>          Components: query
>    Affects Versions: 1.5.0
>            Reporter: Ard Schrijvers
>            Assignee: Ard Schrijvers
>            Priority: Critical
>             Fix For: 1.5.1
>
>         Attachments: JCR-1931.patch, OrderByOOMTest.java
>
>
> SharedFieldCache$StringIndex is not working properly. It is meant to cache
> the doc numbers in Lucene along with the term to sort on. The issue is
> twofold. I have a solution for the second part; the first is not really
> solvable from Jackrabbit's point of view, because Lucene index readers
> already cache Terms heavily.
> Explanation of the problem:
> For *each* unique property that is sorted on, a new Lucene
> ScoreDocComparator is created (see SharedFieldComparator newComparator). This
> new comparator creates, *per* Lucene index reader, a
> SharedFieldCache.StringIndex, which is stored in a WeakHashMap keyed by the
> index reader. As this index reader can almost *never* be garbage collected
> (only once it has been merged and is thus unused afterwards), the
> SharedFieldCache.StringIndex instances stay around for the rest of the JVM's
> life (which is sometimes short, as can be seen from the simple unit test
> attached). Obviously, this results in an OOM pretty fast.
> 1) issue one: The cached terms[] in SharedFieldCache.StringIndex can become
> huge when you sort on a common property (date) which is present in a lot of
> nodes. If you sort on large properties, like 'title', this
> SharedFieldCache.StringIndex will quickly use hundreds of MB for a couple of
> hundred thousand nodes with a title. This is already a Lucene issue, as
> Lucene itself caches the terms. OTOH, I really doubt whether we should index
> long string values as UNTOKENIZED in Lucene at all. A half-working solution
> might be a two-step sort, where the first comparison is on the first 10
> chars, and only if the comparator returns 0 is the entire string used to
> sort on.
> 2) issue two:  The cached terms[] in SharedFieldCache.StringIndex is 
> frequently sparse, consuming an incredible amount of memory for string arrays 
> containing mainly null values. For example (see attached unit test):
> - add 1.000.000 nodes
> - do a query and sort on a non existing property
> - you'll lose 1.000.000 * 4 bytes ~ 4 MB of memory
> - sort on another non existing prop : another 4 MB is lost
> - do it 100 times --> 400 MB is lost, and can't be reclaimed
> I'll attach a solution which works really well for me. It still has the
> almost unavoidable memory absorption, but makes it much smaller. The solution
> is that if < 10% of the String array is filled, I consider the array sparse
> and move to a HashMap solution. Performance does not decrease much (and in
> case of large sparsity it increases, because less memory consumption -->
> less gc, etc).
> Perhaps it does not seem to be a common issue (certainly not judging by the
> unit test), but memory snapshots of our production environments indicate
> that most memory is held by the SharedFieldCache$StringIndex (and the Lucene
> Terms, which are harder to avoid).
> I'd like to see this in 1.5.1 if others are OK with it.
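
For illustration, the sparse-array fallback described under issue two could
look roughly like the sketch below. It is not the attached JCR-1931.patch: the
class SparseTerms and its method names are hypothetical, and only the 10%
threshold is taken from the description above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not Jackrabbit's actual SharedFieldCache code. */
class SparseTerms {
    // Threshold from the issue description: below 10% fill counts as sparse.
    static final double SPARSE_THRESHOLD = 0.10;

    private final String[] dense;              // kept when the array is well filled
    private final Map<Integer, String> sparse; // used when the array is mostly null

    SparseTerms(String[] terms) {
        int filled = 0;
        for (String t : terms) {
            if (t != null) {
                filled++;
            }
        }
        if (terms.length > 0 && filled < terms.length * SPARSE_THRESHOLD) {
            // Copy only the non-null entries, keyed by document number,
            // so the large, mostly empty array can be garbage collected.
            Map<Integer, String> map = new HashMap<>();
            for (int doc = 0; doc < terms.length; doc++) {
                if (terms[doc] != null) {
                    map.put(doc, terms[doc]);
                }
            }
            sparse = map;
            dense = null;
        } else {
            dense = terms;
            sparse = null;
        }
    }

    /** Returns the sort term for a document number, or null if it has none. */
    String termFor(int doc) {
        return dense != null ? dense[doc] : sparse.get(doc);
    }

    boolean isSparse() {
        return sparse != null;
    }
}
```

With 1.000.000 documents and a non-existing sort property, such a structure
would hold an empty map instead of a 1.000.000-slot array of null references.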

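The two-step sort suggested under issue one (compare the first 10 chars, and
fall back to the entire string only on a tie) might be sketched as follows.
PrefixFirstComparator is a hypothetical name, and the sketch assumes both full
strings are already available; in a real index the memory saving would only
come from caching the prefixes and fetching the full value lazily on a tie.

```java
import java.util.Comparator;

/** Hypothetical sketch of a prefix-first string comparator. */
class PrefixFirstComparator implements Comparator<String> {
    // Prefix length taken from the issue description ("first 10 chars").
    static final int PREFIX_LENGTH = 10;

    static String prefix(String s) {
        return s.length() <= PREFIX_LENGTH ? s : s.substring(0, PREFIX_LENGTH);
    }

    @Override
    public int compare(String a, String b) {
        // Step one: compare only the short prefixes.
        int c = prefix(a).compareTo(prefix(b));
        // Step two: on a tie, compare the entire strings.
        return c != 0 ? c : a.compareTo(b);
    }
}
```

The resulting order is the same as plain String.compareTo; only the amount of
data touched in the common case changes.
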
-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
