[jira] Created: (JCR-1931) SharedFieldCache$StringIndex memory leak causing OOM's

Ard Schrijvers (JIRA) Thu, 08 Jan 2009 09:37:22 -0800

SharedFieldCache$StringIndex memory leak causing OOM's 
-------------------------------------------------------


                 Key: JCR-1931
                 URL: https://issues.apache.org/jira/browse/JCR-1931
             Project: Jackrabbit
          Issue Type: Bug
          Components: query
    Affects Versions: 1.5.0
            Reporter: Ard Schrijvers
            Assignee: Ard Schrijvers
            Priority: Critical
             Fix For: 1.5.1


SharedFieldCache$StringIndex is not working properly. It is meant to cache the 
Lucene document numbers along with the term to sort on. The issue is twofold. 
I have a solution for the second part; the first is not really solvable from 
Jackrabbit's point of view, because Lucene index readers already cache Terms 
heavily. 

Explanation of the problem:

For *each* unique property that is sorted on, a new Lucene ScoreDocComparator 
is created (see SharedFieldComparator newComparator). This new comparator 
creates, *per* Lucene index reader, a SharedFieldCache.StringIndex, which is 
stored in a WeakHashMap keyed by the index reader. As this index reader can 
almost *never* be garbage collected (only once it has been merged and is 
therefore unused), the SharedFieldCache.StringIndex instances stay around for 
the rest of the JVM's life (which is sometimes short, as can be seen from the 
simple unit test attached). Obviously, this leads to OOM pretty fast.
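
The retention pattern described above can be sketched roughly as follows. This 
is not Jackrabbit's actual code: the class ReaderKeyedCacheSketch, its methods 
and the plain Object standing in for the index reader are all hypothetical, and 
it assumes Java 8 or later. It only illustrates that a WeakHashMap entry cannot 
be reclaimed while its key is still strongly referenced, so every newly sorted 
field adds one more array that stays reachable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical sketch of a per-reader, per-field cache; not Jackrabbit's code. */
final class ReaderKeyedCacheSketch {

    // Outer key: the index reader (here a plain Object stand-in).
    // Inner key: the name of the field that was sorted on.
    private final Map<Object, Map<String, String[]>> cache = new WeakHashMap<>();

    /** Returns the cached terms array for (reader, field), creating it on first use. */
    String[] getStringIndex(Object reader, String field, int maxDoc) {
        Map<String, String[]> perReader =
                cache.computeIfAbsent(reader, r -> new HashMap<>());
        // One slot per document, even if no document has this field.
        return perReader.computeIfAbsent(field, f -> new String[maxDoc]);
    }

    /** Number of arrays currently retained for the given reader. */
    int cachedFieldCount(Object reader) {
        Map<String, String[]> perReader = cache.get(reader);
        return perReader == null ? 0 : perReader.size();
    }

    public static void main(String[] args) {
        ReaderKeyedCacheSketch sketch = new ReaderKeyedCacheSketch();
        // Strongly referenced for the whole run, like a long-lived index reader.
        Object reader = new Object();
        for (int i = 0; i < 5; i++) {
            sketch.getStringIndex(reader, "prop" + i, 1000);
        }
        System.out.println("arrays retained: " + sketch.cachedFieldCount(reader));
    }
}
```

The weak key only helps once nothing else references the reader, which per the 
description above happens only after a merge.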

1) issue one:  The cached terms[] in SharedFieldCache.StringIndex can become 
huge when you sort on a common property (date) which is present in a lot of 
nodes. If you sort on large properties, like 'title', this 
SharedFieldCache.StringIndex will quickly use hundreds of MB for a couple of 
hundred thousand nodes with a title. This is already a Lucene issue, as Lucene 
itself caches the terms. OTOH, I really doubt whether we should index long 
string values as UNTOKENIZED in Lucene at all. A half-working solution might 
be a two-step sort: first compare on the first 10 chars, and only if that 
comparison returns 0, compare the entire strings.
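
A minimal sketch of the comparison logic behind the two-step idea, assuming 
plain Java strings. The class TwoStepComparatorSketch and its members are 
hypothetical names, not part of Jackrabbit or Lucene. The sketch only shows 
that ordering by a 10-char prefix and falling back to the full value on a tie 
gives the same order as comparing full strings; a real implementation would 
additionally need to cache only the prefixes and load the full values on 
demand to save any memory.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch of a prefix-first string comparison. */
final class TwoStepComparatorSketch {

    static final int PREFIX_LENGTH = 10;

    /** First PREFIX_LENGTH chars of s, or s itself if it is shorter. */
    static String prefix(String s) {
        return s.length() <= PREFIX_LENGTH ? s : s.substring(0, PREFIX_LENGTH);
    }

    /** Step one: compare prefixes. Step two (ties only): compare full strings. */
    static final Comparator<String> TWO_STEP = (a, b) -> {
        int c = prefix(a).compareTo(prefix(b));
        return c != 0 ? c : a.compareTo(b);
    };

    public static void main(String[] args) {
        String[] titles = {"jackrabbit query", "jackrabbit core", "lucene", "jack"};
        Arrays.sort(titles, TWO_STEP);
        System.out.println(Arrays.toString(titles));
    }
}
```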

2) issue two:  The cached terms[] in SharedFieldCache.StringIndex is frequently 
sparse, consuming an incredible amount of memory for string arrays containing 
mainly null values. For example (see attached unit test):

- add 1,000,000 nodes
- do a query and sort on a non-existing property
- you'll lose 1,000,000 * 4 bytes ~ 4 MB of memory
- sort on another non-existing property: another 4 MB is lost
- do it 100 times --> 400 MB is lost, and can't be reclaimed

I'll attach a solution that works really well for me. It still has the almost 
unavoidable memory absorption, but makes it much smaller. The solution is that 
if less than 10% of the String array is filled, I consider the array sparse 
and switch to a HashMap. Performance does not decrease much (and with high 
sparsity it even increases, because lower memory consumption means less GC, 
etc.). 
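
The sparse fallback could look roughly like the sketch below. This is not the 
attached patch: the class SparseTermsSketch and its method names are 
hypothetical, and only the "less than 10% filled" rule is taken from the 
description above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: keep a dense array, or a map when under 10% is filled. */
final class SparseTermsSketch {

    private final String[] dense;              // used when the array is well filled
    private final Map<Integer, String> sparse; // used when under 10% is filled

    SparseTermsSketch(String[] terms) {
        int filled = 0;
        for (String term : terms) {
            if (term != null) {
                filled++;
            }
        }
        // filled / length < 10%, written with integers to avoid rounding issues.
        if (filled * 10L < terms.length) {
            Map<Integer, String> map = new HashMap<>();
            for (int doc = 0; doc < terms.length; doc++) {
                if (terms[doc] != null) {
                    map.put(doc, terms[doc]);
                }
            }
            this.sparse = map;
            this.dense = null; // drop the mostly-null array
        } else {
            this.dense = terms;
            this.sparse = null;
        }
    }

    /** Term for the given document number, or null if the document has none. */
    String termFor(int doc) {
        return dense != null ? dense[doc] : sparse.get(doc);
    }

    boolean isSparse() {
        return sparse != null;
    }

    public static void main(String[] args) {
        String[] terms = new String[1000];
        terms[42] = "2009-01-08";
        SparseTermsSketch index = new SparseTermsSketch(terms);
        System.out.println(index.isSparse() + " " + index.termFor(42));
    }
}
```

A map entry costs considerably more than one array slot, which is why a 
threshold makes sense: below it, the map holds few enough entries to be 
smaller than the mostly-null array.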

Perhaps this does not look like a common issue (certainly not going by the 
unit test), but memory snapshots from our production environments show most 
memory being held by SharedFieldCache$StringIndex (and by the Lucene Terms, 
which is harder to avoid).

I'd like to see this in 1.5.1 if others are OK with it.




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
