[
https://issues.apache.org/jira/browse/JCR-1931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663010#action_12663010
]
Jukka Zitting commented on JCR-1931:
------------------------------------
Seems OK to me too. Ard, can you commit this with Marcel's suggestions? I'll
then merge it to 1.5 for inclusion in 1.5.1.
> SharedFieldCache$StringIndex memory leak causing OOM's
> -------------------------------------------------------
>
> Key: JCR-1931
> URL: https://issues.apache.org/jira/browse/JCR-1931
> Project: Jackrabbit
> Issue Type: Bug
> Components: query
> Affects Versions: 1.5.0
> Reporter: Ard Schrijvers
> Assignee: Ard Schrijvers
> Priority: Critical
> Fix For: 1.5.1
>
> Attachments: JCR-1931.patch, OrderByOOMTest.java
>
>
> SharedFieldCache$StringIndex is not working properly. It is meant to cache
> the Lucene document numbers along with the term to sort on. The issue is
> twofold. I have a solution for the second part; the first is not really
> solvable from Jackrabbit's point of view, because Lucene index readers
> already cache Terms heavily.
> Explanation of the problem:
> For *each* unique property that is sorted on, a new Lucene
> ScoreDocComparator is created (see SharedFieldComparator newComparator). This
> new comparator creates, *per* Lucene index reader, a
> SharedFieldCache.StringIndex, which is stored in a WeakHashMap keyed by the
> index reader. As this index reader can almost *never* be garbage collected
> (only once it has been merged and is thus no longer used), the
> SharedFieldCache.StringIndex instances stay around for the rest of the JVM's
> life (which is sometimes short, as can be seen from the simple unit test
> attached). Obviously, this results in an OOM pretty quickly.
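To illustrate the retention described above, here is a minimal sketch in modern Java. It is not Jackrabbit's actual code: `WeakCacheSketch`, `termsFor` and `cachedFields` are hypothetical names, and a plain `Object` stands in for the index reader. It only shows that a WeakHashMap entry keeps accumulating one array per sorted field for as long as its key stays strongly reachable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical stand-in for a per-reader sort cache; not Jackrabbit's classes. */
final class WeakCacheSketch {
    /** reader -> (field name -> one slot per document number). */
    static final Map<Object, Map<String, String[]>> CACHE = new WeakHashMap<>();

    /** Returns the cached array for (reader, field), allocating it on first use. */
    static String[] termsFor(Object reader, String field, int maxDoc) {
        synchronized (CACHE) {
            return CACHE.computeIfAbsent(reader, r -> new HashMap<>())
                        .computeIfAbsent(field, f -> new String[maxDoc]);
        }
    }

    /** Number of distinct fields currently cached for this reader. */
    static int cachedFields(Object reader) {
        synchronized (CACHE) {
            Map<String, String[]> perReader = CACHE.get(reader);
            return perReader == null ? 0 : perReader.size();
        }
    }
}
```

The weak key only helps once nothing else references the reader. While the reader is open, every new sort property adds another array of `maxDoc` slots under the same entry, and none of them can be reclaimed.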
> 1) issue one: The cached terms[] in SharedFieldCache.StringIndex can become
> huge when you sort on a common property (date) which is present in a lot of
> nodes. If you sort on large properties, like 'title', this
> SharedFieldCache.StringIndex will quickly use hundreds of MB for a couple of
> hundred thousand nodes with a title. This is already a Lucene issue, as
> Lucene already caches the terms. OTOH, I really doubt whether we should index
> long string values as UNTOKENIZED in Lucene at all. A half-working solution
> might be a two-step sort, where the first comparison is on the first 10
> chars, and only if the comparator returns 0 is the entire string compared.
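The two-step idea could look roughly like the sketch below. This is only an illustration under assumptions, not anything from the attached patch: `PrefixFirstSketch`, `prefix`, `byDoc` and the `loadFull` callback are hypothetical, and the prefixes are assumed to be non-null for brevity. Only the short prefixes are kept in memory; the full value is fetched when two prefixes tie.

```java
import java.util.Comparator;
import java.util.function.IntFunction;

/** Hypothetical two-step comparator: cached prefix first, full value on a tie. */
final class PrefixFirstSketch {
    static final int PREFIX_LENGTH = 10;

    /** The first PREFIX_LENGTH chars of the value, or the whole value if shorter. */
    static String prefix(String value) {
        return value.length() <= PREFIX_LENGTH ? value : value.substring(0, PREFIX_LENGTH);
    }

    /**
     * prefixes[doc] holds the cached prefix for each document number.
     * loadFull is an assumed callback that fetches the complete value.
     */
    static Comparator<Integer> byDoc(String[] prefixes, IntFunction<String> loadFull) {
        return (a, b) -> {
            int c = prefixes[a].compareTo(prefixes[b]);
            if (c != 0) {
                return c; // prefixes differ, so the full values order the same way
            }
            if (prefixes[a].length() < PREFIX_LENGTH) {
                return 0; // both values are shorter than the prefix, hence equal
            }
            // Tie on a full-length prefix: fall back to the complete strings.
            return loadFull.apply(a).compareTo(loadFull.apply(b));
        };
    }
}
```

This only saves memory if the full values are not cached elsewhere and loading them on a tie is affordable, which is presumably why the description calls it half-working.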
> 2) issue two: The cached terms[] in SharedFieldCache.StringIndex is
> frequently sparse, consuming an incredible amount of memory for string arrays
> containing mainly null values. For example (see attached unit test):
> - add 1.000.000 nodes
> - do a query and sort on a non-existing property
> - you'll lose 1.000.000 * 4 bytes ~ 4 MB of memory
> - sort on another non-existing prop: another 4 MB is lost
> - do it 100 times --> 400 MB is lost, and can't be reclaimed
> I'll attach a solution which works really well for me. It still has the
> almost unavoidable memory absorption, but makes it much smaller. The solution
> is that if < 10% of the String array is filled, I consider the array sparse
> and move to a HashMap. Performance does not decrease much (and with high
> sparsity it increases, because less memory consumption --> less GC, etc.).
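For readers without the patch at hand, the sparse fallback can be sketched as follows. This is an assumption-laden illustration, not the contents of JCR-1931.patch: `SparseTermsSketch`, `termFor` and `isSparse` are hypothetical names, and only the 10% threshold comes from the description. A mostly-null array of 1.000.000 references costs about 4 MB with 4-byte references, while the map only pays for the entries that exist.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-document term lookup that switches to a map when sparse. */
final class SparseTermsSketch {
    /** Below this fill ratio the array is treated as sparse (10%, per the description). */
    static final double SPARSE_THRESHOLD = 0.10;

    private final String[] dense;              // used when enough slots are filled
    private final Map<Integer, String> sparse; // used when most slots are null

    SparseTermsSketch(String[] terms) {
        int filled = 0;
        for (String term : terms) {
            if (term != null) {
                filled++;
            }
        }
        if (filled < terms.length * SPARSE_THRESHOLD) {
            // Copy only the non-null slots; the large array can then be collected.
            Map<Integer, String> map = new HashMap<>();
            for (int doc = 0; doc < terms.length; doc++) {
                if (terms[doc] != null) {
                    map.put(doc, terms[doc]);
                }
            }
            this.sparse = map;
            this.dense = null;
        } else {
            this.dense = terms;
            this.sparse = null;
        }
    }

    /** The term for a document number, or null if the document has none. */
    String termFor(int doc) {
        return sparse != null ? sparse.get(doc) : dense[doc];
    }

    boolean isSparse() {
        return sparse != null;
    }
}
```

A map entry costs more per term than an array slot, so the switch only pays off well below full occupancy. The exact break-even point depends on the JVM, so the 10% cut-off should be read as a heuristic.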
> This may not seem to be a common issue (certainly not the unit test), but
> memory snapshots of our production environments indicate that most memory is
> held by the SharedFieldCache$StringIndex (and the Lucene Terms, which are
> harder to avoid).
> I'd like to see this in 1.5.1 if others are OK with it.