[
https://issues.apache.org/jira/browse/LUCENE-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Uwe Schindler updated LUCENE-1520:
----------------------------------
Attachment: LUCENE-1520.patch
Again a slightly improved patch: the byte[] is now allocated only once for all
fields in CheckIndex. The length check is unnecessary because the array is
preallocated to maxDoc; instead, that check was moved and modified to compare
the SegmentInfo docCount against the reader's maxDoc.
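
For illustration only, here is a minimal sketch of that approach against the plain IndexReader API (the class and method names below are made up for this example; it is not the patch itself): a single byte[] of length maxDoc() is allocated up front and refilled per field through the non-caching 3-argument norms() method.

{code:java}
import java.io.IOException;
import java.util.Iterator;
import org.apache.lucene.index.IndexReader;

// Sketch only (hypothetical class, not from the patch): reuse one preallocated
// buffer for every field's norms instead of letting the reader cache a new
// byte[maxDoc] per field.
public class NormsCheckSketch {
  public static void checkAllNorms(IndexReader reader) throws IOException {
    final byte[] norms = new byte[reader.maxDoc()]; // allocated once; no per-field length check needed
    Iterator it = reader.getFieldNames(IndexReader.FieldOption.INDEXED).iterator();
    while (it.hasNext()) {
      String field = (String) it.next();
      if (!reader.hasNorms(field)) {
        continue; // fields without norms share a dummy array; nothing to load
      }
      reader.norms(field, norms, 0); // fills the caller's buffer, creates no cache entry
      // ... per-field validation of 'norms' would go here ...
    }
  }
}
{code}

The actual patch additionally compares the SegmentInfo docCount against the reader's maxDoc(), which is not shown in this sketch.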
> OOM errors with CheckIndex with indexes containing a lot of fields with norms
> --------------------------------------------------------------------------
>
> Key: LUCENE-1520
> URL: https://issues.apache.org/jira/browse/LUCENE-1520
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.9
> Reporter: Uwe Schindler
> Assignee: Michael McCandless
> Fix For: 2.9
>
> Attachments: LUCENE-1520.patch, LUCENE-1520.patch
>
>
> All index readers have a cache of the last used norms (SegmentReader,
> MultiReader, MultiSegmentReader, ...). This cache is never cleaned up, so once
> you access the norms of a field, that field's byte[maxDoc()] array is not freed
> until you close/reopen the index.
> You can see this problem if you create an index with many fields with norms
> (I tested with about 4,000 fields) and many documents (half a million). If
> you then run CheckIndex, which calls norms() for each (!) field in the
> segment, each of these calls creates a new cache entry and you get
> OutOfMemoryErrors after a short time (I tested with the above index: I was
> not able to run CheckIndex even with "-Xmx 16GB" on 64-bit Java).
> CheckIndex opens and then tests each segment of an index with a separate
> SegmentReader. The big index with the OutOfMemory problem was optimized, so it
> consists of one segment with about half a million docs and about 4,000
> fields. Each byte[] norms array (one byte per document) takes about half a MiB
> for this index. CheckIndex created the norms for all 4,000 fields and the
> SegmentReader cached them, which amounts to about 2 GiB of RAM, so OOMs are
> not unusual.
> In my opinion, the best fix would be to use a Weak- or, better, a SoftReference,
> so norms.bytes becomes a java.lang.ref.SoftReference<byte[]> that is used for
> caching. With proper synchronization (which is already done on the norms cache
> in SegmentReader), a SoftReference works best, as such a reference is only
> cleared by the garbage collector when an OOM would otherwise occur. If the
> byte[] array is freed (and it is only freed if no other references to it
> exist), a later call to getNorms() creates a new array. Code that holds a hard
> reference to the norms array keeps it from being freed, so there is no problem.
> The same could be done for the other IndexReaders (a sketch of this idea
> follows below).
> Fields without norms do not have this problem, as all such fields share a
> single, one-time allocated dummy norms array. So the same index, with norms
> disabled for most of the fields, checked fine.
> I will prepare a patch tomorrow.
> Mike proposed another quick fix for CheckIndex:
> bq. we could do something first specifically for CheckIndex (eg it could
> simply use the 3-arg non-caching bytes method instead) to prevent OOM errors
> when using it.
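
For reference, a rough sketch of the SoftReference caching idea proposed in the description above; the holder class and its loader callback are hypothetical and not the real SegmentReader internals.

{code:java}
import java.io.IOException;
import java.lang.ref.SoftReference;

// Hypothetical holder (not SegmentReader's real Norm class): keeps the norms
// bytes behind a SoftReference so the JVM may reclaim them under memory
// pressure; the next access simply re-reads them from the index.
class SoftNormsCache {
  /** Hypothetical callback that re-reads the norms from disk when needed. */
  interface NormsLoader {
    byte[] load() throws IOException;
  }

  private SoftReference<byte[]> bytesRef;

  synchronized byte[] bytes(NormsLoader loader) throws IOException {
    byte[] bytes = (bytesRef == null) ? null : bytesRef.get();
    if (bytes == null) {           // never loaded, or cleared by the garbage collector
      bytes = loader.load();
      bytesRef = new SoftReference<byte[]>(bytes);
    }
    return bytes;                  // callers holding a hard reference keep the array alive
  }
}
{code}

Because a SoftReference is only cleared when the VM is about to run out of memory, callers that still hold a hard reference to the returned array are unaffected; only otherwise-unreferenced cached arrays can be reclaimed.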