On Wed, Dec 19, 2007 at 10:16:36AM +0100, Marc Weeber wrote:
> Hi Andi and others,
> 
> I downloaded and installed the JCC version (man, that was a positively
> different experience!), and changed my test script accordingly. The
> problem is still there: the sort asks for a humongous amount of
> memory. I have to pass maxheap='470m' or it dies with an
> out-of-memory error.

I think the problem is that (if I'm reading the code correctly) Lucene
caches the sort fields in memory, so it doesn't have to go back to the
underlying documents, and so you can sort on fields that are indexed
but not stored.  See org.apache.lucene.search.FieldSortedHitQueue,
which calls FieldCache (implemented in FieldCacheImpl).
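
To make that concrete, something like the following (an untested
sketch; the index path and the 'date'/'contents' field names are made
up) is the kind of sorted search that populates that cache the first
time it runs against a reader:

    import lucene
    lucene.initVM(lucene.CLASSPATH, maxheap='512m')

    # Open the index and ask for hits sorted by an indexed (not
    # necessarily stored) field.  'date' and 'contents' are placeholders.
    searcher = lucene.IndexSearcher('/path/to/index')
    query = lucene.TermQuery(lucene.Term('contents', 'lucene'))
    sort = lucene.Sort(lucene.SortField('date', lucene.SortField.STRING))

    # The first sorted search against this reader fills the FieldCache
    # array (one slot per document in the index) -- that's where the
    # extra memory goes.
    hits = searcher.search(query, sort)
    print(hits.length())
    searcher.close()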

The caches are keyed by reader and are arrays indexed by document
number, so their length is proportional to the total number of
documents in the index.  Thus, if you have a lot of documents, sorting
by fields can be memory-intensive, especially if the fields are
lengthy strings.  For your 50M-document index, ~8 bytes of per-field
sort data per document works out to roughly 400MB, which would explain
the extra memory you're seeing.
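
Back-of-the-envelope (illustrative numbers only):

    # One cache slot per document, regardless of how many hits the
    # query actually returns.
    num_docs = 50 * 1000 * 1000        # ~50M documents
    bytes_per_doc = 8                  # assumed per-field sort data
    print(num_docs * bytes_per_doc / (1024.0 * 1024.0))   # ~381 MB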

If you have a lot of dead space (reader.maxDoc() >> reader.numDocs()),
optimizing should decrease memory usage.
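
Something along these lines (untested; the path and analyzer choice
are placeholders) would show how big that gap is and then collapse it:

    import lucene
    lucene.initVM(lucene.CLASSPATH)

    # A large gap between maxDoc() and numDocs() means deleted
    # documents are still taking up slots in the per-reader cache arrays.
    reader = lucene.IndexReader.open('/path/to/index')
    print('maxDoc=%d numDocs=%d' % (reader.maxDoc(), reader.numDocs()))
    reader.close()

    # Optimizing merges the segments and drops the deleted documents.
    writer = lucene.IndexWriter('/path/to/index', lucene.StandardAnalyzer(), False)
    writer.optimize()
    writer.close()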

As Andi says, [EMAIL PROTECTED] is more likely to be
helpful here.

    Aaron Lav ([EMAIL PROTECTED] / http://www.pobox.com/~asl2)
