hi all,

On 19 dec 2007, at 19:04, Aaron Lav wrote:

On Wed, Dec 19, 2007 at 10:16:36AM +0100, Marc Weeber wrote:
Hi Andi and others,

I downloaded and installed the jcc version (man, that was a positively
different experience!), and changed my test script accordingly. The
problem is still there: the sort asks for a humongous amount of
memory. I have to provide a maxheap='470m' or it will die with an out
of memory error.

I think the problem is that (if I'm reading the code correctly) Lucene
caches in-memory the fields on which you sort, so it doesn't have to
go back to the underlying documents, and so you can sort on indexed
but not stored fields.  See
org.apache.lucene.search.FieldSortedHitQueue.java, which calls
FieldCache.java, which is implemented in FieldCacheImpl.java.

The caches are indexed by reader, and are arrays indexed by document
number, so their length is proportional to the total number of
documents in the index.  Thus, if you have a lot of documents, sorting
by fields can be memory-intensive, especially if the fields are
lengthy strings.  So for your 50M document store, if your per-field
data for sorting is ~8 bytes, that might explain your ~400M additional
memory usage.
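
The arithmetic behind that estimate can be sketched in a few lines of
Python. This is only a back-of-the-envelope model, assuming one
fixed-size cache entry per document slot; the function name and the
8-byte figure are illustrative and are not taken from the Lucene code.

```python
def estimate_sort_cache_bytes(max_doc, bytes_per_entry):
    """Rough size of a per-document sort cache.

    Assumes one entry of bytes_per_entry for every document slot,
    so the total grows linearly with the size of the index.
    """
    return max_doc * bytes_per_entry


# 50M documents at roughly 8 bytes of sort data each (assumed figure)
estimate = estimate_sort_cache_bytes(50_000_000, 8)
print("%.0f MB" % (estimate / 1e6))  # prints "400 MB"
```

Longer per-document values, such as full strings, would push the
estimate up proportionally.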

I think you're right. The field to sort on is a date field in the
string format YYYY-MM-DD. I indeed started looking into the Java
sorting code, and I am no longer much surprised by the memory load.
The good thing is that after the first search+sort, it is *really*
fast: a co-occurrence search (two terms per doc in a boolean query)
together with a sort on date in the 50M collection takes between 50ms
and 200ms (timed in Python, before and after the search), with no real
difference between the jcc and gcc scripts.



If you have a lot of dead space (reader.maxDoc() >> reader.numDocs()),
optimizing should decrease memory usage.
Do you mean calling .optimize() on the index? I have already done
that. Or do you mean something different?
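
One way to check whether dead space matters is to compare the two
counts mentioned above. The helper below is a hypothetical sketch: it
takes plain integers, which in practice would be the values returned by
reader.maxDoc() and reader.numDocs(), and its name is not part of any
Lucene or PyLucene API.

```python
def dead_space_fraction(max_doc, num_docs):
    """Fraction of document slots that no longer hold a live document.

    max_doc  -- highest document number plus one (e.g. reader.maxDoc())
    num_docs -- number of live documents (e.g. reader.numDocs())
    """
    if max_doc == 0:
        return 0.0  # empty index: nothing to reclaim
    return (max_doc - num_docs) / float(max_doc)


# Example with made-up numbers: a quarter of the slots are deleted docs
print(dead_space_fraction(100, 75))  # prints 0.25
```

A value close to zero, as expected right after optimizing, suggests
there is little memory left to reclaim this way.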


As Andi says, [EMAIL PROTECTED] is more likely to be
helpful here.

I have done that; I'll wait for a reply there.

thanks for your help,

Marc




   Aaron Lav ([EMAIL PROTECTED] / http://www.pobox.com/~asl2)
_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev
