Hi all,

I am using the following setup:
- Debian etch Linux
- PyLucene GCC, latest from the GCC trunk
- gcc 4.2.1 with -DLARGE_CONFIG added to the source
- a large index: 17 GB, 50M documents

In this index, I want to find the documents in which two words co-occur. For this, I use a BooleanQuery:

q = PyLucene.BooleanQuery()
q.add(PyLucene.TermQuery(PyLucene.Term('profile', 'umls/C0086418')), PyLucene.BooleanClause.Occur.MUST)
q.add(PyLucene.TermQuery(PyLucene.Term('profile', 'umls/C0003062')), PyLucene.BooleanClause.Occur.MUST)

In this case, the two terms co-occur in about 30,000 documents.
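
To be clear about what I am after: conceptually, the number I want is the size of the intersection of the two terms' document lists. Here is a toy sketch in plain Python. It is not PyLucene code, the function name is my own, and the document ids are made up for illustration:

```python
def cooccurrence_count(postings_a, postings_b):
    """Count the doc ids that appear in both sorted lists of doc ids."""
    i = j = count = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            # Same document contains both terms.
            count += 1
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return count

# Made-up doc ids, sorted ascending, one list per term.
docs_with_first = [1, 4, 7, 9, 12]
docs_with_second = [2, 4, 9, 10, 12, 15]
print(cooccurrence_count(docs_with_first, docs_with_second))
```

Note that nothing in this count needs the 'date' field or any ordering of the results.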

This all works fine for a plain search, which uses about 120 MB of memory. However, if I sort on another field using PyLucene.Sort('date', False), I get the warning "GC Warning: Repeated allocation of very large block", and the process uses about 500 MB of memory.

Interestingly, if I use a query term that does not occur in the index (so the co-occurrence count is 0), the sorted search still uses about 500 MB of memory. Also, before I compiled with -DLARGE_CONFIG, memory use was lower, but the warning was still there.
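
My guess (an assumption, I have not checked the Lucene source) is that sorting builds a per-document table for the 'date' field that covers the whole index, which would explain why the cost does not depend on the number of hits. A rough estimate in plain Python, assuming one 4-byte entry per document:

```python
# Rough estimate only. The 4-byte entry size is an assumption,
# not something I have verified in Lucene or PyLucene.
NUM_DOCS = 50000000      # documents in the index
BYTES_PER_ENTRY = 4      # assumed: one 32-bit slot per document

def sort_table_bytes(num_docs, bytes_per_entry=BYTES_PER_ENTRY):
    """Size in bytes of a hypothetical per-document sort table."""
    return num_docs * bytes_per_entry

table_mb = sort_table_bytes(NUM_DOCS) / (1024.0 * 1024.0)
print("per-document table: about %.0f MB" % table_mb)
```

A table like that would account for only part of the extra memory, but a single allocation of that size might be what triggers the warning.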

Is there a way to a) reduce the memory usage of the sorted search, or b) get the co-occurrence information in another, more memory-efficient way (and without the warnings)?

Thanks in advance for any insights,

Best,

Marc

_______________________________________________
pylucene-dev mailing list
[email protected]
http://lists.osafoundation.org/mailman/listinfo/pylucene-dev