Hi,

We're using PyLucene to create an index of our Python-based database (for
searching). All works fine until the indexer is tested on a larger-scale
database (one with more than 2,000,000 documents). First we got an "IOError:
Too many open files" during index creation, which we could fix by switching
to the compound file format (writer.setUseCompoundFile(True)).
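
For reference, opening the writer now looks roughly like this (a minimal
sketch using the PyLucene 2.0-era API; the index path and the choice of
StandardAnalyzer are just placeholders):

  from PyLucene import IndexWriter, StandardAnalyzer

  # Create the index; StandardAnalyzer is only an example choice.
  writer = IndexWriter("/path/to/index", StandardAnalyzer(), True)
  # Compound file format keeps each segment in a single .cfs file,
  # which is what made the "Too many open files" IOError go away.
  writer.setUseCompoundFile(True)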

Next we observed "GC Warning" messages during index creation:
 GC Warning: Repeated allocation of very large block (appr. size 1466368):
             May lead to memory leak and poor performance.
             
Then, all of a sudden (after well over 8 hours of indexing), the process
crashed with a [Python] MemoryError.

We're using Python-2.5.1 with PyLucene 2.0.0.8 on SunOS Solaris. 
PyLucene was built with gcc version 4.1.2 (gcc/sparc-sun-solaris2.9/4.1.2)

'make test' succeeds, though there is one GC warning (across different tests):
 GC Warning: Large stack limit(2147479552): only scanning 8 MB

Has anyone run into similar problems, or does anyone know a solution?

We've been trying to track the problem down and even tried frequent
"manual cleanup" [by calling Python's gc.collect() and PyLucene's
System.gc(); see the sketch below], but that didn't seem to help: the
process still grows constantly during indexing (more details are given
below).
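
For completeness, the "manual cleanup" looked roughly like this (the
every-N-documents interval was arbitrary; System here is the java.lang
wrapper that PyLucene exposes):

  import gc
  from PyLucene import System

  def manual_cleanup():
      # Run both collectors: Python's own and the GCJ/Java one.
      # Neither reclaimed the steadily growing memory in our case.
      gc.collect()
      System.gc()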

I've gone through most of the mailing list - including the thread on
"pylucene and 2gb limit of files" - and found that similar problems have
been observed before. Usually the suggestion is to use a different (or
patched) version of GCC, though this doesn't seem to fix the problem in
all cases. I don't think the "2gb limit of files" can be the cause here:
the index (on disk) was actually only about 400 MB when the process died.

So next we'd like to test with GCC 4.2, if that is what's recommended. Is
there a recommended/stable install configuration (PyLucene + GCC/GCJ
version) for Sun Solaris?

Are the GCC/GCJ problems discussed on this list fixed in the current 4.2.0
release, and is that version recommended for use with PyLucene 2.0?

If there's any other way to get rid of the GC warning (and the memory
leak), that would of course be of interest too...


Any help would be appreciated.


Kind regards

Thomas Koch
--
OrbiTeam Software GmbH & Co. KG     
http://www.orbiteam.de
Bonn (Germany)         

--
Note:

Indexing is done in a batch process: it reads the whole database and adds
a new document to the index for each object in the database (a rough
sketch of the loop follows the log excerpt below).

 open index writer
 Merge factor:      100
 Max buffered docs: 100
 Max merge docs:    9999999
 12:56:13 running database scan...
 memory usage: 42.69 MB
...
[now while documents are added to the index the memory 
 consumption of the python process grows and grows...]
 13:23:18 added to index: 44691
 memory usage: 161.851562 MB
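
The loop itself is essentially the following sketch (the database
iteration and the field names are simplified placeholders; the writer
settings match the values logged above):

  from PyLucene import IndexWriter, StandardAnalyzer, Document, Field

  writer = IndexWriter("/path/to/index", StandardAnalyzer(), True)
  writer.setUseCompoundFile(True)      # avoids "Too many open files"
  writer.setMergeFactor(100)
  writer.setMaxBufferedDocs(100)
  writer.setMaxMergeDocs(9999999)

  for obj in database:                 # placeholder for our DB scan
      doc = Document()
      doc.add(Field("id", obj.id,
                    Field.Store.YES, Field.Index.UN_TOKENIZED))
      doc.add(Field("content", obj.text,
                    Field.Store.NO, Field.Index.TOKENIZED))
      writer.addDocument(doc)          # memory grows steadily here

  writer.optimize()
  writer.close()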

