Hi, We are trying to index large collection of PDF documents, sizes varying from few KB to few GB. Lucene 2.3.2 with jdk 1.6.0_01 (with PDFBox for text extraction) and on Windows as well as CentOS Linux. Used java -Xms and -Xmx options, both at 1080m, even though we have 4GB on Windows and 32 GB on Linux with sufficient swap space.
With just one thread, though it takes time, the indexing happens. To speed up, we tried multi-threaded approach with one Indexwriter for each thread. After all the threads finish their indexing, they are merged. With about 100 sample files and 10 threads, the program works pretty well and it does speed up. But, when we run on document collection of about 25GB, couple of threads just hang, while the rest have completed their indexing. The program never gracefully exits, and the threads that seem to have died ensure that the final index merging does not take place. The program needs to be manually terminated. Tried both with simple analyzer as well as standard analyzer, with similar results. Any useful tips / solutions welcome. Thanks in advance, Sithu Sudarsan Graduate Research Assistant, UALR & Visiting Researcher, CDRH/OSEL [EMAIL PROTECTED] [EMAIL PROTECTED]