Multi -threaded indexing of large number of PDF documents

Sudarsan, Sithu D. Thu, 23 Oct 2008 09:24:23 -0700

Hi,

We are trying to index large collection of PDF documents, sizes varying
from few KB to few GB.  Lucene 2.3.2 with jdk 1.6.0_01 (with PDFBox for
text extraction) and on Windows as well as CentOS Linux. Used java -Xms
and -Xmx options, both at 1080m, even though we have 4GB on Windows and
32 GB on Linux with sufficient swap space.


With just one thread, though it takes time, the indexing happens. To
speed up, we tried multi-threaded approach with one Indexwriter for each
thread. After all the threads finish their indexing, they are merged.
With about 100 sample files and 10 threads, the program works pretty
well and it does speed up. But, when we run on document collection of
about 25GB, couple of threads just hang, while the rest have completed
their indexing. The program never gracefully exits, and the threads that
seem to have died ensure that the final index merging does not take
place. The program needs to be manually terminated. 

Tried both with simple analyzer as well as standard analyzer, with
similar results.

Any useful tips / solutions welcome.

Thanks in advance,
Sithu Sudarsan
Graduate Research Assistant, UALR
& Visiting Researcher, CDRH/OSEL

[EMAIL PROTECTED]
[EMAIL PROTECTED]

Multi -threaded indexing of large number of PDF documents

Reply via email to