You might want to look at my indexing of 6.4 million PDF articles, full-text and metadata. The run took 20.5 hours and produced an 83GB index. It uses multiple writers and is massively multithreaded.
More info here:
http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html
Check out the notes at the bottom for details.

To make threading and queues much easier and more robust, you want to use java.util.concurrent.ThreadPoolExecutor:
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html

Even with this, I've also had problems like the ones you describe. One thing I've found is that you need to shut the ThreadPoolExecutor down correctly, something like:

    // Stop accepting new tasks; already-queued tasks still run.
    threadPoolExecutor.shutdown();
    // Poll until every task has finished (ShutdownDelay is a sleep interval in ms).
    while (!threadPoolExecutor.isTerminated()) {
        try {
            Thread.sleep(ShutdownDelay);
        } catch (InterruptedException ie) {
            System.out.println(" interrupted");
        }
    }

You also need to simplify your threading so as to reduce the possibility of deadlock.

I hope this is useful.

-Glen

2008/10/23 Sudarsan, Sithu D. <[EMAIL PROTECTED]>:
>
> Hi,
>
> We are trying to index large collection of PDF documents, sizes varying
> from few KB to few GB. Lucene 2.3.2 with jdk 1.6.0_01 (with PDFBox for
> text extraction) and on Windows as well as CentOS Linux. Used java -Xms
> and -Xmx options, both at 1080m, even though we have 4GB on Windows and
> 32 GB on Linux with sufficient swap space.
>
> With just one thread, though it takes time, the indexing happens. To
> speed up, we tried multi-threaded approach with one Indexwriter for each
> thread. After all the threads finish their indexing, they are merged.
> With about 100 sample files and 10 threads, the program works pretty
> well and it does speed up. But, when we run on document collection of
> about 25GB, couple of threads just hang, while the rest have completed
> their indexing. The program never gracefully exits, and the threads that
> seem to have died ensure that the final index merging does not take
> place. The program needs to be manually terminated.
>
> Tried both with simple analyzer as well as standard analyzer, with
> similar results.
>
> Any useful tips / solutions welcome.
>
> Thanks in advance,
> Sithu Sudarsan
> Graduate Research Assistant, UALR
> & Visiting Researcher, CDRH/OSEL
>
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
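
P.S. To make the shutdown advice above concrete, here is a minimal, self-contained sketch of the submit / shutdown / wait pattern. It is an illustration under assumptions, not the code from my benchmark: the class name PoolShutdownSketch, the method runAll, and the placeholder task body are all made up for this example, and no Lucene or PDFBox calls are shown. It also swaps the sleep-and-poll loop for awaitTermination with a timeout, which blocks until the pool terminates or the timeout passes.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class; the names here are not from the original post.
class PoolShutdownSketch {

    // Runs `tasks` placeholder jobs on a fixed pool of `threads` workers and
    // returns how many finished. Each job stands in for "extract text from
    // one PDF and add it to an index".
    static int runAll(int threads, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
        final AtomicInteger done = new AtomicInteger();

        for (int i = 0; i < tasks; i++) {
            final int id = i;
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        // Placeholder for the real per-document work.
                        done.incrementAndGet();
                    } catch (Throwable t) {
                        // Report the failure and let the worker move on to
                        // the next task instead of failing silently.
                        System.err.println("task " + id + " failed: " + t);
                    }
                }
            });
        }

        // Stop accepting new tasks; tasks already queued still run.
        pool.shutdown();
        try {
            // Block in one-second slices until every task has finished.
            while (!pool.awaitTermination(1, TimeUnit.SECONDS)) {
                System.out.println("still waiting, active threads: "
                        + pool.getActiveCount());
            }
        } catch (InterruptedException ie) {
            // Restore the interrupt flag and ask running tasks to stop.
            Thread.currentThread().interrupt();
            pool.shutdownNow();
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println("completed " + runAll(4, 20));
    }
}
```

Note that a task which never returns will still keep the pool from terminating. The periodic message in the wait loop at least shows that something is still running, which is a starting point for finding the document or thread that hangs.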