You might want to look at my indexing of 6.4 million PDF articles,
full text and metadata. It produced an 83GB index and took 20.5 hours
to run. It uses multiple writers and is massively multithreaded.

More info here:
http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html
Check out the notes at the bottom for details.

In order to make threading/queues much easier and more robust, you
want to use: java.util.concurrent.ThreadPoolExecutor
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html
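
A minimal sketch of what that could look like, assuming one task per
file. The class name, the counter standing in for the real
"extract text and add to index" work, and the queue size are all
hypothetical, not taken from the benchmark code. The bounded queue plus
CallerRunsPolicy keeps the submitting thread from racing ahead of the
workers and filling memory with pending tasks.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class IndexingPoolSketch {
    // Hypothetical stand-in for real work: counts tasks instead of indexing PDFs.
    static final AtomicInteger processed = new AtomicInteger();

    // Fixed-size pool with a bounded queue. When the queue is full,
    // CallerRunsPolicy makes the submitting thread run the task itself,
    // which throttles submission instead of rejecting work.
    static ThreadPoolExecutor newPool(int threads, int queueCapacity) {
        return new ThreadPoolExecutor(
                threads, threads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // Submits taskCount tasks, waits for them, and returns how many ran.
    static int runAll(int taskCount, int threads) throws InterruptedException {
        processed.set(0);
        ThreadPoolExecutor pool = newPool(threads, 2 * threads);
        for (int i = 0; i < taskCount; i++) {
            pool.execute(() -> processed.incrementAndGet());
        }
        pool.shutdown();                              // accept no new tasks
        pool.awaitTermination(1, TimeUnit.MINUTES);   // wait for queued tasks
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAll(100, 4));
    }
}
```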

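Even with these, I've also had problems like the ones you describe. One
thing I've found is that you need to shut the ThreadPoolExecutor down
correctly, something like:

    threadPoolExecutor.shutdown();
    // Poll until all submitted tasks have finished.
    // ShutdownDelay is the polling interval in milliseconds.
    while (!threadPoolExecutor.isTerminated()) {
        try {
            Thread.sleep(ShutdownDelay);
        } catch (InterruptedException ie) {
            System.out.println(" interrupted");
            // Restore the interrupt flag and stop waiting, so the
            // loop does not spin on an already-interrupted thread.
            Thread.currentThread().interrupt();
            break;
        }
    }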

You also need to simplify your threading so as to reduce the
possibility of deadlock.
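
Simplifying is the real fix, but a related safeguard (a different
technique, named plainly: per-task timeouts on Futures) can at least
keep one stuck task from blocking the final merge forever. The sketch
below is only an illustration under assumptions: the class and method
names are hypothetical, and the timeout is measured from the moment the
result is requested, not from when the task started.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HungTaskSketch {
    // Runs the tasks on a fixed pool and returns the indexes of tasks
    // that failed or did not finish within timeoutMs of being waited on.
    static List<Integer> findStuck(List<Callable<Void>> tasks,
                                   int threads,
                                   long timeoutMs) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Void>> futures = new ArrayList<>();
        for (Callable<Void> task : tasks) {
            futures.add(pool.submit(task));
        }
        List<Integer> stuck = new ArrayList<>();
        for (int i = 0; i < futures.size(); i++) {
            try {
                futures.get(i).get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                futures.get(i).cancel(true);  // interrupt the worker thread
                stuck.add(i);
            } catch (ExecutionException e) {
                stuck.add(i);                 // task threw an exception
            }
        }
        pool.shutdownNow();  // interrupts anything still running
        return stuck;
    }
}
```

Cancelling only helps if the task responds to interruption; a thread
blocked on a lock it can never acquire will not be freed this way, which
is why fewer shared locks is still the better cure.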

I hope this is useful.

-Glen

2008/10/23 Sudarsan, Sithu D. <[EMAIL PROTECTED]>:
>
> Hi,
>
> We are trying to index large collection of PDF documents, sizes varying
> from few KB to few GB.  Lucene 2.3.2 with jdk 1.6.0_01 (with PDFBox for
> text extraction) and on Windows as well as CentOS Linux. Used java -Xms
> and -Xmx options, both at 1080m, even though we have 4GB on Windows and
> 32 GB on Linux with sufficient swap space.
>
> With just one thread, though it takes time, the indexing happens. To
> speed up, we tried multi-threaded approach with one Indexwriter for each
> thread. After all the threads finish their indexing, they are merged.
> With about 100 sample files and 10 threads, the program works pretty
> well and it does speed up. But, when we run on document collection of
> about 25GB, couple of threads just hang, while the rest have completed
> their indexing. The program never gracefully exits, and the threads that
> seem to have died ensure that the final index merging does not take
> place. The program needs to be manually terminated.
>
> Tried both with simple analyzer as well as standard analyzer, with
> similar results.
>
> Any useful tips / solutions welcome.
>
> Thanks in advance,
> Sithu Sudarsan
> Graduate Research Assistant, UALR
> & Visiting Researcher, CDRH/OSEL
>
> [EMAIL PROTECTED]
> [EMAIL PROTECTED]
>
>


