These sorts of tricks can help somewhat if index I/O is really your bottleneck. Are you convinced that it is? When I/O is the bottleneck, the CPU typically spends a large portion of its time idle. Do you see this?
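
One rough way to check, short of a full profiler, is to compare the CPU time a thread consumes with the wall-clock time that elapses while it works. The sketch below is only an illustration: the class name CpuVsWall and the helper cpuFraction are made up for this example, it assumes the work runs on a single thread, and it assumes a JVM that supports per-thread CPU timing through the standard ThreadMXBean. A ratio near 1.0 suggests the work is CPU-bound (parsing, analysis); a ratio well below 1.0 suggests the thread is mostly waiting, for example on disk.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuVsWall {
    /**
     * Runs the task on the calling thread and returns the fraction of
     * wall-clock time that the thread actually spent on the CPU.
     * Hypothetical helper for illustration; not part of Lucene.
     */
    static double cpuFraction(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Assumption: thread CPU timing is supported and enabled on this JVM.
        long cpuStart = threads.getCurrentThreadCpuTime();
        long wallStart = System.nanoTime();
        task.run();
        long wallNanos = System.nanoTime() - wallStart;
        long cpuNanos = threads.getCurrentThreadCpuTime() - cpuStart;
        return wallNanos <= 0 ? 0.0 : (double) cpuNanos / wallNanos;
    }

    public static void main(String[] args) {
        // Stand-in for CPU-heavy work such as parsing documents.
        double busy = cpuFraction(() -> {
            long sum = 0;
            for (long i = 0; i < 200_000_000L; i++) {
                sum += i * i;
            }
            if (sum == 42) {
                System.out.println(sum); // keeps the loop from being optimized away
            }
        });

        // Stand-in for a thread that is blocked waiting (here, simply sleeping).
        double waiting = cpuFraction(() -> {
            try {
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        System.out.printf("cpu-bound stand-in: %.2f%n", busy);
        System.out.printf("waiting stand-in:   %.2f%n", waiting);
    }
}
```

To apply this to a real run, the Runnable passed to cpuFraction would wrap the indexing of a batch of documents. Watching overall CPU utilization with the operating system's own tools during a run gives the same kind of answer with no code at all.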

From your description (indexing ~300k 5k documents takes over 24 hours) I would be very surprised if index I/O is your bottleneck. Rather, I would suspect the XML parsing or somesuch.

In general, Lucene's default settings are designed to give good performance. If pumping up some parameter made a huge performance improvement with little other impact, then it would be pumped up by default. Increasing the mergeFactor speeds things up somewhat, but it also causes more file handles to be used.

When Karl talks of "flushing" a RAM-based index to disk, I suspect he's using IndexWriter.addIndexes(). Reading his message, I'd be surprised if his performance is really much better than it would be if he just set mergeFactor to 50 and then optimized the index just once at the end, which is a lot less work.

Doug

Michael Barry wrote:
Thanks for all the info. I've been working on streamlining my indexing and I've finally
found the message from last year that intrigued me
(http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=1220).

In that message, Karl Øie suggests


1. use a ramdir, and multiple fsdirs
2. merge the fsdirs into a single fsdir
3. use threads

(Of course he provides more details.)

I have a question concerning RAMDirectories - is there any benefit to using them over setting the
mergeFactor higher? Also, I notice a lot of advice to use RAMDirectories but not much verbiage on
how to use them effectively.


In the above msg from Karl, he suggests writing to a RAMDirectory and then at
some point flushing the RAMDirectory to an FSDirectory. Anyone have any code to illuminate
that? It's the "flushing" part that's getting me. Is flushing just calling list() on the
RAMDirectory and then calling deleteFile() on each one? Originally I was just creating a new
RAMDirectory each time I needed one (not the best approach but it does work).


I know I should spend time profiling the code to see exactly where the bottlenecks
occur, and I will do that, but I'd also like to get a good handle on the multiple ways to
index.


Thanks for your time, Mike.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


