Woops! I meant for this to go to java-dev... Mike
On Mon, 19 Mar 2007 08:09:37 -0400, "Michael McCandless" <[EMAIL PROTECTED]> said: > Hi, > > I've been looking into improving performance of IndexWriter, > specifically how it makes use of RAM to buffer added documents. > > I've created a new class (MultiDocumentWriter) that can build a single > segment from many documents at once, more efficiently than the current > single document segment approach. It buffers terms, freqs and > positions in memory and then periodically flushes them together. > > This only affects the creation of an initial segment from added > documents. I haven't changed anything after that, eg how segments are > merged. > > The basic ideas are: > > * Write stored fields and term vectors directly to disk (don't > use up RAM for these). > > * Gather posting lists & term infos in RAM, but periodically do > in-RAM flushes. Once RAM is full, flush buffers to disk (and > merge them later when it's time to make a real segment). > > * When it's time to really build a segment, merge all postings lists > (RAM and flushed) into the real segment files. > > * Recycle buffers/objects when possible (less stress & time spent on > GC). > > I think some of these changes are similar to how KinoSearch builds a > segment. But, I haven't made any changes to Lucene's file format nor > added requirements for a global fields schema. > > With this change you can now tell IndexWriter how much RAM it can use > before flushing, which I think is better than setting max buffered > docs when documents are variable in size. This is in fact the only > externally visible API change :) > > I'm still working through some lingering issues before I can make a > clean patch, but it now passes all unit tests except the disk full > tests (I think we would need to change error semantics on disk full). > > I've run some very initial performance tests and this approach > provides a good speedup when equalizing RAM usage for a fair > comparison, especially when the docs are small. (Note that this > speedup is just for the "indexing" part, and for many Lucene apps I > think other things (eg Analyzer, retrieving docs from the content > source, etc.) are the bottleneck. > > This change also makes "commit only on close" mode (autoCommit=false > to IndexWriter) especially efficient because no segment is produced > until you close the IndexWriter, so no normal segment merging takes > place for the entire session. You can build a massive index having > created only 1 segment at the end. > > Mike > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]