I like the idea, Paul. As for how it should be implemented, perhaps a count of docs in memory should be kept. It doesn't seem necessary to traverse all of the segments on every add (it's a linear operation, and will only result in a merge every "minMergeDocs" or "maxBufferedDocs" documents).
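
A minimal sketch of the counting idea, written as a standalone toy class rather than a patch against IndexWriter. The class, field, and method names are made up for illustration, the merge check itself is stubbed out, and resetting the count after each check is an assumption about how the buffer would be flushed:

```java
// Toy model only: this is NOT Lucene's IndexWriter. All names are hypothetical.
class BufferedDocCounterSketch {
    private final int maxBufferedDocs; // threshold, analogous to minMergeDocs/maxBufferedDocs
    private int bufferedDocs = 0;      // docs currently held in memory
    private int mergeChecks = 0;       // how often the expensive check actually ran

    BufferedDocCounterSketch(int maxBufferedDocs) {
        if (maxBufferedDocs < 1) {
            throw new IllegalArgumentException("maxBufferedDocs must be >= 1");
        }
        this.maxBufferedDocs = maxBufferedDocs;
    }

    // Called once per added document: O(1) bookkeeping instead of a segment scan.
    void addDocument() {
        bufferedDocs++;
        if (bufferedDocs >= maxBufferedDocs) {
            maybeMergeSegments();
            bufferedDocs = 0; // assumption: the in-memory docs are flushed by the merge
        }
    }

    // Stand-in for the linear traversal of all segments.
    private void maybeMergeSegments() {
        mergeChecks++;
    }

    int mergeChecks() { return mergeChecks; }

    int bufferedDocs() { return bufferedDocs; }
}
```

With a threshold of 10000, adding 10000 documents would run the segment traversal once instead of 10000 times.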

-Yonik

On 5/16/05, Paul Smith <[EMAIL PROTECTED]> wrote:
> In summary, I still firmly believe that IndexWriter.maybeMergeSegments()
> is chewing a lot more CPU than would be ideal. So I ran a simple test. I
> ran the same test I've done before, using mergeFactor(1000),
> maxBufferedDocs(10000), useCompoundFile(false), indexing 5 fields (user
> first/lastname/email address).
>
> As a baseline using the latest SVN source code, I'm getting an indexing rate
> of between 490-515 items/second over a number of runs.
>
> By applying the attached simple patch to IndexWriter, I'm getting between
> 945-970 over a number of test runs. That's a significant speed up. All the
> patch is doing is deferring the call to maybeMergeSegments so it only does
> it every 2000 iterations (2000 is totally arbitrary on my part).
>
> I've verified with Luke that the index generated contains the same #
> documents, and same # terms, but I have not had a chance to properly set up
> my local environment to run the test cases.
>
> Obviously the attached patch is a dirty hack of the highest order. In my
> case I'm re-indexing from scratch every time, so there may be a reason why
> we shouldn't be doing this sort of deferring of method calls. Perhaps the
> source code is optimized around incremental/batch updates to _existing_
> indexes, but with the penalty that creating a new index performs slower
> than one would like.
>
> Perhaps IndexWriter could benefit from another setting that lets one
> configure how often to call maybeMergeSegments()? That could of course
> confuse more people than it helps.
>
> I would really appreciate anyone's thoughts on this. I'll be very happy to be
> proven wrong because it will just help me understand more of Lucene. I
> would hope that speeding up indexing would benefit everyone? Particularly
> the large scale sites out there.
>
> cheers,
>
> Paul Smith