OK, I'm just following up on my email from 29th April titled '[Performanc]' (don't you love it when you send before you've typed your subject line completely). The thread is here:

In summary, I still firmly believe that IndexWriter.maybeMergeSegments() is chewing a lot more CPU than would be ideal, so I ran a simple test. It's the same test I've done before, using mergeFactor(1000), maxBufferedDocs(10000), useCompoundFile(false), indexing 5 fields (user first/last name/email address).

As a baseline using the latest SVN source code, I'm getting an indexing rate of between 490-515 items/second over a number of runs. By applying the attached simple patch to IndexWriter, I'm getting between 945-970 items/second over a number of test runs. That's a significant speed-up.

All the patch does is defer the call to maybeMergeSegments() so it only runs every 2000 iterations (2000 is totally arbitrary on my part). I've verified with Luke that the index generated contains the same number of documents and the same number of terms, but I have not had a chance to properly set up my local environment to run the test cases.

Obviously the attached patch is a dirty hack of the highest order. In my case I'm re-indexing from scratch every time, so there may be a reason why we shouldn't be deferring these method calls. Perhaps the source code is optimized around incremental/batch updates to _existing_ indexes, with the penalty that creating a new index from scratch performs slower than one would like.

Perhaps IndexWriter could benefit from another setting that lets one configure how often maybeMergeSegments() is called? That could of course confuse more people than it helps.

I would really appreciate anyone's thoughts on this. I'll be very happy to be proven wrong, because it will just help me understand more of Lucene. I would hope that speeding up indexing would benefit everyone, particularly the large-scale sites out there.

cheers,

Paul Smith
IndexWriter.patch
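
[Editor's note: the attached patch itself is binary in the archive. The sketch below is a reconstruction of the idea described above, not the actual patch or the real IndexWriter internals; the class, field, and MERGE_CHECK_INTERVAL names are assumptions for illustration only.]

// Hypothetical illustration of deferring the merge check so it runs every
// N added documents instead of on every addDocument() call.
public class DeferredMergeSketch {

    // Arbitrary interval, mirroring the "every 2000 iterations" in the description.
    private static final int MERGE_CHECK_INTERVAL = 2000;

    private int docsSinceLastMergeCheck = 0;

    // Stand-in for IndexWriter.addDocument(): buffer the document, then only
    // occasionally pay the CPU cost of checking whether segments need merging.
    public void addDocument(Object doc) {
        bufferDocument(doc);
        if (++docsSinceLastMergeCheck >= MERGE_CHECK_INTERVAL) {
            docsSinceLastMergeCheck = 0;
            maybeMergeSegments(); // stock behaviour is to do this on every add
        }
    }

    // A final merge check is still needed (e.g. from close()/optimize()) so the
    // resulting index ends up with the same documents and terms as before.
    public void close() {
        maybeMergeSegments();
    }

    private void bufferDocument(Object doc) { /* add to the in-memory segment */ }

    private void maybeMergeSegments() { /* walk the segment list and merge as required */ }
}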