Thanks Bogdan, I've been meaning to bring this up. Solr used a TreeMap in the past (when it handled its own deletes) for the exact same reason. In my profiling, I've also seen applyDeletes() taking the bulk of the time with small/simple document indexing.
So we should definitely go in sorted order (either via TreeMap or by sorting the HashMap).

-Yonik
http://www.lucidimagination.com

On Fri, Nov 20, 2009 at 7:21 AM, Bogdan Ghidireac <bog...@ecstend.com> wrote:
> Hi,
>
> One of the use cases of my application involves updating the index with
> 10 to 10k docs every few minutes. Because we maintain a PK for each
> doc, we have to use IndexWriter.updateDocument to stay consistent.
>
> The average time for an update when we commit every 10k docs is around
> 17ms (the IndexWriter buffer is 100MB). I profiled the application for
> several hours and noticed that most of the time is spent in
> IndexWriter.applyDeletes() -> TermDocs.seek(). I changed
> BufferedDeletes.terms from a HashMap to a TreeMap to keep the terms
> ordered and reduce the number of random seeks on disk.
>
> I ran my tests again with the patched Lucene 2.9.1 and the time
> dropped from 17ms to 2ms. The index is 18GB with 70 million docs.
>
> I cannot send a patch because my company has some strict and
> time-consuming policies about open source, but the change is small and
> can be applied easily.
>
> Regards,
> Bogdan
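For anyone skimming the thread, here is a rough sketch of the idea being discussed. This is not the actual Lucene 2.9 code: the String keys and the seek()/deleteDocsUpTo() helpers below are stand-ins for the Term objects, BufferedDeletes.Num values, and TermDocs internals that applyDeletes() really works with.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    // Sketch of why iterating buffered delete terms in sorted order helps:
    // applyDeletes() seeks the term dictionary once per buffered term, and
    // sorted terms turn scattered random seeks into a mostly forward scan.
    public class SortedDeletesSketch {

        // Unsorted: iteration order is effectively arbitrary, so each seek()
        // can land anywhere in the on-disk term dictionary.
        static Map<String, Integer> hashBuffered = new HashMap<String, Integer>();

        // Sorted: TreeMap iterates keys in term order, so successive seeks
        // move forward through the (already sorted) term dictionary.
        static Map<String, Integer> sortedBuffered = new TreeMap<String, Integer>();

        static void applyDeletes(Map<String, Integer> bufferedTerms) {
            for (Map.Entry<String, Integer> e : bufferedTerms.entrySet()) {
                seek(e.getKey());             // one term-dictionary seek per term
                deleteDocsUpTo(e.getValue()); // delete docs buffered before this point
            }
        }

        // Placeholders for TermDocs.seek(Term) and the per-term delete loop.
        static void seek(String term) { /* ... */ }
        static void deleteDocsUpTo(int docIdUpto) { /* ... */ }
    }

The alternative Yonik mentions (keeping the HashMap and sorting its keys once inside applyDeletes) would give the same on-disk access pattern while avoiding the per-insert cost of a TreeMap.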