Thanks Bogdan, I've been meaning to bring this up.
Solr used a TreeMap in the past (when it handled its own deletes) for
exactly the same reason.  In my profiling, I've also seen applyDeletes()
taking the bulk of the time with small/simple document indexing.

So we should definitely apply the deletes in sorted term order (either
via a TreeMap or by sorting the HashMap's keys first).
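
For reference, here's a rough, self-contained sketch of the difference
(plain Java, not the actual BufferedDeletes code; the map contents and the
seek() stand-in are made up purely for illustration):

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a HashMap yields the buffered delete terms in
// effectively random order, so each seek jumps around the term dictionary.
// A TreeMap (or sorting the HashMap's keys up front) yields them in term
// order, so the seeks advance mostly sequentially.
public class SortedDeletesSketch {

    public static void main(String[] args) {
        // Pretend these are buffered delete terms (PK values) mapped to
        // the docIDUpto limit recorded for each one.
        Map<String, Integer> unsorted = new HashMap<String, Integer>();
        unsorted.put("pk:9321", 100);
        unsorted.put("pk:0007", 101);
        unsorted.put("pk:5512", 102);

        System.out.println("HashMap order (random seeks):");
        for (Map.Entry<String, Integer> e : unsorted.entrySet()) {
            seek(e.getKey());   // stand-in for TermDocs.seek(term)
        }

        // The proposed change: keep (or copy) the terms in a TreeMap so
        // iteration, and therefore the on-disk seeks, happen in term order.
        Map<String, Integer> sorted = new TreeMap<String, Integer>(unsorted);

        System.out.println("TreeMap order (sequential seeks):");
        for (Map.Entry<String, Integer> e : sorted.entrySet()) {
            seek(e.getKey());
        }
    }

    // Placeholder for the real TermDocs.seek(Term); here we just log.
    private static void seek(String term) {
        System.out.println("  seek(" + term + ")");
    }
}

With sorted terms the seeks walk forward through the term dictionary
instead of jumping around, which is where the win on a large index comes
from.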

-Yonik
http://www.lucidimagination.com

On Fri, Nov 20, 2009 at 7:21 AM, Bogdan Ghidireac <bog...@ecstend.com> wrote:
> Hi,
>
> One of the use cases of my application involves updating the index with
> 10 to 10k docs every few minutes. Because we maintain a PK for each
> doc, we have to use IndexWriter.updateDocument to be consistent.
>
> The average time for an update when we commit every 10k docs is around
> 17ms (the IndexWriter buffer is 100MB). I profiled the application for
> several hours and noticed that most of the time is spent in
> IndexWriter.applyDeletes()->TermDocs.seek(). I changed
> BufferedDeletes.terms from a HashMap to a TreeMap to keep the terms
> ordered and reduce the number of random seeks on disk.
>
> I ran my tests again with the patched Lucene 2.9.1 and the time
> dropped from 17ms to 2ms. The index is 18GB and has 70 million docs.
>
> I cannot send a patch because my company has some strict and
> time-consuming policies about open source, but the change is small and
> can be applied easily.
>
> Regards,
> Bogdan

