I noticed that commit() was taking an inordinately long time. It turned out
IndexWriter was flushing using only a single thread because it relies on
its caller to supply it with threads (via updateDocument, deleteDocument,
etc.), which it then "hijacks" to do flushing. If (as we do) a caller
indexes a lot of documents and then calls commit() at the end of a large
batch, when no indexing is ongoing, the commit() takes much longer than
needed since it is unable to make use of multiple cores to do concurrent
I/O.

How can we support this batch-mode use case better? I think we should -
it's not an unreasonable thing to do, since it can lead to the shortest
overall indexing time if you have sufficient RAM and don't need to search
until you're done indexing. I tried adding an IndexWriter.yield() method
that just flushes pending segments and does other queued work; the caller
can invoke this in order to provide resources. A more convenient API would
be to grant IndexWriter an ExecutorService of its own, but this is more
involved since it would be necessary to arbitrate where the work should be
done. Maybe a middle ground would be to offer a commit(ExecutorService)
method. Any other ideas? Any interest in a patch for IndexWriter.yield()?
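
To make the commit(ExecutorService) idea a bit more concrete, here is a
rough, self-contained sketch of the shape it might take. This is not
IndexWriter code: BatchWriterSketch, addPendingSegment, flushSegment and
the commit(ExecutorService) signature are all hypothetical names invented
for illustration, and the "flush" is a placeholder for the real I/O. It
only shows pending segments being fanned out to a caller-supplied pool
rather than flushed one by one on the committing thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy stand-in for a writer that buffers segments in RAM; not Lucene code. */
public class BatchWriterSketch {
    private final List<String> pending = new ArrayList<>();
    private final List<String> flushed = new ArrayList<>();

    /** Buffers a segment name; a real writer would buffer documents instead. */
    public synchronized void addPendingSegment(String name) {
        pending.add(name);
    }

    /**
     * Hypothetical commit(ExecutorService): submits one flush task per
     * pending segment to the caller's pool, then waits for all of them.
     * Returns the number of segments flushed by this call.
     */
    public int commit(ExecutorService executor)
            throws InterruptedException, ExecutionException {
        List<String> toFlush;
        synchronized (this) {
            // Snapshot and clear the pending list under the lock.
            toFlush = new ArrayList<>(pending);
            pending.clear();
        }
        List<Future<String>> futures = new ArrayList<>();
        for (String segment : toFlush) {
            Callable<String> task = () -> flushSegment(segment);
            futures.add(executor.submit(task));
        }
        for (Future<String> future : futures) {
            String done = future.get(); // blocks until that flush finishes
            synchronized (this) {
                flushed.add(done);
            }
        }
        return toFlush.size();
    }

    /** Placeholder for the real per-segment flush work (assumption). */
    private String flushSegment(String segment) {
        return segment + ".flushed";
    }

    public synchronized List<String> flushedSegments() {
        return new ArrayList<>(flushed);
    }

    public synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) throws Exception {
        BatchWriterSketch writer = new BatchWriterSketch();
        for (int i = 0; i < 4; i++) {
            writer.addPendingSegment("_" + i);
        }
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            int count = writer.commit(pool);
            System.out.println("flushed " + count + " segments: "
                    + writer.flushedSegments());
        } finally {
            pool.shutdown();
        }
    }
}
```

The caller keeps ownership of the pool here, which sidesteps the
arbitration question that a writer-owned ExecutorService would raise.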

-Mike
