+1 to make it simple to let multiple threads help with commit/refresh operations.
IW.yield is a simple way to achieve it, matching (roughly) how IW's
commit/refresh work today, hijacking incoming indexing threads to gain
concurrency. I think this would be a small change?

Adding an ExecutorService to e.g. IndexWriterConfig, so that all ops (commit,
refresh, and eventually also merging, which today still spawns its own
threads) could be concurrent when possible, would be a nice longer-term
solution, but I suspect that's a much more invasive change than the simple
IW.yield.

Progress not perfection :)

Mike McCandless

http://blog.mikemccandless.com


On Fri, Feb 15, 2019 at 4:11 PM Michael Sokolov <msoko...@gmail.com> wrote:

> I noticed that commit() was taking an inordinately long time. It turned out
> IndexWriter was flushing using only a single thread because it relies on
> its caller to supply it with threads (via updateDocument, deleteDocument,
> etc), which it then "hijacks" to do flushing. If (as we do) a caller
> indexes a lot of documents and then calls commit at the end of a large
> batch, when no indexing is ongoing, the commit() takes much longer than
> needed since it is unable to make use of multiple cores to do concurrent
> I/O.
>
> How can we support this batch-mode use case better? I think we should -
> it's not an unreasonable thing to do, since it can lead to the shortest
> overall indexing time if you have sufficient RAM and don't need to search
> until you're done indexing. I tried adding an IndexWriter.yield() method
> that just flushes pending segments and does other queued work; the caller
> can invoke this in order to provide resources. A more convenient API would
> be to grant IndexWriter an ExecutorService of its own, but this is more
> involved since it would be necessary to arbitrate where the work should be
> done. Maybe a middle ground would be to offer a commit(ExecutorService)
> method. Any other ideas? Any interest in a patch for IndexWriter.yield()?
>
> -Mike
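
For illustration only, here is a rough sketch of the yield() idea from the quoted message: caller-supplied threads drain a queue of pending flush work before the final commit. Nothing here is Lucene code. ToyWriter, addPendingSegment, yieldToWriter and flushWithHelpers are hypothetical names invented for this sketch, and the "flush" is just a counter increment standing in for real segment I/O.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class YieldSketch {

    // Toy stand-in for a writer with queued flush work. NOT Lucene's IndexWriter.
    static class ToyWriter {
        private final Queue<Runnable> pendingFlushes = new ConcurrentLinkedQueue<>();
        private final AtomicInteger flushed = new AtomicInteger();

        // Queue one simulated segment flush (a real flush would write files).
        void addPendingSegment() {
            pendingFlushes.add(flushed::incrementAndGet);
        }

        // Hypothetical analogue of the proposed yield(): the calling thread
        // runs queued flush tasks until the queue is empty.
        int yieldToWriter() {
            int ran = 0;
            Runnable task;
            while ((task = pendingFlushes.poll()) != null) {
                task.run();
                ran++;
            }
            return ran;
        }

        int pending() { return pendingFlushes.size(); }

        int flushedCount() { return flushed.get(); }
    }

    // The caller lends `helpers` threads to the writer, then waits for them.
    static int flushWithHelpers(ToyWriter writer, int helpers) {
        ExecutorService pool = Executors.newFixedThreadPool(helpers);
        for (int i = 0; i < helpers; i++) {
            pool.execute(writer::yieldToWriter);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("helpers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while flushing", e);
        }
        return writer.flushedCount();
    }

    public static void main(String[] args) {
        ToyWriter writer = new ToyWriter();
        for (int i = 0; i < 8; i++) {
            writer.addPendingSegment();
        }
        // In the batch-mode case, this would happen just before commit().
        int done = flushWithHelpers(writer, 4);
        System.out.println("flushed " + done + " segments"); // prints "flushed 8 segments"
    }
}
```

In this shape the caller decides how many threads to lend and when, which is what keeps the change small. An ExecutorService owned by the writer would instead have to decide for itself where each piece of work runs.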