It makes sense to me. I don't have the full picture, but I did just implement merging for the vector format, and that, at least, could be done fully concurrently with the other formats. I expect the same is true of DocValues, Terms, etc. I'm not sure about the different kinds of DocValues, though: they might want to be done together?
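
To make the idea concrete, here is a rough sketch in plain Java of running independent per-format merge steps on a thread pool and waiting for all of them to finish. It does not use any Lucene API: the class ParallelMergeSketch, the method runSteps, and the step names and return values are all hypothetical stand-ins, and it assumes the steps really are independent of one another. The DocValues kinds are grouped into a single task here because it is unclear whether they can safely be split.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMergeSketch {

    /**
     * Runs each named step on a fixed-size pool and collects the results.
     * Each Callable is a hypothetical stand-in for one merge step; the
     * Integer it returns is just a placeholder result (e.g. items merged).
     */
    static Map<String, Integer> runSteps(Map<String, Callable<Integer>> steps, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit everything first so the steps can overlap.
            Map<String, Future<Integer>> pending = new LinkedHashMap<>();
            for (Map.Entry<String, Callable<Integer>> e : steps.entrySet()) {
                pending.put(e.getKey(), pool.submit(e.getValue()));
            }
            // Then wait for each one; get() rethrows a step's failure
            // wrapped in an ExecutionException.
            Map<String, Integer> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Integer>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Callable<Integer>> steps = new LinkedHashMap<>();
        steps.put("vectors", () -> 3);   // placeholder work
        steps.put("terms", () -> 5);     // placeholder work
        // All DocValues kinds run sequentially inside one task (an assumption).
        steps.put("docValues", () -> {
            int numeric = 2;             // placeholder for one DocValues kind
            int sorted = 4;              // placeholder for another kind
            return numeric + sorted;
        });
        System.out.println(runSteps(steps, 4));
    }
}
```

Whether the real steps inside SegmentMerger share state that would break this kind of split is exactly the open question, so treat the above as an illustration of the shape only.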

On Mon, Jan 25, 2021 at 5:45 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>
> Hey everyone,
>
> I'm trying to cut the total wall-time of indexing for some fairly large document collections on machines with a high CPU count (> 32 indexing threads). So far my observations are:
>
> 1) I resigned from using the concurrent merge scheduler in favor of "same thread" merging. This means the indexing thread that encounters a merge just does it. The CMS is designed to favor concurrent searches over indexing and it really didn't do anything I needed - in fact, I had to disable most things it offers. I/O throttling and thread stalling are not really practical on fast I/O in the absence of concurrent searches - you can literally just use as many merge threads as needed to saturate the I/O.
>
> 2) It is quite frequent that everything is churning nicely until the last few merges combine huge smaller segments and form a "long-tail" where most cores are just idle... Here comes my question - can we execute the individual "parts" involved in segment merging (the logic inside SegmentMerger) in separate threads? On the surface it looks like these steps can be done independently (even if they're executed sequentially at the moment) but perhaps I'm missing something?
>
> I'd like to ask before I try to tinker with it. Thanks for any feedback.
>
> Dawid