It makes sense to me. I don't have the full picture, but I did just
implement merging for the vector format, and that, at least, could be
done fully concurrently with the other formats. I expect the same is
true of DocValues, Terms, etc. I'm not sure about the different kinds
of DocValues, though; they might need to be merged together.
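
To make the idea concrete, here is a rough sketch of running the parts
on a thread pool. It does not touch SegmentMerger at all: the class
name, part() and mergeParts() are hypothetical stand-ins, and the
DocValues kinds are grouped into a single task on the assumption that
they may need to run together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentMergeSketch {
    // Hypothetical stand-in for one independent part of a segment merge.
    // It only returns its own name; real per-format work would go here.
    static Callable<String> part(String name) {
        return () -> name;
    }

    // Runs all parts on a fixed pool and waits for every one of them.
    // A failure in any part surfaces as an ExecutionException.
    static List<String> mergeParts(List<Callable<String>> parts, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> done = new ArrayList<>();
            // invokeAll blocks until all tasks complete and returns the
            // futures in the same order as the submitted tasks.
            for (Future<String> f : pool.invokeAll(parts)) {
                done.add(f.get());
            }
            return done;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<String>> parts = List.of(
            part("vectors"),
            part("terms"),
            // all DocValues kinds in one task (assumption, see above)
            part("doc values"));
        System.out.println(mergeParts(parts, 3));
    }
}
```

Whether the real steps are this independent (shared state, field
infos, I/O contention) is exactly what would need checking first.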

On Mon, Jan 25, 2021 at 5:45 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>
>
> Hey everyone,
>
> I'm trying to cut the total wall-time of indexing for some fairly large 
> document collections on machines with a high CPU count (> 32 indexing 
> threads). So far my observations are:
>
> 1) I gave up on using the concurrent merge scheduler in favor of "same 
> thread" merging. This means the indexing thread that encounters a merge just 
> does it. The CMS is designed to favor concurrent searches over indexing and 
> it really didn't do anything I needed - in fact, I had to disable most things 
> it offers. I/O throttling and thread stalling are not really practical on 
> fast I/O in the absence of concurrent searches - you can literally just use 
> as many merge threads as needed to saturate the I/O.
>
> 2) Quite often everything is churning along nicely until the last few 
> merges combine the remaining huge segments and form a "long tail" where 
> most cores are just idle... Here comes my question - can we execute the individual 
> "parts" involved in segment merging (the logic inside SegmentMerger) in 
> separate threads? On the surface it looks like these steps can be done 
> independently (even if they're executed sequentially at the moment) but 
> perhaps I'm missing something?
>
> I'd like to ask before I try to tinker with it. Thanks for any feedback.
>
> Dawid

