+1 to make a single merge concurrent! It is horribly frustrating to watch that last merge running on a single core :) I have lost many hours of my life to this frustration.
I do think we need to explore concurrency within terms/postings across
fields in one segment to really see gains in the common case where merge
time is dominated by postings.

Mike McCandless
http://blog.mikemccandless.com

On Tue, Jan 26, 2021 at 9:09 AM Robert Muir <rcm...@gmail.com> wrote:

> On Tue, Jan 26, 2021 at 8:29 AM Adrien Grand <jpou...@gmail.com> wrote:
> > For full text collections, I believe that the bottleneck is usually
> > terms+postings, so it might not save much time. Maybe we could also
> > parallelize on a per-field basis by writing to temporary files and
> > then copying the raw data into the target segment. For instance, for
> > the Wikipedia dataset we use for nightly benchmarks, maybe the
> > inverted indexes for 'title' and 'body' could be merged in parallel
> > this way.
>
> If you want to experiment with something like that, you can hackishly
> simulate it today to quickly see the overhead, correct? It's a small
> hack to PerFieldPostingsFormat to force it to emit files-per-field,
> and then CFS will combine it all together.
>
> But doing it explicitly and then making our own internal "compound"
> seems kinda risky: wouldn't all the offsets be wrong without further
> file changes (e.g. a per-field "start offset" where all the postings
> for that field begin)?
>
> And this does nothing to solve Dawid's problem of slow vectors. If you
> have vectors on, that's always going to be the bottleneck, and those
> are per-doc.
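
To make the per-field concurrency idea above concrete, here is a minimal
fan-out/fan-in sketch of the shape such a merge could take. The
mergeField(...) helper, the field list, and the temporary-file naming are
placeholders for illustration, not real Lucene APIs; a real patch would
also need the per-field "start offset" bookkeeping Robert describes when
the temporary outputs are stitched back together.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerFieldMergeSketch {

  // Hypothetical helper: merge the postings for a single field into its
  // own temporary output and return that output's name. In a real
  // experiment this would live somewhere near PerFieldPostingsFormat.
  static String mergeField(String field) {
    // ... the actual per-field postings merge would go here ...
    return field + ".tmp";
  }

  public static void main(String[] args) throws Exception {
    // Example fields from the nightly Wikipedia benchmark.
    List<String> fields = List.of("title", "body");
    ExecutorService pool = Executors.newFixedThreadPool(fields.size());
    try {
      // Fan-out: one merge task per field, running concurrently.
      List<Future<String>> pending = new ArrayList<>();
      for (String field : fields) {
        pending.add(pool.submit(() -> mergeField(field)));
      }
      // Fan-in: wait for every per-field merge, then (in a real
      // implementation) concatenate the temporary files into the target
      // segment, recording where each field's postings begin.
      for (Future<String> f : pending) {
        System.out.println("merged: " + f.get());
      }
    } finally {
      pool.shutdown();
    }
  }
}
```

Note this sketch only helps when several large fields can be merged side
by side; as Robert points out, it does nothing for per-document data like
term vectors.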