On Tue, Jan 26, 2021 at 8:29 AM Adrien Grand <jpou...@gmail.com> wrote:

> For full text collections, I believe that the bottleneck is usually
> terms+postings so it might not save much time. Maybe we could also
> parallelize on a per-field basis by writing to temporary files and then
> copying the raw data to the target segment part. For instance, for the
> Wikipedia dataset we use for nightly benchmarks, maybe the inverted indexes
> for 'title' and 'body' could be merged in parallel this way.
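
For concreteness, here is a rough sketch of that per-field fan-out. This is not real Lucene internals: mergeFieldTo() is a hypothetical stand-in for whatever would actually merge one field's terms+postings from the source segments; only the executor and temp-file plumbing is concrete.

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class PerFieldMergeSketch {

      // hypothetical stand-in: merge one field's terms+postings from all
      // source segments into `out`
      static void mergeFieldTo(String field, Path out) throws IOException {
      }

      public static void main(String[] args) throws Exception {
        List<String> fields = List.of("title", "body");
        ExecutorService pool = Executors.newFixedThreadPool(fields.size());

        // one merge task per field, each writing raw data to its own temp file
        Map<String, Future<Path>> pending = new LinkedHashMap<>();
        for (String field : fields) {
          pending.put(field, pool.submit(() -> {
            Path tmp = Files.createTempFile("merge-" + field + "-", ".tmp");
            mergeFieldTo(field, tmp);
            return tmp;
          }));
        }

        // stitch the raw per-field data together in a fixed field order;
        // readers would need a per-field start offset recorded somewhere to
        // find each field's data again (see the offsets question below)
        try (FileChannel dst = FileChannel.open(Paths.get("merged.postings"),
            StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
          for (String field : fields) {
            Path tmp = pending.get(field).get();
            try (FileChannel src = FileChannel.open(tmp, StandardOpenOption.READ)) {
              src.transferTo(0, src.size(), dst);
            }
            Files.delete(tmp);
          }
        }
        pool.shutdown();
      }
    }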
If you want to experiment with something like that, you can hackishly simulate it today to quickly see the overhead, correct? It's a small hack to PerFieldPostingsFormat to force it to emit files-per-field, and then CFS will combine it all back together. But doing it explicitly and building our own internal "compound" seems kinda risky: wouldn't all the offsets be wrong without further file-format changes (e.g. a per-field "start offset" recording where that field's postings begin)?

And this does nothing to solve Dawid's problem of slow vectors. If you have term vectors enabled, that is always going to be the bottleneck, and those are per-doc.
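
To make the "hackish simulation" concrete, here is roughly what I mean. Whether per-field instances really force separate file groups depends on PerFieldPostingsFormat grouping fields by format instance (that is my reading of the 8.x code), and the codec/format class names below assume Lucene 8.x; treat this as a throwaway experiment, not a proposal.

    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene84.Lucene84PostingsFormat;
    import org.apache.lucene.codecs.lucene87.Lucene87Codec;
    import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;

    public class FilesPerFieldCodec extends FilterCodec {

      private final PostingsFormat postings = new PerFieldPostingsFormat() {
        @Override
        public PostingsFormat getPostingsFormatForField(String field) {
          // a fresh instance per field, so no two fields share a file group
          return new Lucene84PostingsFormat();
        }
      };

      public FilesPerFieldCodec() {
        super("FilesPerFieldCodec", new Lucene87Codec());
      }

      @Override
      public PostingsFormat postingsFormat() {
        return postings;
      }
    }

Write a test index with IndexWriterConfig.setCodec(new FilesPerFieldCodec()) and look at what ends up in the segment (or inside the .cfs) to gauge the per-field file overhead; reading the index back afterwards would additionally need the codec registered via SPI.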