On Tue, Jan 26, 2021 at 8:29 AM Adrien Grand <jpou...@gmail.com> wrote:
> For full text collections, I believe that the bottleneck is usually 
> terms+postings so it might not save much time. Maybe we could also 
> parallelize on a per-field basis by writing to temporary files and then 
> copying the raw data to the target segment part. For instance for the 
> Wikipedia dataset we use for nightly benchmarks, maybe the inverted indexes 
> for 'title' and 'body' could be merged in parallel this way.

If you want to experiment with something like that, you can hackishly
simulate it today to quickly see the overhead, correct? It's a small
hack to PerFieldPostingsFormat to force it to emit "files-per-field",
and then CFS will combine it all together.
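
A rough, hypothetical sketch of what that hack could look like (this assumes
PerFieldPostingsFormat groups fields by the delegate instance it gets back,
since PostingsFormat doesn't define equals, and uses Lucene84PostingsFormat
as the delegate; substitute whatever the current default format is):

    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.lucene84.Lucene84PostingsFormat;
    import org.apache.lucene.codecs.perfield.PerFieldPostingsFormat;

    // Hypothetical hack: return a fresh delegate per field so each field lands
    // in its own group, gets its own suffix, and writes its own set of postings
    // files; CFS then packs them all into one compound file at the end.
    public class FilesPerFieldPostingsFormat extends PerFieldPostingsFormat {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        return new Lucene84PostingsFormat();
      }
    }

That should be enough to measure the overhead of splitting files per field
without changing any actual file formats.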

But doing it explicitly and then making our own internal "compound"
seems kinda risky: wouldn't all the offsets be wrong without further
file changes (e.g. a per-field "start offset" recording where the
postings for that field begin)?
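
To make the offsets concern concrete, here is a hypothetical sketch of the
"merge each field into a temp file, then copy the raw bytes" step (all names
made up): the raw byte copy itself is cheap, but any file pointer recorded
inside a field's temp data is still relative to that temp file, which is
exactly why the format would need a per-field start offset (or similar) that
readers add back.

    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;
    import org.apache.lucene.store.IndexOutput;

    // Hypothetical: concatenate per-field temp files into one target file,
    // remembering where each field's data starts.  Absolute pointers stored
    // inside a field's data still reference positions in the old temp file,
    // so readers would have to add the per-field base offset -- those are the
    // "further file changes" mentioned above.
    final class PerFieldConcatenator {
      static Map<String, Long> concat(Directory dir,
                                      Map<String, String> tempFilePerField,
                                      String destName,
                                      IOContext context) throws IOException {
        Map<String, Long> startOffsets = new LinkedHashMap<>();
        try (IndexOutput out = dir.createOutput(destName, context)) {
          for (Map.Entry<String, String> e : tempFilePerField.entrySet()) {
            startOffsets.put(e.getKey(), out.getFilePointer());
            try (IndexInput in = dir.openInput(e.getValue(), context)) {
              out.copyBytes(in, in.length());  // raw copy, no re-encoding
            }
          }
        }
        return startOffsets;
      }
    }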

And this does nothing to solve Dawid's problem of slow vectors: if you
have vectors on, that's always going to be the bottleneck, and those are
per-doc.
