Hi folks!

I've been curious to test out different block size configurations in the
Lucene postings list format, but thought I'd reach out to the community
here first to see what work may have gone into this previously. Essentially,
I'd like to benchmark a few block sizes against the real-world application
of Lucene I'm working on.

If I'm reading the code correctly, we currently encode compressed runs of
128 docs per block, relying on ForUtil for encoding and decoding. The block
size looks to be defined in ForUtil#BLOCK_SIZE (and referenced in a few
external classes), but I also know it's not as simple as changing that one
definition: much of the logic in ForUtil appears to rely on the assumption
of 128 docs per block.
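
To make the idea concrete, here is a toy sketch of fixed-width bit packing
with the block size as a constructor parameter. This is not Lucene's ForUtil
and makes no attempt at its SIMD-friendly layout; the class ToyBlockPacker
and all of its method names are hypothetical, and the multiple-of-64
restriction on the block size is my own simplifying assumption.

```java
/**
 * Toy fixed-width bit packer with a configurable block size.
 * Illustration only: NOT Lucene's ForUtil, and all names are hypothetical.
 */
final class ToyBlockPacker {
    private final int blockSize;

    public ToyBlockPacker(int blockSize) {
        // Assumption: a multiple of 64 keeps every block an exact number of longs.
        if (blockSize <= 0 || blockSize % 64 != 0) {
            throw new IllegalArgumentException("blockSize must be a positive multiple of 64");
        }
        this.blockSize = blockSize;
    }

    /** Smallest bit width (at least 1) that can hold every value in the array. */
    public static int bitsRequired(int[] values) {
        int or = 0;
        for (int v : values) {
            or |= v;
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** Packs one block of values, each of which must fit in bitsPerValue bits. */
    public long[] encode(int[] values, int bitsPerValue) {
        long[] packed = new long[blockSize / 64 * bitsPerValue];
        long bitPos = 0;
        for (int i = 0; i < blockSize; i++) {
            long v = values[i] & 0xFFFFFFFFL;
            int idx = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            packed[idx] |= v << shift;
            if (shift + bitsPerValue > 64) {
                // The value straddles two longs; spill its high bits into the next one.
                packed[idx + 1] |= v >>> (64 - shift);
            }
            bitPos += bitsPerValue;
        }
        return packed;
    }

    /** Unpacks one block previously produced by encode with the same bit width. */
    public int[] decode(long[] packed, int bitsPerValue) {
        int[] out = new int[blockSize];
        long mask = (1L << bitsPerValue) - 1;
        long bitPos = 0;
        for (int i = 0; i < blockSize; i++) {
            int idx = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[idx] >>> shift;
            if (shift + bitsPerValue > 64) {
                v |= packed[idx + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask);
            bitPos += bitsPerValue;
        }
        return out;
    }
}
```

With something like this, new ToyBlockPacker(128) and new ToyBlockPacker(256)
could be compared on the same data, though a generic loop of this kind says
little about how a hand-tuned implementation would behave at each size.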

I'm toying with the idea of making ForUtil flexible enough to support
different block sizes so I can run these benchmarks, but the class looks
heavily optimized to generate SIMD instructions (I think?), so that might
be folly. Before I start hacking on
a local branch to see what I can learn, is there any prior work that might
be useful to be aware of? Anyone gone down this path and have some
learnings to share? Any thoughts would be much appreciated!

Cheers,
-Greg
