jpountz commented on PR #15160: URL: https://github.com/apache/lucene/pull/15160#issuecomment-3258969230
Even though 512 performs better on benchmarks, I'm leaning towards going with 256:
- memory usage should be taken into account as well (each PostingsEnum maintains 3 arrays sized according to this block size: doc IDs, term frequencies, and a temporary array used for decoding)
- I like the simplicity of 256, where we can keep encoding indexes in the array as bytes. E.g. skip data for positions needs to record the offset in the positions block of the first position of the first doc ID in the doc block, and PFOR's patches need to record the indexes where exceptions happen. It's a bit simpler if these can be encoded as bytes.
- 256 is a bit safer wrt regressions: most queries get faster but, as we see with `FilteredTerm`, some queries also get slower.
- I only remember seeing 128 or 256 as block sizes in the IR literature. I have a bias towards not diverging too much from what is documented in the literature.
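To make the first two points concrete, here is a rough sketch, not code from the PR. `BlockIndexSketch` and its methods are hypothetical names, and the memory estimate assumes the three per-enum arrays hold 4-byte `int` values and ignores array headers; the actual element type may differ.

```java
// Hypothetical illustration only; none of these names are Lucene APIs.
final class BlockIndexSketch {

    // An index into a block ranges over [0, blockSize - 1].
    // It fits in one unsigned byte only if blockSize - 1 <= 255.
    public static boolean indexFitsInByte(int blockSize) {
        return blockSize - 1 <= 0xFF;
    }

    // Store an in-block index (0..255) in a single byte.
    public static byte encodeIndex(int index) {
        if (index < 0 || index > 0xFF) {
            throw new IllegalArgumentException("index does not fit in a byte: " + index);
        }
        return (byte) index;
    }

    // Java bytes are signed, so mask to recover the unsigned index.
    public static int decodeIndex(byte encoded) {
        return encoded & 0xFF;
    }

    // Rough buffer footprint of one enum: doc IDs, term frequencies and a
    // temporary decoding array. ASSUMPTION: int elements, headers ignored.
    public static long bufferBytesPerEnum(int blockSize) {
        return 3L * blockSize * Integer.BYTES;
    }

    public static void main(String[] args) {
        for (int blockSize : new int[] {128, 256, 512}) {
            System.out.println(
                "blockSize=" + blockSize
                    + " indexFitsInByte=" + indexFitsInByte(blockSize)
                    + " bufferBytesPerEnum=" + bufferBytesPerEnum(blockSize));
        }
    }
}
```

Under these assumptions, 256 is the largest block size whose indexes (0..255) still fit in one byte, while 512 would need a wider type. The per-enum buffers also double from 3072 to 6144 bytes when going from 256 to 512.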