mikemccand opened a new issue, #12477:
URL: https://github.com/apache/lucene/issues/12477

   ### Description
   
   Lucene has an efficient (storage and CPU) compressor for monotonic long 
values, that simply makes "best fit" (ish?) linear model to the N monotonic 
values, and then encodes the precise error signal (positive and negative) of 
each value.  I think we use it for doc values in certain cases?  Or maybe only 
for in-memory data structures?
   
   Lucene's block postings is different: it encodes the docid delta between 
each document.  But because postings are encoded this way, we have a 
linear/chained data dependency and must sequentially add up all the docid 
deltas to get the true docid of each of the 128 postings in the block.
   
   Could we change postings to instead encode with the linear fit?  We'd maybe 
lose some compression (having to store negative and positive -- `ZInt`), but 
then decoding could be done concurrently with simple SIMD math, and then 
skipping might be able to do a binary search within the block?
   
   I know this (efficient block `int[]` encoding for SIMD decode) is a 
well/heavily studied area in the literature :)  I'm sure there is already a 
good option on  [Daniel Lemire's blog](https://lemire.me/blog/) somewhere!
   
   It's not so simple, though, because we also accumulate the `freq` of each 
posting so we can know where we are in the positions/payloads block space.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to