Hello Tony,
Is it possible to write a block of docFreqs and then a block of
postingsStartOffsets?
Or why not write each pair as a single 10-bit integer and then split it
into a 4-bit quad and a 6-bit sextet in the postings format code?
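
To make the second idea concrete, here is a minimal sketch in plain Java. The
class TermMetaCodec and its methods are hypothetical names for illustration,
not Lucene APIs, and the 4/6 bit split is taken from the example quoted below.
The point is that every packed value has the same width (10 bits), so it could
presumably be handed to a writer that expects a single bit-width.

```java
// Hypothetical helper (not part of Lucene): combines docFreq (4 bits) and
// postingsStartOffset (6 bits) into one 10-bit value and splits it back.
final class TermMetaCodec {
  static final int DOC_FREQ_BITS = 4;
  static final int OFFSET_BITS = 6;
  static final int OFFSET_MASK = (1 << OFFSET_BITS) - 1;

  // docFreq goes in the high 4 bits, postingsStartOffset in the low 6 bits.
  static int pack(int docFreq, int postingsStartOffset) {
    if (docFreq < 0 || docFreq >= (1 << DOC_FREQ_BITS)) {
      throw new IllegalArgumentException("docFreq does not fit in 4 bits: " + docFreq);
    }
    if (postingsStartOffset < 0 || postingsStartOffset > OFFSET_MASK) {
      throw new IllegalArgumentException(
          "postingsStartOffset does not fit in 6 bits: " + postingsStartOffset);
    }
    return (docFreq << OFFSET_BITS) | postingsStartOffset;
  }

  // Recover the high 4 bits (the "quad").
  static int docFreq(int packed) {
    return packed >>> OFFSET_BITS;
  }

  // Recover the low 6 bits (the "sextet").
  static int postingsStartOffset(int packed) {
    return packed & OFFSET_MASK;
  }
}
```

On the read path, the entry for term i then sits at a fixed bit position
(i * 10), and the two fields come out with one shift and one mask.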

On Mon, Oct 16, 2023 at 11:50 PM Dongyu Xu <dongyu...@hotmail.com> wrote:

> Hi devs,
>
> As I was working on https://github.com/apache/lucene/issues/12513 I
> needed to compress positive integers which are used to locate postings etc.
>
> To put it concretely, I will need to pack a few values per term
> contiguously and those values can have different bit-width. For example,
> consider that we need to encode docFreq and postingsStartOffset per term
> and docFreq takes 4 bits and the postingsStartOffset takes 6 bits. We
> expect to write the following for two terms.
>
> ```
> |                   Term1                   |                   Term2                   |
> | docFreq(4bit) | postingsStartOffset(6bit) | docFreq(4bit) | postingsStartOffset(6bit) |
> ```
>
> On the read path, I expect to locate the offset for a term first and
> then read two values that have different bit-widths.
>
> In the spirit of not reinventing the wheel unnecessarily, I explored the
> existing PackedInts util classes and I believe there is no support for this
> at the moment. The biggest gap I found is that the existing classes expect
> to write/read values of the same bit-width.
>
> I'm writing to get feedback from y'all to see if I missed anything.
>
> Cheers,
> Tony X
>


-- 
Sincerely yours
Mikhail Khludnev
