Hello Tony,

Is it possible to write a block of docFreqs and then a block of postingsStartOffsets? Or why not write them as 10-bit integers and then split each one into a 4-bit quad and a 6-bit sextet in the posting format code?
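To make the second idea concrete, here is a rough sketch in plain Java, not using the PackedInts classes. It assumes the 4-bit docFreq and 6-bit postingsStartOffset from your example, so every term occupies a fixed 10 bits and the term's position can be computed directly. The class and method names (PackedPairSketch, pack, readCombined) are made up for illustration and are not existing Lucene APIs.

```java
// Hypothetical sketch: fixed-width 10-bit entries, split into 4 + 6 bits on read.
final class PackedPairSketch {
    static final int DOC_FREQ_BITS = 4;   // assumed width, taken from the example
    static final int OFFSET_BITS = 6;     // assumed width, taken from the example
    static final int BITS_PER_TERM = DOC_FREQ_BITS + OFFSET_BITS;

    // Packs one (docFreq, postingsStartOffset) pair per term into a long[].
    static long[] pack(int[] docFreqs, int[] offsets) {
        int n = docFreqs.length;
        long[] words = new long[(n * BITS_PER_TERM + 63) / 64];
        for (int i = 0; i < n; i++) {
            if ((docFreqs[i] >>> DOC_FREQ_BITS) != 0 || (offsets[i] >>> OFFSET_BITS) != 0) {
                throw new IllegalArgumentException("value does not fit at term " + i);
            }
            // docFreq goes in the high 4 bits, the offset in the low 6 bits.
            long combined = ((long) docFreqs[i] << OFFSET_BITS) | offsets[i];
            int bitPos = i * BITS_PER_TERM;
            int word = bitPos >>> 6;
            int shift = bitPos & 63;
            words[word] |= combined << shift;
            if (shift + BITS_PER_TERM > 64) {
                // The entry straddles two words: spill the remaining high bits.
                words[word + 1] |= combined >>> (64 - shift);
            }
        }
        return words;
    }

    // Reads the 10-bit entry for a term; its position is just term * 10 bits.
    static int readCombined(long[] words, int term) {
        int bitPos = term * BITS_PER_TERM;
        int word = bitPos >>> 6;
        int shift = bitPos & 63;
        long v = words[word] >>> shift;
        if (shift + BITS_PER_TERM > 64) {
            v |= words[word + 1] << (64 - shift);
        }
        return (int) (v & ((1L << BITS_PER_TERM) - 1));
    }

    static int docFreq(long[] words, int term) {
        return readCombined(words, term) >>> OFFSET_BITS;
    }

    static int postingsStartOffset(long[] words, int term) {
        return readCombined(words, term) & ((1 << OFFSET_BITS) - 1);
    }

    public static void main(String[] args) {
        long[] words = pack(new int[] {3, 9}, new int[] {17, 42});
        System.out.println(docFreq(words, 1) + " " + postingsStartOffset(words, 1));
    }
}
```

Since the combined width is the same for every term, the write and read sides only ever deal with one bit width, which is what the existing packed writers expect. Whether a 10-bit width is handled efficiently by a given Lucene class is something I have not checked.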

On Mon, Oct 16, 2023 at 11:50 PM Dongyu Xu <dongyu...@hotmail.com> wrote:

> Hi devs,
>
> As I was working on https://github.com/apache/lucene/issues/12513 I
> needed to compress positive integers which are used to locate postings etc.
>
> To put it concretely, I need to pack a few values per term
> contiguously, and those values can have different bit widths. For example,
> consider that we need to encode docFreq and postingsStartOffset per term,
> where docFreq takes 4 bits and postingsStartOffset takes 6 bits. We
> expect to write the following for two terms.
>
> ```
> Term1 | Term2
>
> docFreq(4bit) | postingsStartOffset(6bit) | docFreq(4bit) |
> postingsStartOffset(6bit)
>
> ```
>
> On the read path, I expect to locate the offset for a term first and
> then read two values that have different bit widths.
>
> In the spirit of not re-inventing unnecessarily, I explored the
> existing PackedInts util classes and I believe there is no support for this
> at the moment. The biggest gap I found is that the existing classes expect
> to write/read values of the same bit width.
>
> I'm writing to get feedback from y'all to see if I missed anything.
>
> Cheers,
> Tony X
>

--
Sincerely yours
Mikhail Khludnev