Thank you both for your ideas. I sensed that there could be potential confusions. Let me try to clear them first -- I'm not changing how postings are encoded. Rather I'm changing how the postings's metadata (in code, this will be the IntBlockTermState<https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsFormat.java#L443>) are encoded in the term dictionary.
As a useful example, here are the data that we need to encode for a field that has "everything" enabled (doc, freq, positions, payload and offsets) * docStartFp: the posting file pointer * skipOffset: pointer to skip data * posStartFp: positions file pointer * payStartFp: payload file pointer * docFreq: number of docs that has this term * totalTermFeq: total freq of this term across all docs * singletonDocId: when a term just has a single posting, we store the docid directly * lastPosBlockOffset: the relative offset of the last position block relative to posStartFp > Is it possible to write a block of docfreqs and then a block of > postingoffsets? Yes, but that is less desirable since it will lose locality. It'd be multiple seeks instead of one. Plus, the example I used is rather simplified. There are more than two type of values that we need to deal with. > why not write them as 10-bit integers and then split to quad and sextet in > the posting format code? I've considered this. It is actually a decent workaround but comes with its limitation. As mentioned there could be multiple type of values and in total they may exceed 64 bits. Adrien: The docFreq distribution is interesting. bit-packing can be a good start, I think? To the rest of your point, I do need random access within the block of terms as its main usage is to back the term dictionary. Thanks, Tony ________________________________ From: Adrien Grand <jpou...@gmail.com> Sent: Tuesday, October 17, 2023 1:26 AM To: dev@lucene.apache.org <dev@lucene.apache.org> Subject: Re: PackedInts functionalities +1 to what Mikhail wrote, this is e.g. how postings work: instead of interleaving doc IDs and frequencies, they always store a block of 128 doc IDs followed by a block of 128 frequencies. For reference, bit packing feels space-inefficient for this kind of data. I would expect docFreqs to have a zipfian distribution, so you would end up using a number of bits per docFreq that is driven by the highest docFreq in the block while most values might be very low. Do you need random-access into these doc freqs and postings start offsets or will you decode data for an entire block every time anyway? On Tue, Oct 17, 2023 at 8:39 AM Mikhail Khludnev <m...@apache.org<mailto:m...@apache.org>> wrote: Hello Tony Is it possible to write a block of docfreqs and then a block of postingoffsets? Or why not write them as 10-bit integers and then split to quad and sextet in the posting format code? On Mon, Oct 16, 2023 at 11:50 PM Dongyu Xu <dongyu...@hotmail.com<mailto:dongyu...@hotmail.com>> wrote: Hi devs, As I was working on https://github.com/apache/lucene/issues/12513 I needed to compress positive integers which are used to locate postings etc. To put it concretely, I will need to pack a few values per term contiguously and those values can have different bit-width. For example, consider that we need to encode docFreq and postingsStartOffset per term and docFreq takes 4 bit and the postingsStartOffset takes 6 bit. We expect to write the following for two terms. ``` Term1 | Term2 docFreq(4bit) | postingsStartOffset(6bit) | docFreq(4bit) | postingsStartOffset(6bit) ``` On the read path, I expect to locate the offest for a term first and followed by reading two values that have different bit-width. In the spirit of not re-inventing necessarily, I tried to explore the existing PackedInts util classes and I believe there is no support for this at the moment. The biggest gap I found is that the existing classes expect to write/read values of same bit-width. I'm writing to get feedback from yall to see if I missed anything. Cheers, Tony X -- Sincerely yours Mikhail Khludnev -- Adrien