Thank you both for your ideas.

I sensed that there could be potential confusions. Let me try to clear them 
first --
I'm not changing how postings are encoded. Rather I'm changing how the 
postings's metadata (in code, this will be the 
IntBlockTermState<https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsFormat.java#L443>​)
 are encoded in the term dictionary.

As a useful example, here are the data that we need to encode for a field that 
has "everything" enabled (doc, freq, positions, payload and offsets)

  *   docStartFp: the posting file pointer
  *   skipOffset: pointer to skip data
  *   posStartFp: positions file pointer
  *   payStartFp: payload file pointer
  *   docFreq: number of docs that has this term
  *   totalTermFeq: total freq of this term across all docs
  *   singletonDocId: when a term just has a single posting, we store the docid 
directly
  *   lastPosBlockOffset: the relative offset of the last position block 
relative to posStartFp


> Is it possible to write a block of docfreqs and then a block of 
> postingoffsets?

Yes, but that is less desirable since it will lose locality. It'd be multiple 
seeks instead of one. Plus, the example I used is rather simplified. There are 
more than two type of values that we need to deal with.


> why not write them as 10-bit integers and then split to quad and sextet in 
> the posting format code?

I've considered this. It is actually a decent workaround but comes with its 
limitation. As mentioned there could be multiple type of values and in total 
they may exceed 64 bits.


Adrien: The docFreq distribution is interesting. bit-packing can be a good 
start, I think? To the rest of your point, I do need random access within the 
block of terms as its main usage is to back the term dictionary.



Thanks,
Tony

________________________________
From: Adrien Grand <jpou...@gmail.com>
Sent: Tuesday, October 17, 2023 1:26 AM
To: dev@lucene.apache.org <dev@lucene.apache.org>
Subject: Re: PackedInts functionalities

+1 to what Mikhail wrote, this is e.g. how postings work: instead of 
interleaving doc IDs and frequencies, they always store a block of 128 doc IDs 
followed by a block of 128 frequencies.

For reference, bit packing feels space-inefficient for this kind of data. I 
would expect docFreqs to have a zipfian distribution, so you would end up using 
a number of bits per docFreq that is driven by the highest docFreq in the block 
while most values might be very low. Do you need random-access into these doc 
freqs and postings start offsets or will you decode data for an entire block 
every time anyway?


On Tue, Oct 17, 2023 at 8:39 AM Mikhail Khludnev 
<m...@apache.org<mailto:m...@apache.org>> wrote:
Hello Tony
Is it possible to write a block of docfreqs and then a block of postingoffsets?
Or why not write them as 10-bit integers and then split to quad and sextet in 
the posting format code?

On Mon, Oct 16, 2023 at 11:50 PM Dongyu Xu 
<dongyu...@hotmail.com<mailto:dongyu...@hotmail.com>> wrote:
Hi devs,

As I was working on https://github.com/apache/lucene/issues/12513 I needed to 
compress positive integers which are used to locate postings etc.

To put it concretely, I will need to pack a few values per term contiguously 
and those values can have different bit-width. For example, consider that we 
need to encode docFreq and postingsStartOffset per term and docFreq takes 4 bit 
and the postingsStartOffset takes 6 bit. We expect to write the following for 
two terms.

```
Term1                                                      |     Term2

docFreq(4bit) | postingsStartOffset(6bit) | docFreq(4bit) | 
postingsStartOffset(6bit)

```

On the read path, I expect to locate the offest for a term first and followed 
by reading two values that have different bit-width.

In the spirit of not re-inventing necessarily, I tried to explore the existing 
PackedInts util classes and I believe there is no support for this at the 
moment. The biggest gap I found is that the existing classes expect to 
write/read values of same bit-width.

I'm writing to get feedback from yall to see if I missed anything.

Cheers,
Tony X


--
Sincerely yours
Mikhail Khludnev


--
Adrien

Reply via email to