Forgot to close the loop on this one and was traveling last week.

I've gone back to the science folks with the suggestion that we experiment
with 16 bits. I suspect that you're right and 16 bits will be good enough.
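
For anyone following along, here is a rough, hypothetical illustration of
what 16 bits of precision means for a weight. It truncates a float to its
top 16 bits (bfloat16-style), which is not necessarily FeatureField's exact
encoding, just a way to eyeball the rounding error:

public class SixteenBitDemo {
  // Keep only the top 16 bits of the float's bit pattern: the sign, the 8
  // exponent bits, and 7 of the 23 mantissa bits (bfloat16-style truncation).
  static float truncateTo16Bits(float value) {
    int bits = Float.floatToIntBits(value);
    return Float.intBitsToFloat(bits & 0xFFFF0000);
  }

  public static void main(String[] args) {
    float weight = 0.123456789f;   // made-up example weight
    float truncated = truncateTo16Bits(weight);
    // With 7 explicit mantissa bits, the relative error of truncation stays
    // under roughly 2^-7 (about 0.8%).
    System.out.println(weight + " -> " + truncated);
  }
}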

It's good to know that we can play with it a bit, provided that we stay
under the field length overflow. For example, if we know that there are at
most 128 non-zero values for a given doc, I think we could expand to 24 bits
per frequency. Hopefully it won't be necessary, though.
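
And for anyone checking my arithmetic on that bound, here's the
back-of-the-envelope version, assuming the per-field length accumulator is a
signed 32-bit int as in IndexingChain (the 128-value cap and the 24-bit
width are just illustrative numbers):

public class OverflowCheck {
  public static void main(String[] args) {
    long maxFreq = (1L << 24) - 1;   // largest frequency that fits in 24 bits
    long maxValuesPerDoc = 128;      // hypothetical per-doc cap on non-zero values
    long worstCaseLength = maxFreq * maxValuesPerDoc;
    // 128 * (2^24 - 1) = 2^31 - 128, which is still below Integer.MAX_VALUE,
    // so the accumulated field length cannot overflow under these caps.
    System.out.println(worstCaseLength + " < " + Integer.MAX_VALUE
        + " : " + (worstCaseLength < Integer.MAX_VALUE));
  }
}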

Thanks Adrien!

On Fri, May 16, 2025 at 1:51 PM Adrien Grand <jpou...@gmail.com> wrote:

> Your understanding is correct. For reference, using only 16 bits was
> convenient to avoid overflowing the field length, but storage efficiency
> was a motivation as well. Using 32 bits per frequency instead of 16 would
> significantly increase the size of the inverted index.
>
> We could look into what it would take to support greater precision; on the
> other hand, it would be interesting to check whether more precision is
> actually needed. Historically, some people have raised concerns about e.g. how
> Lucene encodes length normalization factors on a single byte, but no
> benchmark that I'm aware of ever concluded that it hurts in practice. My
> understanding is also that weights are commonly stored as bfloat16, which
> has less precision (one fewer bit of mantissa) than FeatureField.
>
>
> On Fri, May 16, 2025 at 9:30 PM Michael Froh <msf...@gmail.com> wrote:
>
>> Hi folks,
>>
>> I was recently (yesterday) pulled into a conversation with a data science
>> colleague about sparse vector search, which led me to learn how
>> FeatureFields work.
>>
>> While it sounds like they should work to get things started, the science
>> guy was concerned about the precision of their weights. Apparently their
>> experiments have been using 32-bit floats (with a 23-bit mantissa), whereas
>> FeatureField is stuck with an 8-bit mantissa. In theory, we could modify
>> FeatureField to use IEEE 754 half-precision floats (versus the current
>> truncated single-precision floats), giving us 10 bits of mantissa.
>>
>> I found a thread (
>> https://lists.apache.org/thread/dcwm71rnjz7mz69ntwy04co3615zdl6w) from
>> Ankur Goel from 2020 on a similar subject. I'm not really sure where he
>> ended up going with it, though it sounds like his use case might have been
>> different anyway.
>>
>> I'm guessing the 16-bit limitation on FeatureField values is because the
>> custom term frequencies get added to the invertState.length at
>> https://github.com/apache/lucene/blob/b0ebaaed56505048d2c0bdfb7d71a3a082d23822/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L1284,
>> and we're trying to avoid overflow?
>>
>> Are there any other good options for sparse vector storage in Lucene? I'd
>> love something kind of like FeatureField, but where the values are legit
>> "payload" values stored in postings without a couple of extra hops of
>> indirection, but also without pretending that they're term frequencies.
>> Does anyone else have a use for something like that?
>>
>> Thanks!
>> Froh
>>
>
>
> --
> Adrien
>
