Your understanding is correct. For reference, using only 16 bits was
convenient to avoid overflowing the field length (which accumulates term
frequencies), but storage efficiency was a motivation as well: using 32
bits per frequency instead of 16 would significantly increase the size of
the inverted index.

We could look into what it would take to support greater precision, but it
would also be interesting to check whether more precision is actually
needed. Historically, some people have raised concerns about e.g. how
Lucene encodes length normalization factors on a single byte, but no
benchmark that I'm aware of has ever concluded that it hurts in practice.
My understanding is also that weights are commonly stored as bfloat16,
which has less precision (one less bit of mantissa) than FeatureField.
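To make the tradeoff concrete, here is a small standalone sketch (not
Lucene's actual code, but FeatureField encodes values essentially this way:
the top 16 bits of the positive float32 bit pattern, i.e. 8 exponent bits
plus 8 mantissa bits, stored as the term frequency):

```java
public class FeaturePrecision {

  // Encode a positive, finite float into a 16-bit "term frequency" by
  // dropping the low 15 mantissa bits (the sign bit is always 0 for
  // valid feature values, so 16 bits remain: 8 exponent + 8 mantissa).
  static int encode(float value) {
    return Float.floatToIntBits(value) >>> 15;
  }

  // Decode back; truncation means the result is the largest encodable
  // value <= the original, so the relative error is at most ~2^-8.
  static float decode(int freq) {
    return Float.intBitsToFloat(freq << 15);
  }

  public static void main(String[] args) {
    float[] samples = {1.0f, 1.5f, 3.14159f, 123.456f};
    for (float v : samples) {
      float roundTripped = decode(encode(v));
      double relErr = (v - roundTripped) / v;
      System.out.printf("%g -> %g (relative error %.4g)%n", v, roundTripped, relErr);
    }
  }
}
```

Powers of two (and anything needing at most 8 mantissa bits) round-trip
exactly; everything else loses under 0.4% — which is the kind of error a
benchmark would need to show actually moving ranking quality.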


On Fri, May 16, 2025 at 9:30 PM Michael Froh <msf...@gmail.com> wrote:

> Hi folks,
>
> I was recently (yesterday) pulled into a conversation with a data science
> colleague about sparse vector search, which led me to learn how
> FeatureFields work.
>
> While it sounds like they should work to get things started, the science
> guy was concerned about precision on their weights. Apparently their
> experiments have been using 32-bit floats (with a 23 bit mantissa), whereas
> FeatureField is stuck with an 8-bit mantissa. In theory, we could modify
> FeatureField to use IEEE-754 half-precision floats (versus the current
> truncated single-precision floats) giving us 10 bits of mantissa.
>
> I found a thread (
> https://lists.apache.org/thread/dcwm71rnjz7mz69ntwy04co3615zdl6w) from
> Ankur Goel from 2020 on a similar subject. Not really sure where he ended
> up going with it, though. (Though it sounds like his use-case might have
> been different.)
>
> I'm guessing the 16-bit limitation on FeatureField values is because the
> custom term frequencies get added to the invertState.length at
> https://github.com/apache/lucene/blob/b0ebaaed56505048d2c0bdfb7d71a3a082d23822/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L1284,
> and we're trying to avoid overflow?
>
> Are there any other good options for sparse vector storage in Lucene? I'd
> love something kind of like FeatureField, but where the values are legit
> "payload" values stored in postings without a couple of extra hops of
> indirection, but also without pretending that they're term frequencies.
> Does anyone else have a use for something like that?
>
> Thanks!
> Froh
>


-- 
Adrien
