Forgot to close the loop on this one and was traveling last week. I've gone back to the science folks with the suggestion that we experiment with 16 bits. I suspect that you're right and 16 bits will be good enough.
It is good to know that we can play with it a bit, provided that we keep the field length from overflowing. For example, if we know that there are fewer than 127 non-zero values for a given doc, I think we could expand to 24 bits. (I'm probably off by one in the count of values or bits; there's a rough sketch of the arithmetic at the bottom of this mail, below the quoted thread.) Hopefully it won't be necessary, though.

Thanks, Adrien!

On Fri, May 16, 2025 at 1:51 PM Adrien Grand <jpou...@gmail.com> wrote:

> Your understanding is correct. For reference, using only 16 bits was
> convenient to avoid overflowing the field length, but storage efficiency
> was a motivation as well. Using 32 bits per frequency instead of 16 would
> significantly increase the size of the inverted index.
>
> We could look into what it would take to support greater precision; on the
> other hand, it would be interesting to check whether more precision is
> actually needed. Historically, some people have raised concerns about e.g.
> how Lucene encodes length normalization factors on a single byte, but no
> benchmark that I'm aware of ever concluded that it hurts in practice. My
> understanding is also that weights are commonly stored as bfloat16, which
> has less precision (one fewer bit of mantissa) than FeatureField.
>
>
> On Fri, May 16, 2025 at 9:30 PM Michael Froh <msf...@gmail.com> wrote:
>
>> Hi folks,
>>
>> I was recently (yesterday) pulled into a conversation with a data science
>> colleague about sparse vector search, which led me to learn how
>> FeatureFields work.
>>
>> While it sounds like they should work to get things started, the science
>> guy was concerned about precision on their weights. Apparently their
>> experiments have been using 32-bit floats (with a 23-bit mantissa), whereas
>> FeatureField is stuck with an 8-bit mantissa. In theory, we could modify
>> FeatureField to use IEEE-754 half-precision floats (versus the current
>> truncated single-precision floats), giving us 10 bits of mantissa.
>>
>> I found a thread (
>> https://lists.apache.org/thread/dcwm71rnjz7mz69ntwy04co3615zdl6w) from
>> Ankur Goel from 2020 on a similar subject. I'm not really sure where he
>> ended up going with it, though. (It sounds like his use case might have
>> been different.)
>>
>> I'm guessing the 16-bit limitation on FeatureField values exists because
>> the custom term frequencies get added to invertState.length at
>> https://github.com/apache/lucene/blob/b0ebaaed56505048d2c0bdfb7d71a3a082d23822/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L1284,
>> and we're trying to avoid overflow?
>>
>> Are there any other good options for sparse vector storage in Lucene? I'd
>> love something kind of like FeatureField, but where the values are legit
>> "payload" values stored in postings without a couple of extra hops of
>> indirection, but also without pretending that they're term frequencies.
>> Does anyone else have a use for something like that?
>>
>> Thanks!
>> Froh
>>
>
>
> --
> Adrien
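
P.S. Here's the back-of-the-envelope arithmetic behind the "fewer than 127 values" claim above, as a standalone snippet. It isn't Lucene code; the class name is made up, and it only assumes that the per-document field length accumulates the custom frequencies into a plain Java int, per the IndexingChain line Froh linked:

// Rough overflow math only, not Lucene code.
public class FeatureFreqOverflowMath {
  public static void main(String[] args) {
    long maxFreq16 = (1L << 16) - 1; // current FeatureField ceiling: 65,535
    long maxFreq24 = (1L << 24) - 1; // hypothetical 24-bit frequencies: 16,777,215

    // The per-doc field length is an int, so values-per-doc * max-frequency
    // has to stay below Integer.MAX_VALUE.
    System.out.println(Integer.MAX_VALUE / maxFreq16); // 32768 values per doc
    System.out.println(Integer.MAX_VALUE / maxFreq24); // 128 values per doc, hence ~127 to be safe
  }
}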