Hi folks,

I was pulled into a conversation yesterday with a data science
colleague about sparse vector search, which led me to learn how
FeatureFields work.

While it sounds like FeatureField should work to get things started, my
colleague was concerned about the precision of the weights. Apparently their
experiments have been using 32-bit floats (with a 23-bit mantissa), whereas
FeatureField is stuck with an 8-bit mantissa. In theory, we could modify
FeatureField to use IEEE-754 half-precision floats (versus the current
truncated single-precision floats), which would give us 10 bits of mantissa.
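
To make the gap concrete, here's a quick round-trip comparison (plain
Java; the class and method names are made up, the truncation mirrors my
understanding of FeatureField's encoding, i.e. keeping only the top 16
bits of the single-precision representation, and the float16 helpers
need Java 20+):

public class MantissaSketch {

  /** Keep sign + 8-bit exponent + top 8 mantissa bits, zero the rest. */
  static float truncateTo8BitMantissa(float v) {
    return Float.intBitsToFloat((Float.floatToIntBits(v) >>> 15) << 15);
  }

  /** Round-trip through IEEE-754 binary16 (10 explicit mantissa bits). */
  static float roundTripHalf(float v) {
    return Float.float16ToFloat(Float.floatToFloat16(v));
  }

  public static void main(String[] args) {
    float w = 0.123456f;
    System.out.printf("original:        %.7f%n", w);
    System.out.printf("8-bit mantissa:  %.7f%n", truncateTo8BitMantissa(w));
    System.out.printf("10-bit mantissa: %.7f%n", roundTripHalf(w));
  }
}

One wrinkle: binary16 only keeps a 5-bit exponent versus the 8 bits the
current truncated encoding preserves, so we'd be trading dynamic range for
mantissa precision.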

I found a thread (
https://lists.apache.org/thread/dcwm71rnjz7mz69ntwy04co3615zdl6w) from
Ankur Goel from 2020 on a similar subject. I'm not sure where he ended up
going with it, though it sounds like his use case might have been different.

I'm guessing the 16-bit limitation on FeatureField values is because the
custom term frequencies get added to the invertState.length at
https://github.com/apache/lucene/blob/b0ebaaed56505048d2c0bdfb7d71a3a082d23822/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L1284,
and we're trying to avoid overflow?
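
If I'm reading that right, the whole mechanism boils down to a single-token
TokenStream that sets TermFrequencyAttribute, roughly like this (the class
name is made up, and as I understand it custom term frequencies also require
the field to be indexed with IndexOptions.DOCS_AND_FREQS, i.e. no positions):

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

// Hypothetical sketch: a single-token stream that smuggles an integer
// weight in as a custom term frequency, which is the trick FeatureField
// is built on.
final class WeightAsFreqStream extends TokenStream {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final TermFrequencyAttribute freqAtt = addAttribute(TermFrequencyAttribute.class);
  private final String feature;
  private final int encodedWeight;
  private boolean done;

  WeightAsFreqStream(String feature, int encodedWeight) {
    this.feature = feature;
    this.encodedWeight = encodedWeight;
  }

  @Override
  public boolean incrementToken() {
    if (done) {
      return false;
    }
    done = true;
    clearAttributes();
    termAtt.append(feature);
    // This is the value that gets summed into invertState.length at
    // index time, which is presumably why it has to stay small.
    freqAtt.setTermFrequency(encodedWeight);
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false;
  }
}

(For what it's worth, with 16-bit frequencies it would take on the order of
32K such tokens in a single document to overflow a signed 32-bit length.)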

Are there any other good options for sparse vector storage in Lucene? I'd
love something kind of like FeatureField, but where the values are legit
"payload" values stored in the postings, without a couple of extra hops of
indirection and without pretending that they're term frequencies. Does
anyone else have a use for something like that?
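
For what it's worth, here's the kind of thing I mean, sketched on top of the
existing payload machinery (everything below is hypothetical, and the field
would need positions indexed, e.g. IndexOptions.DOCS_AND_FREQS_AND_POSITIONS,
for payloads to be stored at all):

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.BytesRef;

// Hypothetical sketch: store each sparse dimension's full 32-bit weight
// as a 4-byte payload instead of a truncated term frequency.
final class PayloadWeights {

  /** Single-token stream that attaches the weight as a payload. */
  static final class WeightStream extends TokenStream {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    private final String feature;
    private final float weight;
    private boolean done;

    WeightStream(String feature, float weight) {
      this.feature = feature;
      this.weight = weight;
    }

    @Override
    public boolean incrementToken() {
      if (done) {
        return false;
      }
      done = true;
      clearAttributes();
      termAtt.append(feature);
      int bits = Float.floatToIntBits(weight); // keep all 23 mantissa bits
      payloadAtt.setPayload(
          new BytesRef(new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16), (byte) (bits >>> 8), (byte) bits
          }));
      return true;
    }

    @Override
    public void reset() throws IOException {
      super.reset();
      done = false;
    }
  }

  /** Read one weight back for a given doc. */
  static float readWeight(LeafReader reader, String field, String feature, int docID)
      throws IOException {
    PostingsEnum postings = reader.postings(new Term(field, feature), PostingsEnum.PAYLOADS);
    if (postings == null || postings.advance(docID) != docID) {
      return 0f; // hypothetical convention: absent dimension means weight 0
    }
    postings.nextPosition(); // payloads hang off positions
    BytesRef p = postings.getPayload();
    if (p == null) {
      return 0f;
    }
    int bits =
        ((p.bytes[p.offset] & 0xFF) << 24)
            | ((p.bytes[p.offset + 1] & 0xFF) << 16)
            | ((p.bytes[p.offset + 2] & 0xFF) << 8)
            | (p.bytes[p.offset + 3] & 0xFF);
    return Float.intBitsToFloat(bits);
  }
}

The nextPosition() call on the read side is exactly the kind of extra hop
I'd love a purpose-built format to avoid.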

Thanks!
Froh
