Hi folks, I was recently (yesterday) pulled into a conversation with a data science colleague about sparse vector search, which led me to learn how FeatureFields work.
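For anyone who hasn't looked at it before, here's my rough understanding of the trick, paraphrased from memory of the FeatureField source (so treat it as a sketch, not gospel): the value gets smuggled into the postings as a term frequency by dropping the low 15 bits of the float, which is where the 8-bit mantissa below comes from (23 - 15 = 8):

    static int encodeFeatureValue(float featureValue) {
      // FeatureField requires finite, non-negative values, so the sign bit
      // is always 0 and the encoded value fits in 16 bits.
      return Float.floatToIntBits(featureValue) >>> 15;
    }

    static float decodeFeatureValue(int freq) {
      // The dropped low bits come back as zeros: truncation, not rounding.
      return Float.intBitsToFloat(freq << 15);
    }

    // A half-precision variant would presumably look something like this.
    // Float.floatToFloat16 is JDK 20+; older JDKs would need to do the
    // conversion by hand. We'd gain mantissa bits (10 vs. 8) but lose
    // exponent bits (5 vs. 8), i.e. less dynamic range.
    static int encodeFeatureValueHalf(float featureValue) {
      return Float.floatToFloat16(featureValue) & 0xFFFF;
    }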
While it sounds like they would work to get things started, my colleague was concerned about the precision of the weights. Apparently their experiments have been using 32-bit floats (with a 23-bit mantissa), whereas FeatureField is stuck with an 8-bit mantissa. In theory, we could modify FeatureField to use IEEE-754 half-precision floats (versus the current truncated single-precision floats), which would give us 10 bits of mantissa.

I found a thread (https://lists.apache.org/thread/dcwm71rnjz7mz69ntwy04co3615zdl6w) from Ankur Goel from 2020 on a similar subject. I'm not really sure where he ended up with it, though (and it sounds like his use case might have been different).

I'm guessing the 16-bit limitation on FeatureField values exists because the custom term frequencies get added to invertState.length at https://github.com/apache/lucene/blob/b0ebaaed56505048d2c0bdfb7d71a3a082d23822/lucene/core/src/java/org/apache/lucene/index/IndexingChain.java#L1284, and we're trying to avoid overflow?

Are there any other good options for sparse vector storage in Lucene? I'd love something kind of like FeatureField, but where the values are legit "payload" values stored in postings without a couple of extra hops of indirection, and without pretending that they're term frequencies. Does anyone else have a use for something like that?

Thanks!
Froh
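P.S. To make the payload idea concrete, here's an untested sketch of how I could approximate it today: one term per dimension, with the full 32-bit weight riding along as a payload rather than a term frequency. (SparseVectorTokenStream and sparseVectorField are just names I made up for this example; PayloadHelper comes from the analysis-common module.)

    import java.util.Iterator;
    import java.util.Map;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.index.IndexOptions;
    import org.apache.lucene.util.BytesRef;

    // Single-use token stream: emits one token per sparse dimension, with
    // the weight encoded as a 4-byte payload instead of a term frequency.
    final class SparseVectorTokenStream extends TokenStream {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
      private final Iterator<Map.Entry<String, Float>> it;

      SparseVectorTokenStream(Map<String, Float> dims) {
        this.it = dims.entrySet().iterator();
      }

      @Override
      public boolean incrementToken() {
        if (!it.hasNext()) {
          return false;
        }
        clearAttributes();
        Map.Entry<String, Float> dim = it.next();
        termAtt.append(dim.getKey());
        payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(dim.getValue())));
        return true;
      }

      // Hypothetical helper: payloads only exist if positions are indexed,
      // which is the first of the extra hops I'd rather avoid.
      static Field sparseVectorField(String name, Map<String, Float> dims) {
        FieldType ft = new FieldType();
        ft.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        ft.setTokenized(true);
        ft.freeze();
        return new Field(name, new SparseVectorTokenStream(dims), ft);
      }
    }

At query time you'd read the weights back through postings opened with PostingsEnum.PAYLOADS, i.e. advance to the doc, step through positions, then fetch each payload: that positions-then-payloads chain is exactly the couple of extra hops I'd love to skip.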