iprithv opened a new issue, #16211:
URL: https://github.com/apache/lucene/issues/16211

   ### Description
   
   #13288 added codec-level support for bit vectors via `HnswBitVectorsFormat` 
and `FlatBitVectorsScorer` in the codecs module. However, there is still no 
document field for bit vectors and no similarity function that honestly 
represents Hamming distance.
   
   To index binary embeddings today, we have to:
   1. Manually pack bits into `byte[]`
   2. Use `KnnByteVectorField` with an arbitrary `VectorSimilarityFunction` 
(e.g. `DOT_PRODUCT`) that the codec silently ignores
   3. Configure `HnswBitVectorsFormat` which internally uses 
`FlatBitVectorsScorer` regardless of what similarity the field declares
   
   This works mechanically, but the API is misleading: the declared similarity 
doesn't match the actual scoring.
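
For illustration, step 1 of the workaround above might look like the sketch below. `BitPacking` is a hypothetical helper, not part of Lucene, and the MSB-first bit order is an assumption; Hamming distance does not depend on the order as long as index and query vectors are packed the same way.

```java
import java.util.Arrays;

/** Hypothetical helper (not a Lucene class): packs 8 bits per byte, MSB first. */
final class BitPacking {

  /** Packs the given bits into ceil(bits.length / 8) bytes; unused trailing bits stay 0. */
  static byte[] pack(boolean[] bits) {
    byte[] packed = new byte[(bits.length + 7) / 8];
    for (int i = 0; i < bits.length; i++) {
      if (bits[i]) {
        // Bit i goes into byte i / 8, counting from the most significant bit.
        packed[i / 8] |= (byte) (0x80 >>> (i % 8));
      }
    }
    return packed;
  }

  public static void main(String[] args) {
    boolean[] bits = new boolean[16];
    bits[0] = true;  // highest bit of byte 0
    bits[15] = true; // lowest bit of byte 1
    System.out.println(Arrays.toString(pack(bits))); // prints [-128, 1]
  }
}
```

The resulting `byte[]` is what would then be handed to `KnnByteVectorField` in step 2.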
   
   Can we add:
   
   1. `VectorSimilarityFunction.HAMMING` - Hamming distance for byte-encoded 
bit vectors. Score = `(totalBits - hammingDistance) / totalBits`, producing [0, 
1]. Float vectors throw `UnsupportedOperationException`.
   
   2. `KnnBitVectorField` - a document field for packed bit vectors that uses 
HAMMING similarity.
   
   3. Validation in `FieldInfo` rejecting HAMMING + FLOAT32 (nonsensical 
combination).
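
A minimal sketch of the score proposed in item 1, assuming both vectors are packed `byte[]` arrays of equal, non-zero length. `HammingScore` and its method names are hypothetical and only illustrate the formula; this is not Lucene's implementation.

```java
/** Hypothetical sketch of the proposed HAMMING score (not Lucene code). */
final class HammingScore {

  /** Counts the bit positions in which the two packed vectors differ. */
  static int hammingDistance(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "vector lengths differ: " + a.length + " != " + b.length);
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // Mask to 8 bits so sign extension of negative bytes is not counted.
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    return distance;
  }

  /** Score = (totalBits - hammingDistance) / totalBits, in [0, 1]. */
  static float score(byte[] a, byte[] b) {
    int totalBits = a.length * Byte.SIZE;
    return (totalBits - hammingDistance(a, b)) / (float) totalBits;
  }

  public static void main(String[] args) {
    byte[] a = {(byte) 0xFF, 0x00};
    byte[] b = {0x0F, 0x00};
    System.out.println(score(a, b)); // 4 of 16 bits differ, prints 0.75
  }
}
```

Identical vectors score 1 and bitwise complements score 0, so higher is better, as with the existing similarity functions.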
   
   Open questions:
   1. Should HAMMING live in `VectorSimilarityFunction` (core enum) or be 
handled differently?
   Adding to the core enum means every `FlatVectorsScorer`, quantized scorer, 
and memory-segment scorer needs to handle it (even if just to reject it). It 
also affects any external code with exhaustive switches over the enum. The 
alternative would be a separate mechanism, but that seems like it would create 
a parallel API for what is fundamentally the same concept.
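
To make the cost concrete, here is a rough sketch of what "handle it, even if just to reject it" could look like in a float scorer. The `Similarity` enum is a reduced, hypothetical stand-in for `VectorSimilarityFunction`, and the method returns raw values rather than Lucene's normalized scores.

```java
/** Hypothetical float scorer sketch; the enum is a stand-in, not Lucene's. */
final class FloatScorerSketch {

  /** Reduced stand-in for VectorSimilarityFunction (three constants for brevity). */
  enum Similarity { EUCLIDEAN, DOT_PRODUCT, HAMMING }

  /** Returns the raw dot product or squared Euclidean distance; rejects HAMMING. */
  static float compare(Similarity similarity, float[] a, float[] b) {
    switch (similarity) {
      case DOT_PRODUCT: {
        float dot = 0f;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
        }
        return dot;
      }
      case EUCLIDEAN: {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
          float diff = a[i] - b[i];
          sum += diff * diff;
        }
        return sum;
      }
      case HAMMING:
        // Every float-based scorer needs a branch like this once the constant exists.
        throw new UnsupportedOperationException(
            "HAMMING is only defined for byte-encoded bit vectors");
      default:
        throw new AssertionError("unhandled similarity: " + similarity);
    }
  }
}
```

External code that switches over the enum without a `default` branch would need a similar update.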
   
   2. Dimension semantics: bytes or bits?
   `KnnByteVectorField` reports `vectorDimension()` as the number of bytes. For 
bit vectors, each byte packs 8 bits, so a 128-bit embedding would report 
dimension=16. This is confusing but consistent with how the codec layer works. 
Should `KnnBitVectorField` report the bit count instead, or keep the byte count 
to match the storage layer?
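
The two conventions differ only by a factor of 8, but the conversion is lossy in one direction. The helper below is a hypothetical sketch (not a Lucene API) of the arithmetic involved.

```java
/** Hypothetical helper (not a Lucene class) for converting between bit and byte counts. */
final class BitDimensions {

  /** Number of bytes needed to store bitCount bits, rounding up. */
  static int bytesForBits(int bitCount) {
    return (bitCount + Byte.SIZE - 1) / Byte.SIZE;
  }

  /** Number of bits that byteCount packed bytes can hold. */
  static int bitsForBytes(int byteCount) {
    return byteCount * Byte.SIZE;
  }

  public static void main(String[] args) {
    System.out.println(bytesForBits(128)); // prints 16
    System.out.println(bitsForBytes(16));  // prints 128
  }
}
```

If bit counts that are not multiples of 8 were ever allowed, a byte-based dimension could not recover the original bit count: 130 bits need 17 bytes, which hold 136 bits.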
   
   3. Should `KnnBitVectorField` extend `KnnByteVectorField`?
   Extending it gives free compatibility with `IndexingChain` and 
`KnnByteVectorQuery`. But the superclass has byte-vector-specific assumptions 
(e.g. cosine zero-vector checks, Javadoc saying "each byte represents a vector 
dimension"). A standalone class extending `Field` would be cleaner semantically 
but requires more plumbing.
   
   4. Backward compatibility: the similarity ordinal is persisted in two places
   The similarity function is written as an ordinal in both:
   - `.fnm` file: `Lucene94FieldInfosFormat.writeByte(distFuncToOrd(...))`
   - `.vem` file: `Lucene99HnswVectorsWriter.writeInt(distFuncToOrd(...))`
   Adding HAMMING (ordinal 4) means old readers hit 
`IllegalArgumentException("invalid distance function: 4")` when reading 
segments written with this code. Options:
   - (a) Bump format versions so old readers get a clean "unsupported version" 
message
   - (b) Accept the raw ordinal error as the backward-compat behavior
   - (c) Only allow HAMMING with codecs that handle it internally, avoiding the 
need to persist it
   
   I'm leaning toward (a) for `Lucene99HnswVectorsFormat`. Is this the right 
call? And should `Lucene94FieldInfosFormat` also get a version bump?
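
The failure mode in option (b) can be reproduced with a small self-contained sketch. The two enums and `ordToDistFunc` below are hypothetical stand-ins that mimic ordinal-based decoding; they are not the actual Lucene reader code.

```java
/** Hypothetical model of ordinal-based decoding (not the actual Lucene reader). */
final class OrdinalCompat {

  /** What a reader built before this change would know about. */
  enum OldSimilarity { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT }

  /** The same enum with HAMMING appended, so it gets ordinal 4. */
  enum NewSimilarity { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT, HAMMING }

  /** Maps a persisted ordinal back to an enum constant, rejecting unknown ordinals. */
  static <E extends Enum<E>> E ordToDistFunc(Class<E> type, int ord) {
    E[] values = type.getEnumConstants();
    if (ord < 0 || ord >= values.length) {
      throw new IllegalArgumentException("invalid distance function: " + ord);
    }
    return values[ord];
  }

  public static void main(String[] args) {
    int persisted = NewSimilarity.HAMMING.ordinal(); // 4
    System.out.println(ordToDistFunc(NewSimilarity.class, persisted)); // prints HAMMING
    try {
      ordToDistFunc(OldSimilarity.class, persisted);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // prints invalid distance function: 4
    }
  }
}
```

Appending the constant keeps the existing ordinals 0 to 3 stable, so only segments that actually use HAMMING are affected.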
   
   5. Test framework impact
   `BaseKnnVectorsFormatTestCase.randomSimilarity()` currently returns any 
similarity. With HAMMING, float vector tests would randomly hit the 
FLOAT32+HAMMING rejection. I propose excluding HAMMING from 
`randomSimilarity()` and testing it separately in bit-vector-specific tests. 
Is this acceptable, or should the base test framework be made encoding-aware?
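
For comparison, an encoding-aware variant could look roughly like the sketch below. The enums and the `randomSimilarity(Random, Encoding)` signature are hypothetical stand-ins, not the existing test framework API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical encoding-aware picker (not the existing test framework code). */
final class RandomSimilaritySketch {

  /** Stand-ins for VectorEncoding and VectorSimilarityFunction. */
  enum Encoding { BYTE, FLOAT32 }

  enum Similarity { EUCLIDEAN, DOT_PRODUCT, COSINE, MAXIMUM_INNER_PRODUCT, HAMMING }

  /** HAMMING is only meaningful for byte-encoded vectors. */
  static boolean isCompatible(Similarity similarity, Encoding encoding) {
    return similarity != Similarity.HAMMING || encoding == Encoding.BYTE;
  }

  /** Picks uniformly among the similarities that are valid for the given encoding. */
  static Similarity randomSimilarity(Random random, Encoding encoding) {
    List<Similarity> candidates = new ArrayList<>();
    for (Similarity similarity : Similarity.values()) {
      if (isCompatible(similarity, encoding)) {
        candidates.add(similarity);
      }
    }
    return candidates.get(random.nextInt(candidates.size()));
  }
}
```

Excluding HAMMING outright is the smaller change; the variant above trades a signature change in the base test case for coverage of HAMMING in the shared byte-vector tests.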


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

