wangyanbn opened a new issue, #12263:
URL: https://github.com/apache/lucene/issues/12263

   ### Description
   
   When we choose byte vectors instead of float vectors, we only want to use 
less memory, and we expect the byte dot product scores to be the same as the 
float dot product scores. With byte vectors, we usually multiply every item of 
the normalized float vector by 128 (and ensure the maximum value is <= 127). I 
think this is the common use case for choosing byte vectors.
   
   In 
[VectorUtil.dotProductScore](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/VectorUtil.java#L268),
 `denom` is multiplied by the array length `a.length`, which makes the byte 
vector dot product score very different from the float vector score.
   ```
    // divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len
    float denom = (float) (a.length * (1 << 15));
    return 0.5f + dotProduct(a, b) / denom;
   ```
   
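   When the vector dimension is large, such as 768 or 1024, the byte vector 
dot product scores are all close to 0.50. For unit vectors scaled by 128, the 
byte dot product is at most about 128 * 128 = 2^14, so the score is at most 
about 0.5 + 1 / (2 * a.length). If we show this score in the UI (for example 
in an image search app), it may confuse the user.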
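   To make the effect concrete, here is a minimal sketch that reuses the 
arithmetic of the snippet quoted above. The class name, `quantize`, and 
`uniformUnitVector` are hypothetical helpers written for this illustration, 
not Lucene APIs; the quantizer assumes the "multiply by 128 and clamp" scheme 
described earlier.

```java
public class ByteDotProductScoreDemo {

  /** Plain dot product of two signed-byte vectors of equal length. */
  static int dotProduct(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Same arithmetic as the quoted snippet: the denominator grows with the length. */
  static float scoreWithLength(byte[] a, byte[] b) {
    float denom = (float) (a.length * (1 << 15));
    return 0.5f + dotProduct(a, b) / denom;
  }

  /** Hypothetical quantizer: scale by 128, round, clamp to [-128, 127]. */
  static byte[] quantize(float[] v) {
    byte[] out = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      int scaled = Math.round(v[i] * 128f);
      out[i] = (byte) Math.max(-128, Math.min(127, scaled));
    }
    return out;
  }

  /** Unit-length vector whose components are all 1 / sqrt(dim). */
  static float[] uniformUnitVector(int dim) {
    float[] v = new float[dim];
    java.util.Arrays.fill(v, (float) (1.0 / Math.sqrt(dim)));
    return v;
  }

  public static void main(String[] args) {
    for (int dim : new int[] {4, 768, 1024}) {
      byte[] q = quantize(uniformUnitVector(dim));
      // Best case: a vector scored against itself.
      System.out.println("dim=" + dim + " score=" + scoreWithLength(q, q));
    }
  }
}
```

   Even in the best case (a vector scored against itself), the 1024-dimension 
score is 0.5 + 1/2048, roughly 0.5005, while the 4-dimension score is 0.625.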
   
   In the hybrid retrieval use case, this byte/float score difference also 
affects the order of documents. When searching with both a normal query and a 
knn vector search, the document score is `knn_score*knn_boost + 
query_score*search_boost` (this is the case in ElasticSearch). Because the byte 
vector scores are very close to each other for high-dimension vectors, they 
have almost no effect on the hybrid scores. The document order may therefore 
differ between byte vectors and float vectors, which is not what we expect.
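   A small sketch of how the order can flip. All numbers are made up for 
illustration. `floatScore` assumes the float dot product score for unit 
vectors is (1 + cosine) / 2, and `byteScore` assumes the byte dot product is 
approximately 128 * 128 * cosine; both are assumptions of this example, and 
the class and method names are hypothetical.

```java
public class HybridScoreOrderDemo {

  /** Hybrid score as described above: knn_score*knn_boost + query_score*search_boost. */
  static double hybrid(double knnScore, double knnBoost, double queryScore, double searchBoost) {
    return knnScore * knnBoost + queryScore * searchBoost;
  }

  /** Assumed float dot product score for unit vectors: (1 + cosine) / 2. */
  static double floatScore(double cosine) {
    return (1 + cosine) / 2;
  }

  /** Approximate byte score: dot is about 128 * 128 * cosine, divided by dim * 2^15. */
  static double byteScore(double cosine, int dim) {
    double dot = 128.0 * 128.0 * cosine;
    return 0.5 + dot / ((double) dim * (1 << 15));
  }

  public static void main(String[] args) {
    int dim = 768;
    // Hypothetical documents: A is the closer vector match, B the slightly better text match.
    double cosA = 0.9, queryA = 2.0;
    double cosB = 0.2, queryB = 2.1;

    double floatA = hybrid(floatScore(cosA), 1.0, queryA, 1.0);
    double floatB = hybrid(floatScore(cosB), 1.0, queryB, 1.0);
    double byteA = hybrid(byteScore(cosA, dim), 1.0, queryA, 1.0);
    double byteB = hybrid(byteScore(cosB, dim), 1.0, queryB, 1.0);

    System.out.println("float: A first? " + (floatA > floatB));
    System.out.println("byte:  A first? " + (byteA > byteB));
  }
}
```

   With these made-up inputs, document A ranks first with float scores, but 
document B ranks first with byte scores, because the knn term barely moves.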
   
   In 
[VectorUtil.dotProductScore](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/VectorUtil.java#L268),
 if `denom` were not multiplied by the array length, the byte dot product score 
would be the same as the float dot product score.
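   A minimal sketch of that variant, for discussion only. It is not what Lucene 
currently does; the class and the method name `scoreWithoutLength` are 
hypothetical, and the example assumes unit-length vectors scaled by 128.

```java
public class ByteDotProductScoreProposal {

  /** Plain dot product of two signed-byte vectors of equal length. */
  static int dotProduct(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Variant suggested in this issue: divide by 2 * 2^14 only, without a.length. */
  static float scoreWithoutLength(byte[] a, byte[] b) {
    float denom = (float) (1 << 15);
    return 0.5f + dotProduct(a, b) / denom;
  }

  public static void main(String[] args) {
    int dim = 1024;
    // Unit vector with every component 1/32, scaled by 128, gives 4 per component.
    byte[] a = new byte[dim];
    byte[] opposite = new byte[dim];
    byte[] orthogonal = new byte[dim];
    for (int i = 0; i < dim; i++) {
      a[i] = 4;
      opposite[i] = -4;
      orthogonal[i] = (byte) (i % 2 == 0 ? 4 : -4);
    }
    System.out.println("same:       " + scoreWithoutLength(a, a));
    System.out.println("orthogonal: " + scoreWithoutLength(a, orthogonal));
    System.out.println("opposite:   " + scoreWithoutLength(a, opposite));
  }
}
```

   For these inputs the scores are 1.0, 0.5, and 0.0, spanning the full range. 
One caveat: without the length factor, byte vectors that were not derived from 
normalized vectors (for example, all components equal to 127) would produce 
scores far above 1, so this variant only makes sense under the normalization 
assumption.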
   So, shall we change the byte dot product score logic? 
   Thanks a lot!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

