amoll75 commented on issue #16263:
URL: https://github.com/apache/lucene/issues/16263#issuecomment-4718522727

   Thanks for the feedback.
   
   I agree that the embedding model is ultimately responsible for how 
information is encoded and that Lucene itself should not introduce an arbitrary 
semantic bias.
   
   However, what we are observing in practice is that, for many real-world 
embedding models, nearest-neighbor retrieval tends to favor very short text 
segments. In our corpus, these short segments often contain less useful 
information than slightly longer passages expressing the same concept.
   
   The effect is not caused by HNSW itself, but it manifests at retrieval time 
regardless of the exact cause inside the embedding model. From an application 
perspective, the result is that highly informative passages are often ranked 
below short fragments with very similar embeddings.
   
   Your suggestion of encoding length information directly into the vector 
magnitude and using MIPS is interesting. The challenge is that it requires 
re-embedding or at least reprocessing the entire corpus and committing to a 
specific length-biasing strategy at indexing time.
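   
   To make the trade-off concrete, here is a rough sketch of what the 
magnitude-based approach could look like. The logarithmic length factor, the 
`alpha` parameter, and all class and method names are assumptions made for 
illustration only; none of them come from this thread or from Lucene. The point 
is that the factor is baked into the stored vector, so changing it means 
rewriting every vector.

```java
public final class LengthScaledVectors {
    private LengthScaledVectors() {}

    /** Hypothetical length factor: grows logarithmically with the token count. */
    public static float lengthFactor(int tokenCount, float alpha) {
        return 1.0f + alpha * (float) Math.log1p(tokenCount);
    }

    /** Returns a copy of a unit-length embedding scaled by the length factor. */
    public static float[] scale(float[] unitVector, int tokenCount, float alpha) {
        float factor = lengthFactor(tokenCount, alpha);
        float[] scaled = new float[unitVector.length];
        for (int i = 0; i < unitVector.length; i++) {
            scaled[i] = unitVector[i] * factor;
        }
        return scaled;
    }

    /** Plain inner product, i.e. the score a MIPS search would maximize. */
    public static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

   With identical directions, a longer passage then gets a larger inner product 
than a shorter one, but only for the one `alpha` chosen at indexing time.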
   
   What motivated my proposal was the possibility of applying such corrections 
at retrieval time instead. This would allow experimentation with different 
normalization functions without rebuilding embeddings and would remain 
compatible with existing indexes.
   
   More generally, I think this could be useful beyond passage-length 
normalization. Many applications have document-level signals available at 
indexing time that are difficult or impossible to encode directly into the 
embedding itself. Examples include document recency, authority, popularity, 
quality scores, and passage length. Such signals can often be represented as a 
simple numeric factor and combined with the vector similarity score.
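   
   As a minimal sketch of the arithmetic only: the snippet below multiplies each 
candidate's similarity by a stored per-document factor and re-sorts. The 
multiplicative combination and the `Hit` and `FactorRescorer` names are 
hypothetical, and this is plain post-processing in application code, not a 
Lucene API or the in-engine mechanism being proposed.

```java
import java.util.Arrays;
import java.util.Comparator;

public final class FactorRescorer {
    private FactorRescorer() {}

    /** A candidate hit: document id, raw vector similarity, stored numeric factor. */
    public static final class Hit {
        public final int docId;
        public final float similarity;
        public final float factor;

        public Hit(int docId, float similarity, float factor) {
            this.docId = docId;
            this.similarity = similarity;
            this.factor = factor;
        }

        /** Assumed combination: similarity multiplied by the document factor. */
        public float adjustedScore() {
            return similarity * factor;
        }
    }

    /** Returns a new array sorted by adjusted score, highest first. */
    public static Hit[] rescore(Hit[] candidates) {
        Hit[] sorted = candidates.clone();
        Arrays.sort(sorted,
            Comparator.comparingDouble((Hit h) -> h.adjustedScore()).reversed());
        return sorted;
    }
}
```

   Because the factor is applied at query time, a different function can be 
tried without touching the stored vectors.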
   
   Perhaps this is better viewed not as a Lucene bug, but as a feature request 
for optional score-adjustment mechanisms that can incorporate document-level 
statistics during vector retrieval. This would enable users to efficiently 
combine semantic similarity with additional ranking signals without requiring 
custom post-processing over large candidate sets.
   
   I'd be interested to hear whether others have observed similar ranking 
behavior with modern embedding models and large passage collections, and 
whether a generic mechanism for score normalization or score boosting based on 
stored document attributes would be considered useful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

