[I] Bias Towards Short Text Segments in Vector Search Results [lucene]

via GitHub Tue, 16 Jun 2026 01:53:36 -0700


amoll75 opened a new issue, #16263:
URL: https://github.com/apache/lucene/issues/16263


   ### Description
   
   We have been using Solr's vector search successfully in production for about 
two years. During this time, we have observed a recurring issue:
   
   The highest-ranked results returned by vector search are very often 
dominated by embeddings generated from very short text segments. The underlying 
reason appears to be that a short query or query concept has a much higher 
probability of matching a short segment well than a longer segment containing 
the same information.
   
   However, these short matches typically carry very little informational 
value. In practice, the truly relevant and useful results often appear much 
further down the ranking.
   
   We have been able to mitigate this issue by retrieving a significantly 
larger topK candidate set and then performing an external score normalization 
step based on the square root of the embedding length. This substantially 
improves ranking quality for our use case.
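
A minimal sketch of the over-fetch and rerank step described above, not the exact production code. Several things here are assumptions for illustration: "embedding length" is taken to mean the length of the source text segment (for example, in tokens) stored alongside each document; raw scores are assumed non-negative; and the names `Candidate`, `rerank_by_length`, and `ref_length`, as well as the capped square-root factor itself, are hypothetical.

```python
import math
from typing import NamedTuple


class Candidate(NamedTuple):
    doc_id: str
    score: float       # raw, non-negative similarity score from the vector search
    text_length: int   # length of the source segment (assumed stored with the document)


def rerank_by_length(candidates, top_k, ref_length=256):
    """Damp scores of short segments by a sqrt(length) factor and keep the best top_k.

    Hypothetical formula: segments at or above ref_length keep their raw score;
    shorter segments are scaled by sqrt(text_length / ref_length).
    """
    def adjusted(c):
        factor = math.sqrt(min(c.text_length, ref_length) / ref_length)
        return c.score * factor

    return sorted(candidates, key=adjusted, reverse=True)[:top_k]


# Over-fetched candidate set (illustrative values only).
candidates = [
    Candidate("short", 0.95, 4),
    Candidate("long", 0.80, 256),
    Candidate("mid", 0.85, 64),
]
top = rerank_by_length(candidates, top_k=2)
```

In this toy data the very short segment has the highest raw score but drops out of the top two after rescaling. The cost mentioned in the next paragraph comes from having to fetch far more candidates than `top_k` so that the longer, relevant segments are present at all before reranking.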
   
   Unfortunately, this approach is suboptimal from a performance perspective 
because it requires collecting many more candidates than are actually needed.
   
   I suspect that other users may be affected by the same phenomenon without 
necessarily understanding its root cause, resulting in noticeably worse 
retrieval quality.
   
   One possible workaround is to address the issue during embedding generation 
by artificially padding short texts before computing embeddings. However, this 
approach is much less flexible and may introduce new problems, including 
reducing the discoverability of genuinely relevant short segments.
   
   Therefore, I would like to propose adding support for length-aware score 
normalization directly within Lucene, if this can be achieved with reasonable 
implementation effort. Conceptually, such a normalization step could be applied 
during or after HNSW traversal, allowing rankings to account for embedding 
length without requiring large candidate expansions and external reranking.
   
   It would be interesting to hear whether others have observed similar 
behavior and whether such a feature would be considered a reasonable 
enhancement to Lucene's vector search infrastructure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


