amoll75 opened a new issue, #16263: URL: https://github.com/apache/lucene/issues/16263
### Description

We have been using Solr's vector search successfully in production for about two years. During this time, we have observed a recurring issue: the highest-ranked results returned by vector search are very often dominated by embeddings generated from very short text segments.

The underlying reason appears to be that a short query or query concept has a much higher probability of matching a short segment well than a longer segment containing the same information. However, these short matches typically carry very little informational value. In practice, the truly relevant and useful results often appear much further down the ranking.

We have been able to mitigate this issue by retrieving a significantly larger topK candidate set and then performing an external score normalization step based on the square root of the embedding length. This substantially improves ranking quality for our use case. Unfortunately, this approach is suboptimal from a performance perspective because it requires collecting many more candidates than are actually needed.

I suspect that other users may be affected by the same phenomenon without necessarily understanding its root cause, resulting in noticeably worse retrieval quality.

One possible workaround is to address the issue during embedding generation by artificially padding short texts before computing embeddings. However, this approach is much less flexible and may introduce new problems, including reducing the discoverability of genuinely relevant short segments.

Therefore, I would like to propose adding support for length-aware score normalization directly within Lucene, if this can be achieved with reasonable implementation effort. Conceptually, such a normalization step could be applied during or after HNSW traversal, allowing rankings to account for embedding length without requiring large candidate expansions and external reranking.
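For illustration, the over-fetch-and-rerank workaround described above could look roughly like the following Python sketch. It is not the code we run in production and it does not use any Lucene or Solr API. The `Candidate` type, the `rerank_by_length` function, and the choice to multiply the similarity score by the square root of the segment length (so that very short segments are demoted) are all assumptions made for this example; the exact formula and length unit (tokens, characters) would need tuning per use case.

```python
import math
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical container for one hit from an over-fetched candidate set."""
    doc_id: str
    score: float          # similarity score returned by the vector search
    segment_length: int   # length of the embedded text segment, e.g. in tokens


def rerank_by_length(candidates: List[Candidate], top_k: int) -> List[Candidate]:
    """Rescale each score by sqrt(segment_length) and keep the best top_k.

    Assumption: multiplying by the square root demotes very short segments
    relative to longer ones with a comparable raw similarity.
    """
    rescored = [
        (c.score * math.sqrt(c.segment_length), c)
        for c in candidates
    ]
    # Sort on the adjusted score only, highest first.
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in rescored[:top_k]]


if __name__ == "__main__":
    # Over-fetched candidates (more than the 2 results actually needed).
    candidates = [
        Candidate("short", 0.92, 4),
        Candidate("long", 0.85, 100),
        Candidate("mid", 0.80, 25),
    ]
    for hit in rerank_by_length(candidates, top_k=2):
        print(hit.doc_id, hit.score, hit.segment_length)
```

In this toy input the short segment has the highest raw similarity but drops out of the top 2 after rescaling. The cost is visible in the shape of the code: the reranker can only promote documents that were already in the over-fetched candidate set, which is why the candidate set has to be much larger than the final result size.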
It would be interesting to hear whether others have observed similar behavior and whether such a feature would be considered a reasonable enhancement to Lucene's vector search infrastructure.
