shubhamvishu opened a new issue, #15778:
URL: https://github.com/apache/lucene/issues/15778
### Description
It's very likely that this has already been discussed or explored, but I
couldn’t find any relevant GitHub issue, so I’m opening this to at least
satisfy my curiosity and hopefully get some insights or have it struck down :P.
https://github.com/apache/lucene/blob/665717296595e1a7ca208a1a68cf9bf09dda43cb/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L233-L263
I was looking at `findBestEntryPoint`: we greedily try to find the best entry
point and stop when none of the current node's neighbours is closer to the query.
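For reference, the greedy descent in question can be sketched roughly like this. Note this is a simplified illustration under assumptions of my own (plain arrays and squared Euclidean distance), not Lucene's actual internals, which go through `RandomVectorScorer` and `HnswGraph`:

```java
// Simplified sketch of a greedy entry-point descent on one layer (an
// illustrative assumption, not Lucene's implementation).
public class GreedyDescentSketch {

  // Squared Euclidean distance; Lucene would use a VectorSimilarityFunction.
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Starting from `entry`, repeatedly hop to the closest neighbor until no
  // neighbor is closer than the current node -- a local minimum, which is
  // not guaranteed to be the globally closest node on the layer.
  static int greedyClosest(float[] query, int entry, int[][] neighbors, float[][] vectors) {
    int cur = entry;
    float curDist = squaredDistance(query, vectors[cur]);
    while (true) {
      int best = cur;
      float bestDist = curDist;
      for (int nb : neighbors[cur]) {
        float d = squaredDistance(query, vectors[nb]);
        if (d < bestDist) {
          bestDist = d;
          best = nb;
        }
      }
      if (best == cur) {
        return cur; // stuck at a local minimum
      }
      cur = best;
      curDist = bestDist;
    }
  }

  public static void main(String[] args) {
    // Tiny chain graph 0 -- 1 -- 2 with vectors along a line.
    float[][] vectors = {{0f, 0f}, {1f, 0f}, {2f, 0f}};
    int[][] neighbors = {{1}, {0, 2}, {1}};
    System.out.println(greedyClosest(new float[] {2f, 0f}, 0, neighbors, vectors));
  }
}
```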
Intuitively, I thought that the number of nodes on the topmost layer is
typically quite small in an absolute sense (example: ~10-20 for a graph of 5M
docs, and the same holds even for higher doc counts). Is it possible that, in
some rare cases (though I don't have data to back this up), we might miss the
node that's closest to the query? This could result in landing somewhere
farther away in the vector space on the bottommost layer, leading to more
computation. Could this be avoided by simply iterating over all the nodes on
the topmost layer? Or, perhaps we could go a step further and perform an
exhaustive search on every layer containing at most `X% of
expectedVisitedNodes` (where X is configurable, e.g. <=0.1% or 1%, or chosen
by some heuristic, so that at minimum only the top level is searched
exhaustively, while a higher X means trying harder by extending exhaustive
search to the next layers down).
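A minimal sketch of what "simply iterating over the nodes on the topmost layer" could look like. The method name, the list-of-vectors representation, and the distance function are illustrative assumptions, not Lucene's API:

```java
// Hypothetical sketch: brute-force the best entry point over all nodes on the
// topmost layer instead of descending greedily. Not Lucene's actual code.
import java.util.List;

public class ExhaustiveEntryPointSketch {

  // Squared Euclidean distance; Lucene would use a VectorSimilarityFunction.
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  // Returns the index (into topLayerVectors) of the node closest to the
  // query. With only ~10-20 nodes on the top layer this is just a handful of
  // distance computations, and it cannot get stuck in a local minimum.
  static int findBestEntryPointExhaustive(float[] query, List<float[]> topLayerVectors) {
    int best = -1;
    float bestDist = Float.POSITIVE_INFINITY;
    for (int i = 0; i < topLayerVectors.size(); i++) {
      float d = squaredDistance(query, topLayerVectors.get(i));
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    List<float[]> top = List.of(
        new float[] {0f, 0f},
        new float[] {5f, 5f},
        new float[] {1f, 1f});
    // Closest node to the query {1.2, 0.9} is {1, 1}, at index 2.
    System.out.println(findBestEntryPointExhaustive(new float[] {1.2f, 0.9f}, top));
  }
}
```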
For `M=64`, `n=5,000,000`:
| Level | Probability | Expected nodes |
|-------|------------|----------------|
| 0 | 100% | 5,000,000 |
| 1 | 1/64 | 78,125 |
| 2 | 1/4,096 | 1,221 |
| 3 | 1/262,144 | ~19 |
For `n=5,000,000` and `K=100` => `expectedVisitedNodes(K, n)` = `ln(5000000)
* 100` = `15.4249 * 100` = `~1542` nodes, which is `1542 / 5000000 = ~0.03%`
of the graph, comfortably under an X of 0.1% or 1%.
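The arithmetic above can be sanity-checked in a few lines (plain math, nothing Lucene-specific; the per-level expectation `n / M^level` and the `ln(n) * K` visited-nodes estimate are the same back-of-the-envelope formulas used above):

```java
// Sanity check of the level-population table and the visited-nodes estimate.
public class HnswLevelMath {
  public static void main(String[] args) {
    long n = 5_000_000L;
    int M = 64;
    // Expected node count on each level: n / M^level.
    for (int level = 0; level <= 3; level++) {
      System.out.printf("level %d: ~%.0f nodes%n", level, n / Math.pow(M, level));
    }
    // Rough visited-nodes estimate: ln(n) * K, here ~1542 (~0.03% of n).
    int K = 100;
    double visited = Math.log(n) * K;
    System.out.printf("expected visited ~= %.0f (%.3f%% of n)%n", visited, 100.0 * visited / n);
  }
}
```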
I understand that we don’t want to lose the main benefit of HNSW search by
making everything exhaustive or by introducing too many knobs. The idea is to
apply this approach only where the cost is minimal, while hopefully improving
recall and latency by starting the bottom-layer search closer to the query.
I’m also wondering whether the top layer(s) are always well-connected (i.e.,
never a disconnected graph) and whether the greedy entry point on the topmost
layer is always the right one; if so, that would mean this doesn't translate
into actual benefits even though it seems intuitive to me now. I'd be happy to
be proven wrong, or to experiment if it sounds reasonable.
Benefit/Why do this: In some cases (perhaps adversarial ones), we could
reduce the number of nodes visited, i.e. save latency, when not choosing the
best candidate on the topmost layer becomes unnecessarily costly. It could
also improve recall by bringing the search closer to the query's vicinity. The
idea was sparked mainly because we aren't performing an exact search on the top
layer, where the absolute number of nodes is very small, and maybe there are
edge cases in certain vector datasets where the current approach leaves some
potential benefits on the table. I'd love to hear more thoughts on this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]