shubhamvishu opened a new issue, #15778:
URL: https://github.com/apache/lucene/issues/15778

   ### Description
   
   It's very likely that this has already been discussed or explored, but I 
couldn’t find any relevant GitHub issue, so I’m opening this to at least 
satisfy my curiosity and hopefully get some insights or have it struck down :P. 
    
https://github.com/apache/lucene/blob/665717296595e1a7ca208a1a68cf9bf09dda43cb/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L233-L263
 
   
   I was looking at `findBestEntryPoint`: we greedily try to find the best entry point and stop when none of the neighbours is better than the current node. Intuitively, the number of nodes on the topmost layer is typically quite small in an absolute sense (example: ~10-20 for a graph of 5M docs, and the same holds even for higher doc counts). Is it possible that, in some rare cases (though I don't have data to back this up), we might miss the node that's closest to the query? This could result in landing somewhere farther away in the vector space on the bottommost layer, leading to more computation. Could this be avoided by simply iterating over all the nodes on the topmost layer? Or, going a step further, we could perform an exhaustive search on every layer containing at most `X% of expectedVisitedNodes` nodes (where X is configurable, e.g., <=0.1% or 1%, or some heuristic; at minimum only the top layer would be searched exhaustively, and a higher X means trying harder with exhaustive search on the next layers down).
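   To make the idea concrete, here is a minimal, hypothetical sketch (not the actual Lucene API; the class and method names are made up, and it uses plain squared Euclidean distance where Lucene would use a `VectorSimilarityFunction`): instead of greedily descending from a single fixed entry node, scan every node on the topmost layer and start the descent from the closest one.

   ```java
   import java.util.List;

   // Hypothetical sketch: pick the entry point by exhaustively scanning
   // the (small) topmost layer rather than greedily walking it.
   public class ExhaustiveTopLayerEntry {

     // Squared Euclidean distance between two vectors.
     static float sqDist(float[] a, float[] b) {
       float sum = 0f;
       for (int i = 0; i < a.length; i++) {
         float d = a[i] - b[i];
         sum += d * d;
       }
       return sum;
     }

     // Returns the id of the top-layer node closest to the query.
     static int bestEntryPoint(float[] query, List<Integer> topLayerNodes, float[][] vectors) {
       int best = -1;
       float bestDist = Float.POSITIVE_INFINITY;
       for (int node : topLayerNodes) {
         float d = sqDist(query, vectors[node]);
         if (d < bestDist) {
           bestDist = d;
           best = node;
         }
       }
       return best;
     }

     public static void main(String[] args) {
       float[][] vectors = {{0f, 0f}, {1f, 1f}, {5f, 5f}};
       // Suppose nodes 1 and 2 made it to the topmost layer.
       int entry = bestEntryPoint(new float[] {4f, 4f}, List.of(1, 2), vectors);
       System.out.println(entry); // node 2 is closer to (4, 4)
     }
   }
   ```

   Since the topmost layer has only tens of nodes, this exhaustive pass costs a handful of extra distance computations at most, which is the "cost is minimal" part of the argument.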
   
   For `M=64`, `n=5,000,000`:
   
   | Level | Probability | Expected nodes |
   |-------|------------|----------------|
   | 0 | 100% | 5,000,000 |
   | 1 | 1/64 | 78,125 |
   | 2 | 1/4,096 | 1,221 |
   | 3 | 1/262,144 | ~19 |
   
   For `n=5,000,000` and `K=100`: `expectedVisitedNodes(K, n)` = `ln(5000000) * 100` = `15.4249 * 100` ≈ `1542` nodes, which is `1542 / 5,000,000 ≈ 0.03%` of the graph.
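   The table and the estimate above can be reproduced with a few lines, assuming each node is promoted to level `l` with probability `1/M^l` (Lucene's geometric level assignment has this shape) and the `expectedVisitedNodes ≈ ln(n) * K` heuristic from above:

   ```java
   // Reproduces the per-level node counts and the expectedVisitedNodes
   // estimate; promotion probability 1/M^l per level is an assumption.
   public class HnswLevelStats {

     // Expected number of nodes on a given level for a graph of n nodes.
     static double expectedNodesOnLevel(long n, int M, int level) {
       return n / Math.pow(M, level);
     }

     // expectedVisitedNodes heuristic: ln(n) * K.
     static double expectedVisitedNodes(int K, long n) {
       return Math.log(n) * K;
     }

     public static void main(String[] args) {
       int M = 64;
       long n = 5_000_000L;
       int K = 100;
       for (int level = 0; level <= 3; level++) {
         System.out.printf("level %d: ~%.0f nodes%n", level, expectedNodesOnLevel(n, M, level));
       }
       double visited = expectedVisitedNodes(K, n);
       System.out.printf("expectedVisitedNodes ~ %.0f (%.3f%% of n)%n",
           visited, 100.0 * visited / n);
     }
   }
   ```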
   
   I understand that we don't want to lose the main benefit of HNSW search by making everything exhaustive or introducing too many knobs. The idea is to apply this approach only where the cost is minimal, while hopefully improving recall and latency by bringing the search closer to the query on the bottom layer. I'm also wondering whether the top layer(s) are always well-connected (i.e., never a disconnected graph) and whether the greedy descent already lands on an accurate entry point on the topmost layer — if so, this might not translate into actual benefits even though it seems intuitive to me now. I'd be happy to be proven wrong, or to experiment if it sounds reasonable.
   
   Benefit/Why do this: In some cases (perhaps adversarial ones), we could reduce the number of nodes visited, i.e., save latency, when not choosing the best candidate on the topmost layer becomes unnecessarily costly. This could also improve recall by bringing the search closer to the query's vicinity. The idea was sparked mainly because we aren't performing an exact search on the top layer, where the absolute number of nodes is very small, and maybe there are edge cases in certain vector datasets where this approach leaves some potential benefits on the table. I'd love to hear more thoughts on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
