nitirajrathore opened a new issue, #12627:
URL: https://github.com/apache/lucene/issues/12627

   ### Description
   
   I work for Amazon Retail Product search and we are using Lucene KNN for
   semantic search of products. 
   We recently noticed that the hnsw graphs generated are not always strongly
   connected and in worst case scenario some products may be undiscoverable.
   Connectedness of Hierarchical graph can be complicated, so below I am
   mentioning my experiment details.
   
   ### Experiment:
   I took the Lucene indexes from our production servers and for each segment
   (hnsw graph) I did following test.
   At each level graph I took the same entry point, the entry point of HNSW
   graph, checked how many nodes are reachable from this entrypoint. Note that
   connectedness at each level was checked independently of other levels.
   Sample code attached. My observations are as below.
   
   - Observation :
   1. Of all the graphs across all the segments, across 100s of indexes that I
   considered, one graph for each "level" of HNSW, almost 18% of the graphs
   had some disconnectedness.
   2. Disconnectedness was observed at all the levels of HNSW graphs. I saw we 
have
   at most 3 to 4 levels in HNSW graphs.
   3. percentage disconnectedness ranged from small fractions 0.000386% (1
   disconnected out of 259342)  to 3.7% (eg. 87 disconnected out of 2308).
   In some extreme case the entry-point(EP) in zeroth level graph was 
disconnected
   from rest of the graph making the %age disconnected as high as 99.9% (only 65
   reachable nodes from EP out of 252275). But this does not necessarily mean
   that the 99.9% of nodes were not discoverable, it just means that if
   unluckily we end up on EP in the 0th level graph for a query, there can at
   max be 65 nodes that can be reached. But had we diverted our path from EP
   to some other node in the upper level graphs then may be more nodes be
   discoverable via that node.
   
   ### Reproduciability
      Although it is fairly easy to be reproduced with Amazon's internal data, 
but I found it really hard to reproduce it with random vectors. I will attached 
a simple main class of my test and the parameters that I tried. I only got some 
disconnectedness as listed in below table. For higher number of vectors, the 
execution becomes painfully slow because of hnsw graph creation. I will 
continue to work to find some  parameters for reproducibilty and also explore 
*open source datasets* for the same.
   
   ``DIS-L0 means: %age of disconnected nodes in graph at level-0 of HNSW 
graph.``
   
   | randomness seed       | dimension | no of doc vectors      | Max-conn   | 
Beam-width | dis-l0 | dis-l1                      | dis-l2 | dis-l3 | dis-l4 |
   | ---------- | --- | --------- | --- | --------- | ------- | 
---------------------------- | ------- | ------- | ------- |
   | -219343918 | 256 | 100_000   | 16  | 100       | 0       | 0.0158 (1 node 
disconnected) | 0       | 0       | 0       |
   | -245556271 | 256 | 100_000   | 16  | 100       | 0       | 0.0158 (1 node 
disconnected) | 0       | 0       | 0       |
   | -766887769 | 256 | 1_000_000 | 16  | 100       | 0.0003  | 0.04166         
             | 0       | 0       | 0       |
   
   No direct solution comes to my mind but as per my discussion in this email 
thread. I will look into the part where some connections are removed to 
maintain the Max-Conn property (diversity check etc.).
   
   ### Version and environment details
   
   Lucene version : 9.7


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to