nitirajrathore opened a new issue, #12627: URL: https://github.com/apache/lucene/issues/12627
### Description I work for Amazon Retail Product search and we are using Lucene KNN for semantic search of products. We recently noticed that the hnsw graphs generated are not always strongly connected and in worst case scenario some products may be undiscoverable. Connectedness of Hierarchical graph can be complicated, so below I am mentioning my experiment details. ### Experiment: I took the Lucene indexes from our production servers and for each segment (hnsw graph) I did following test. At each level graph I took the same entry point, the entry point of HNSW graph, checked how many nodes are reachable from this entrypoint. Note that connectedness at each level was checked independently of other levels. Sample code attached. My observations are as below. - Observation : 1. Of all the graphs across all the segments, across 100s of indexes that I considered, one graph for each "level" of HNSW, almost 18% of the graphs had some disconnectedness. 2. Disconnectedness was observed at all the levels of HNSW graphs. I saw we have at most 3 to 4 levels in HNSW graphs. 3. percentage disconnectedness ranged from small fractions 0.000386% (1 disconnected out of 259342) to 3.7% (eg. 87 disconnected out of 2308). In some extreme case the entry-point(EP) in zeroth level graph was disconnected from rest of the graph making the %age disconnected as high as 99.9% (only 65 reachable nodes from EP out of 252275). But this does not necessarily mean that the 99.9% of nodes were not discoverable, it just means that if unluckily we end up on EP in the 0th level graph for a query, there can at max be 65 nodes that can be reached. But had we diverted our path from EP to some other node in the upper level graphs then may be more nodes be discoverable via that node. ### Reproduciability Although it is fairly easy to be reproduced with Amazon's internal data, but I found it really hard to reproduce it with random vectors. I will attached a simple main class of my test and the parameters that I tried. I only got some disconnectedness as listed in below table. For higher number of vectors, the execution becomes painfully slow because of hnsw graph creation. I will continue to work to find some parameters for reproducibilty and also explore *open source datasets* for the same. ``DIS-L0 means: %age of disconnected nodes in graph at level-0 of HNSW graph.`` | randomness seed | dimension | no of doc vectors | Max-conn | Beam-width | dis-l0 | dis-l1 | dis-l2 | dis-l3 | dis-l4 | | ---------- | --- | --------- | --- | --------- | ------- | ---------------------------- | ------- | ------- | ------- | | -219343918 | 256 | 100_000 | 16 | 100 | 0 | 0.0158 (1 node disconnected) | 0 | 0 | 0 | | -245556271 | 256 | 100_000 | 16 | 100 | 0 | 0.0158 (1 node disconnected) | 0 | 0 | 0 | | -766887769 | 256 | 1_000_000 | 16 | 100 | 0.0003 | 0.04166 | 0 | 0 | 0 | No direct solution comes to my mind but as per my discussion in this email thread. I will look into the part where some connections are removed to maintain the Max-Conn property (diversity check etc.). ### Version and environment details Lucene version : 9.7 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org