Hi, MyCoy. I think these questions belong on the dev@ list. Please join it.
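
As an aside, the per-segment search described in point 2 of the quoted message (each segment's graph searched separately, with the per-segment hits then combined into one result list) can be sketched roughly as follows. This is only an illustration, not Lucene code: a brute-force scan stands in for each segment's HNSW graph search, and the names `search_segment`, `search_index` and the toy `segments` data are hypothetical.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_segment(segment, query, k):
    """Stand-in for one segment's graph search (assumption: exact brute force).

    `segment` maps doc id -> vector. Returns up to k (distance, doc_id) pairs,
    closest first.
    """
    scored = ((l2(vec, query), doc_id) for doc_id, vec in segment.items())
    return heapq.nsmallest(k, scored)


def search_index(segments, query, k):
    """Search every segment independently, then merge into a global top-k."""
    candidates = []
    for segment in segments:
        # Each segment contributes at most k candidates of its own.
        candidates.extend(search_segment(segment, query, k))
    return heapq.nsmallest(k, candidates)


# Toy index: three segments with five docs in total (made-up data).
segments = [
    {0: (0.0, 0.0), 1: (5.0, 5.0)},
    {2: (0.1, 0.0), 3: (9.0, 9.0)},
    {4: (0.0, 0.2)},
]

top = search_index(segments, query=(0.0, 0.0), k=2)
print([doc_id for _, doc_id in top])
```

With S segments, up to S * k candidates are collected before the final cut to k, so in this sketch the per-query work grows with the number of segments.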

On Wed, Nov 2, 2022 at 12:57 AM MyCoy Z <mycoy.zh...@gmail.com> wrote:

> Hi:
>
> I'm studying the HNSW source code and have some questions regarding
> Lucene's multi-segments and HNSW.
>
> First, some of my understanding:
> 1. While creating the index, when two segments are being merged, it could
> rebuild the HNSW graph based on the docs and vectors in the two segments.
> 2. But while reading the index, each segment's graph is loaded separately.
> There is no way to merge multiple-graphs.
> The search will iterate each segment separately.
> Please let me know if there is any misunderstanding.
>
> Since HNSW is a graph, the connections between the nodes could matter a
> lot.
> I can imagine some pros and cons here.
> 1. By splitting the docs into multiple separate graphs, it could help the
> diversity by retrieving more docs.
> For example, if just a single graph, some docs could be too far in the
> Neighbor list to be retrieved. And one way to mitigate this is, dividing
> the docs into multiple graphs.
> It could also help to boost the performance.
>
> 2. However, too many segments could cause other issues.
> For example, retrieving too many irrelevant docs, especially if there
> are not so many docs in a segment.
>
> So, I think the number of segments and the size of the graphs could have a
> real impact on the retrieving quality and performance.
>
> I'm wondering if there is any best practice, e.g. how many docs should be
> in a single graph?
> Or does anyone have some production experience to share?
>
> Thanks & Regards
> MyCoy

--
Sincerely yours
Mikhail Khludnev