Heya Patrick,

What version of luceneutil are you using? There was a bug where `forceMerge` was not actually using your configured maxConn & beamWidth. See: https://github.com/mikemccand/luceneutil/pull/232
Do you have that commit, and have you rebuilt the KnnGraphTester?

On Wed, Oct 11, 2023 at 10:10 AM Patrick Zhai <zhai7...@gmail.com> wrote:

> Hi Adrien,
> I'm using the default CMS, but I doubt whether any merge is triggered at
> all in the background. Since no merge policy is changed, the default TMP
> will likely only merge segments once they reach 10, I believe. But the
> index is about 300MB and the buffer size is around 50MB, so I don't think
> we will have enough segments to trigger a merge while I'm building the
> index.
>
> On Wed, Oct 11, 2023, 02:47 Adrien Grand <jpou...@gmail.com> wrote:
>
>> Regarding building time, did you configure a SerialMergeScheduler?
>> Otherwise merges run in separate threads, which would explain the
>> speedup, as adding vectors to the graph gets more and more expensive as
>> the graph grows.
>>
>> On Wed, Oct 11, 2023, 05:07, Patrick Zhai <zhai7...@gmail.com> wrote:
>>
>>> Hi folks,
>>> I was running the HNSW benchmark today and found some weird results.
>>> I want to share them here and see whether anyone has ideas.
>>>
>>> The setup is: the 384-dimension vectors available in luceneutil, 100k
>>> documents, and the Lucene main branch, with max_conn=64, fanout=0,
>>> beam_width=250.
>>>
>>> I first tried the default setting, which uses a 1994MB writer buffer,
>>> so with 100k documents no merge happens and I end up with 1 segment.
>>> This gives me 0.755 recall and 101113ms index building time.
>>>
>>> Then I tried a 50MB writer buffer followed by a forceMerge at the end.
>>> With 100k documents I get several segments before the merge (the final
>>> index is around 300MB, so I guess 5 or 6), which are then merged into 1.
>>> This gives me 0.692 recall, but it took only 81562ms to index
>>> (including 34394ms doing the merge).
>>> I have also tried disabling the initialize-from-graph feature (so that
>>> when we merge we always rebuild the whole graph from scratch), and
>>> changing the random seed, but I still get similar results.
>>>
>>> I'm wondering:
>>> 1. Why does recall drop that much in the latter setup?
>>> 2. Why is the index time so much better? I think we still need to
>>> rebuild the whole graph; or maybe it's just that we use more off-heap
>>> memory (and less heap) during the merge (do we?).
>>>
>>> Best,
>>> Patrick
>>
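[Editor's note] For readers following the recall numbers in this thread: recall here means the fraction of the exact top-k nearest neighbors that the approximate (HNSW) search also returns, averaged over all queries. The sketch below is a minimal, generic version of that computation; the class name `Recall` and method `recallAtK` are made up for illustration and are not the actual KnnGraphTester code.

```java
import java.util.HashSet;
import java.util.Set;

public class Recall {
    /**
     * Recall@k: for each query, the fraction of the true top-k neighbor
     * ids (from exact search) that also appear in the approximate top-k,
     * averaged over all queries.
     */
    static double recallAtK(int[][] exact, int[][] approx, int k) {
        double total = 0;
        for (int q = 0; q < exact.length; q++) {
            // Collect the ground-truth neighbor ids for this query.
            Set<Integer> truth = new HashSet<>();
            for (int i = 0; i < k; i++) {
                truth.add(exact[q][i]);
            }
            // Count how many approximate results hit the ground truth.
            int hits = 0;
            for (int i = 0; i < k; i++) {
                if (truth.contains(approx[q][i])) {
                    hits++;
                }
            }
            total += (double) hits / k;
        }
        return total / exact.length;
    }

    public static void main(String[] args) {
        int[][] exact = {{1, 2, 3, 4}, {5, 6, 7, 8}};
        int[][] approx = {{1, 2, 9, 4}, {5, 6, 7, 8}};
        // Query 0 recovers 3 of 4 true neighbors, query 1 recovers all 4.
        System.out.println(recallAtK(exact, approx, 4)); // prints 0.875
    }
}
```

A drop from 0.755 to 0.692 under this metric means the merged graph returns noticeably fewer of the true neighbors per query, which is why the segment count before forceMerge matters.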