Heya Patrick,

What version of luceneutil are you using? There was a bug where
`forceMerge` was not actually using your configured maxConn & beamWidth.
See: https://github.com/mikemccand/luceneutil/pull/232

Do you have that commit, and have you rebuilt KnnGraphTester?
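
In case it helps, this is roughly how I'd expect maxConn/beamWidth to be
wired into the writer. This is only a sketch against Lucene 9.x (the
`Lucene95*` class names depend on your exact version), and the 50MB buffer
and serial scheduler lines just mirror the settings discussed below:

```java
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene95.Lucene95Codec;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;

public class HnswIndexConfig {
  public static IndexWriterConfig newConfig(int maxConn, int beamWidth) {
    IndexWriterConfig iwc = new IndexWriterConfig();
    // Use the configured HNSW parameters for every vector field. If this
    // is skipped, the default maxConn/beamWidth are silently used instead,
    // which is also what the luceneutil forceMerge bug caused.
    iwc.setCodec(
        new Lucene95Codec() {
          @Override
          public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
            return new Lucene95HnswVectorsFormat(maxConn, beamWidth);
          }
        });
    // Smaller RAM buffer -> more flushed segments before the forceMerge.
    iwc.setRAMBufferSizeMB(50);
    // Run merges on the indexing thread so merge time is not hidden
    // by background threads when measuring indexing time.
    iwc.setMergeScheduler(new SerialMergeScheduler());
    return iwc;
  }
}
```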

On Wed, Oct 11, 2023 at 10:10 AM Patrick Zhai <zhai7...@gmail.com> wrote:

> Hi Adrien,
> I'm using the default CMS, but I doubt any merge is triggered in the
> background at all. Since the merge policy is unchanged, the default TMP
> will only merge segments once there are 10 of them, I believe. The index
> is about 300MB and the buffer size is around 50MB, so I don't think we
> produce enough segments to trigger a merge while building the index?
>
> On Wed, Oct 11, 2023, 02:47 Adrien Grand <jpou...@gmail.com> wrote:
>
>> Regarding building time, did you configure a SerialMergeScheduler?
>> Otherwise merges run in separate threads, which would explain the speedup
>> as adding vectors to the graph gets more and more expensive as the size of
>> the graph increases.
>>
>> On Wed, Oct 11, 2023, 05:07 Patrick Zhai <zhai7...@gmail.com> wrote:
>>
>>> Hi folks,
>>> I was running the HNSW benchmark today and found some weird results.
>>> Want to share it here and see whether people have any ideas.
>>>
>>> The setup is:
>>> the 384-dimension vectors available in luceneutil, 100k documents,
>>> and the Lucene main branch.
>>> max_conn=64, fanout=0, beam_width=250
>>>
>>> I first tried the default setting with a 1994MB writer buffer, so with
>>> 100k documents no merge happens and I end up with 1 segment.
>>> This gives me 0.755 recall and 101113ms index building time.
>>>
>>> Then I tried a 50MB writer buffer with a forceMerge at the end. With
>>> 100k documents I get several segments before the merge (the final index
>>> is around 300MB, so I guess 5 or 6), which are then merged into 1.
>>> This gives me 0.692 recall but only 81562ms of indexing time (including
>>> 34394ms for the merge).
>>> I have also tried disabling the initialize-from-graph feature (so that
>>> when we merge we always rebuild the whole graph) and changing the random
>>> seed, but I still get similar results.
>>>
>>> I'm wondering:
>>> 1. Why does recall drop that much in the latter setup?
>>> 2. Why is indexing time so much better? I think we still need to rebuild
>>> the whole graph; or maybe it's just because we're using more off-heap
>>> memory (and less heap) during the merge (do we?)?
>>>
>>> Best
>>> Patrick
>>>
>>
