Hi Adrien,
I'm using the default CMS, but I doubt the merge will be triggered at all
in the background. Since the merge policy isn't changed, the default TMP
will likely only merge segments once there are 10 of them, I believe? But
the index is about 300MB and the buffer size is around 50MB, so I don't
think we will have enough segments to trigger a merge while I'm building
the index?
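
For reference, here is roughly the writer setup I have in mind. This is
just a sketch, not the exact luceneutil/KnnGraphTester code, and the class
and path names are made up:

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class KnnIndexSketch {                 // hypothetical name
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(Paths.get("/tmp/knn-index")); // made-up path
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setRAMBufferSizeMB(50);               // the 50MB buffer from my second run
    // Otherwise I rely on the defaults: TieredMergePolicy (segsPerTier=10)
    // and ConcurrentMergeScheduler, so any merge would run in a background
    // thread. To rule out background merges entirely one could do:
    // iwc.setMergeScheduler(new SerialMergeScheduler());
    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
      // ... add the 100k vector documents here ...
      writer.forceMerge(1);                   // the forced merge at the end
    }
  }
}

So with roughly 5 or 6 flushed segments I wouldn't expect TMP to pick a
natural merge before the forceMerge kicks in, but please correct me if I'm
wrong.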

On Wed, Oct 11, 2023, 02:47 Adrien Grand <jpou...@gmail.com> wrote:

> Regarding building time, did you configure a SerialMergeScheduler?
> Otherwise merges run in separate threads, which would explain the speedup
> as adding vectors to the graph gets more and more expensive as the size of
> the graph increases.
>
> On Wed, Oct 11, 2023, 05:07, Patrick Zhai <zhai7...@gmail.com> wrote:
>
>> Hi folks,
>> I was running the HNSW benchmark today and found some weird results. I
>> want to share them here and see whether people have any ideas.
>>
>> The setup is:
>> the 384-dimension vectors available in luceneutil, 100k documents, and
>> the Lucene main branch.
>> max_conn=64, fanout=0, beam_width=250
>>
>> I first tried the default setting, which uses a 1994MB writer buffer,
>> so with 100k documents no merge happens and I end up with 1 segment at
>> the end.
>> This gives me 0.755 recall and 101113ms index building time.
>>
>> Then I tried a 50MB writer buffer with a forceMerge at the end. With
>> 100k documents I get several segments before the merge (the final index
>> is around 300MB, so I guess 5 or 6), which are then merged into 1.
>> This gives me 0.692 recall, but it took only 81562ms to index
>> (including 34394ms doing the merge).
>> I have also tried disabling the initialize-from-graph feature (such
>> that when we merge we always rebuild the whole graph) and changing the
>> random seed, but I still get similar results.
>>
>> I'm wondering:
>> 1. Why does recall drop that much in the latter setup?
>> 2. Why is the indexing time so much better? I think we still need to
>> rebuild the whole graph, or maybe it's just because we're using more
>> off-heap memory (and less heap) during the merge (do we?)?
>>
>> Best
>> Patrick
>>
>
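
P.S. for anyone trying to reproduce this: the max_conn/beam_width above
end up in the codec configuration roughly like the sketch below. Again,
this is not the actual luceneutil/KnnGraphTester code (it does this wiring
internally), the field/class names are made up, and the exact codec class
depends on which snapshot of main you're on:

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene95.Lucene95Codec;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;

public class HnswCodecSketch {                // hypothetical name
  // max_conn=64, beam_width=250 from the benchmark; fanout only matters
  // at search time, so it does not show up here.
  public static IndexWriterConfig withHnswParams(IndexWriterConfig iwc) {
    iwc.setCodec(new Lucene95Codec() {
      @Override
      public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
        return new Lucene95HnswVectorsFormat(64, 250);
      }
    });
    return iwc;
  }

  // One 384-dim float vector per document, as in the luceneutil dataset;
  // the similarity function here is an arbitrary choice for illustration.
  public static Document vectorDoc(float[] vector) {
    Document doc = new Document();
    doc.add(new KnnFloatVectorField("vector", vector,
        VectorSimilarityFunction.DOT_PRODUCT));
    return doc;
  }
}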
