Yes, it makes a difference. Indexing everything in one go takes less time and CPU and produces a single segment (assuming the data does not exceed the IndexWriter RAM buffer size). If you index a lot of little segments and then force-merge them, it takes longer, because the graphs have to be built for the little segments first and then again for the big segment during the merge, and the merge will ultimately use the same amount of RAM to build the big graph. I don't believe it has to load the vectors en masse into RAM while merging, though.
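To make that concrete, here is a rough sketch of the single-pass approach. It assumes a recent Lucene 9.x with KnnFloatVectorField; the index path, field name, 2 GB buffer size, document count, and the loadVector() helper are just placeholders, not a recommendation:

    import java.nio.file.Paths;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class BulkVectorIndexer {

      public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig();
        // A larger RAM buffer means fewer flushes, so fewer small segments
        // (and fewer per-segment HNSW graphs) to merge later.
        iwc.setRAMBufferSizeMB(2048);

        try (Directory dir = FSDirectory.open(Paths.get("/tmp/vector-index"));
             IndexWriter writer = new IndexWriter(dir, iwc)) {
          for (int i = 0; i < 1_000_000; i++) {
            Document doc = new Document();
            doc.add(new KnnFloatVectorField("vec", loadVector(i)));
            writer.addDocument(doc);
          }
          writer.commit();
          // Optional: merging down to one segment rebuilds the whole graph in RAM,
          // on top of the graphs already built for each flushed segment.
          writer.forceMerge(1);
        }
      }

      // Placeholder for wherever your vectors actually come from.
      private static float[] loadVector(int i) {
        return new float[768];
      }
    }

The larger the RAM buffer, the closer you get to "one big segment up front" and the less work the final forceMerge has to redo.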
On Thu, Apr 6, 2023 at 10:20 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>
> thanks very much for these insights!
>
> Does it make a difference re RAM when I do a batch import, for example
> import 1000 documents and close the IndexWriter and do a forceMerge or
> import 1Mio documents at once?
>
> I would expect so, or do I misunderstand this?
>
> Thanks
>
> Michael
>
>
> On 06.04.23 at 16:11 Michael Sokolov wrote:
> > re: how does this HNSW stuff scale - I think people are calling out
> > indexing memory usage here, so let's discuss some facts. During
> > initial indexing we hold in RAM all the vector data and the graph
> > constructed from the new documents, but this is accounted for and
> > limited by the size of IndexWriter's buffer; the document vectors and
> > their graph will be flushed to disk when this fills up, and at search
> > time, they are not read in wholesale to RAM. There is potentially
> > unbounded RAM usage during merging though, because the entire merged
> > graph will be built in RAM. I lost track of how we handle the vector
> > data now, but at least in theory it should be fairly straightforward
> > to write the merged vector data in chunks using only limited RAM. So
> > how much RAM does the graph use? It uses numdocs*fanout VInts.
> > Actually it doesn't really scale with the vector dimension at all -
> > rather it scales with the graph fanout (M) parameter and with the
> > total number of documents. So I think this focus on limiting the
> > vector dimension is not helping to address the concern about RAM usage
> > while merging.
> >
> > The vector dimension does have a strong role in the search, and
> > indexing time, but the impact is linear in the dimension and won't
> > exhaust any limited resource.
> >
> > On Thu, Apr 6, 2023 at 5:48 AM Michael McCandless
> > <luc...@mikemccandless.com> wrote:
> >>> We shouldn't accept weakly/not scientifically motivated vetos anyway
> >>> right?
> >> In fact we must accept all vetos by any committer as a veto, for a change
> >> to Lucene's source code, regardless of that committer's reasoning. This
> >> is the power of Apache's model.
> >>
> >> Of course we all can and will work together to convince one another (this
> >> is where the scientifically motivated part comes in) to change our votes,
> >> one way or another.
> >>
> >>> I'd ask anyone voting +1 to raise this limit to at least try to index a
> >>> few million vectors with 756 or 1024, which is allowed today.
> >> +1, if the current implementation really does not scale / needs more and
> >> more RAM for merging, let's understand what's going on here, first, before
> >> increasing limits. I rescind my hasty +1 for now!
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Wed, Apr 5, 2023 at 11:22 AM Alessandro Benedetti
> >> <a.benede...@sease.io> wrote:
> >>> Ok, so what should we do then?
> >>> This space is moving fast, and in my opinion we should act fast to
> >>> release and guarantee we attract as many users as possible.
> >>>
> >>> At the same time I am not saying we should proceed blind, if there's
> >>> concrete evidence for setting a limit rather than another, or that a
> >>> certain limit is detrimental to the project, I think that veto should be
> >>> valid.
> >>>
> >>> We shouldn't accept weakly/not scientifically motivated vetos anyway
> >>> right?
> >>>
> >>> The problem I see is that more than voting we should first decide this
> >>> limit and I don't know how we can operate.
> >>> I am imagining like a poll where each entry is a limit + motivation and
> >>> PMCs maybe vote/add entries?
> >>>
> >>> Did anything similar happen in the past? How was the current limit added?
> >>>
> >>>
> >>> On Wed, 5 Apr 2023, 14:50 Dawid Weiss, <dawid.we...@gmail.com> wrote:
> >>>>
> >>>>> Should create a VOTE thread, where we propose some values with a
> >>>>> justification and we vote?
> >>>>
> >>>> Technically, a vote thread won't help much if there's no full consensus
> >>>> - a single veto will make the patch unacceptable for merging.
> >>>> https://www.apache.org/foundation/voting.html#Veto
> >>>>
> >>>> Dawid
> >>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org