Can we set up a branch in which the limit is bumped to 2048, then put together a realistic, free data set (a Wikipedia sample or similar) with, say, 5 million docs and vectors created from public data (GloVe pre-trained embeddings or the like)? We could then run indexing on the same hardware with 512, 1024 and 2048 dimensions and see what the numbers, limits and behavior actually are.
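For concreteness, here is a minimal sketch of what such a benchmark harness might look like, assuming Lucene 9.x APIs (KnnVectorField, IndexWriter). The GLOVE_PATH file name, the DIMS constant and the tiling of 300-dimensional GloVe vectors up to the target dimension are illustrative choices, not something that exists on any branch:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.FSDirectory;

public class VectorIndexBench {

  // Placeholders: point GLOVE_PATH at a local copy of the pre-trained GloVe
  // vectors and rerun with DIMS = 512, 1024 and 2048 (2048 assumes the
  // limit-bumped branch).
  static final Path GLOVE_PATH = Paths.get("glove.840B.300d.txt");
  static final int DIMS = 1024;

  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig();
    iwc.setRAMBufferSizeMB(2048); // the multi-gigabyte buffer under discussion
    try (IndexWriter writer =
            new IndexWriter(FSDirectory.open(Paths.get("bench-index")), iwc);
        BufferedReader reader = Files.newBufferedReader(GLOVE_PATH)) {
      long start = System.nanoTime();
      int id = 0;
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split(" ");
        // GloVe vectors are 300-dimensional; tile them up to DIMS purely so
        // the same input can exercise 512/1024/2048 - not meaningful embeddings.
        float[] vector = new float[DIMS];
        for (int i = 0; i < DIMS; i++) {
          vector[i] = Float.parseFloat(parts[1 + (i % (parts.length - 1))]);
        }
        Document doc = new Document();
        doc.add(new StringField("id", Integer.toString(id++), Field.Store.YES));
        doc.add(new KnnVectorField("vector", vector, VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
      }
      writer.forceMerge(1); // make the merge cost show up explicitly
      System.out.printf("indexed %d docs at %d dims in %.1fs%n",
          id, DIMS, (System.nanoTime() - start) / 1e9);
    }
  }
}

Running that three times on the same box and recording wall-clock time, peak heap and merge behavior at each dimension would give us the numbers to compare.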
I can help in writing this, but not until after Easter.

Dawid

On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <jpou...@gmail.com> wrote:
>
> As Dawid pointed out earlier on this thread, this is the rule for Apache
> projects: a single -1 vote on a code change is a veto and cannot be
> overridden. Furthermore, Robert is one of the people on this project who
> has worked the most on debugging subtle bugs, making Lucene more robust
> and improving our test framework, so I'm listening when he voices quality
> concerns.
>
> The argument against removing/raising the limit that resonates with me
> the most is that it is a one-way door. As MikeS highlighted earlier on
> this thread, implementations may want to take advantage of the fact that
> there is a limit at some point too. This is why I don't want to remove
> the limit and would prefer a slight increase, such as 2048 as suggested
> in the original issue, which would enable most of the things that users
> who have been asking about raising the limit would like to do.
>
> I agree that the merge-time memory usage and slow indexing rate are not
> great. But it's still possible to index multi-million vector datasets
> with a 4GB heap without hitting OOMEs regardless of the number of
> dimensions, and the feedback I'm seeing is that many users are still
> interested in indexing multi-million vector datasets despite the slow
> indexing rate. I wish we could do better, and vector indexing is
> certainly more expert than text indexing, but it is still usable in my
> opinion. I understand how giving Lucene more information about vectors
> prior to indexing (e.g. clustering information, as Jim pointed out) could
> help make merging faster and more memory-efficient, but I would really
> like to avoid making it a requirement for indexing vectors, as it also
> makes this feature much harder to use.
>
> On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
> <a.benede...@sease.io> wrote:
> >
> > I am very attentive to listening to opinions, but I am unconvinced
> > here, and I am not sure that a single person's opinion should be
> > allowed to be detrimental to such an important project.
> >
> > The limit, as far as I know, is literally just raising an exception.
> > Removing it won't alter in any way the current performance for users
> > in low-dimensional space. Removing it will just enable more users to
> > use Lucene.
> >
> > If new users in certain situations are unhappy with the performance,
> > they may contribute improvements. This is how you make progress.
> >
> > If it's a reputation thing, trust me that not allowing users to play
> > with high-dimensional space will damage it equally.
> >
> > To me it's really a no-brainer. Removing the limit and enabling people
> > to use high-dimensional vectors will take minutes. Improving the HNSW
> > implementation can take months. Pick one to begin with...
> >
> > And there's no one paying me here, no company interest whatsoever;
> > actually, I pay people to contribute. I am just convinced it's a good
> > idea.
> >
> > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
> >>
> >> I disagree with your categorization. I put in plenty of work and
> >> experienced plenty of pain myself, writing tests and fighting these
> >> issues, after I saw that, two releases in a row, vector indexing fell
> >> over and hit integer overflows etc. on small datasets:
> >>
> >> https://github.com/apache/lucene/pull/11905
> >>
> >> Attacking me isn't helping the situation.
> >>
> >> PS: when I said "the one guy who wrote the code" I didn't mean it in
> >> any kind of demeaning fashion, really. I meant to describe the current
> >> state of usability with respect to indexing a few million docs with
> >> high dimensions. You can scroll up the thread and see that at least
> >> one other committer on the project experienced similar pain to mine.
> >> Then, think about users who aren't committers trying to use the
> >> functionality!
> >>
> >> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> wrote:
> >> >
> >> > What you said about increasing dimensions requiring a bigger RAM
> >> > buffer on merge is wrong. That's the point I was trying to make.
> >> > Your concerns about merge costs are not wrong, but your conclusion
> >> > that we need to limit dimensions is not justified.
> >> >
> >> > You complain that HNSW sucks and doesn't scale, but when I show that
> >> > it scales linearly with dimension you just ignore that and complain
> >> > about something entirely different.
> >> >
> >> > You demand that people run all kinds of tests to prove you wrong,
> >> > but when they do, you don't listen, and you won't put in the work
> >> > yourself or you complain that it's too hard.
> >> >
> >> > Then you complain about people not meeting you halfway. Wow.
> >> >
> >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
> >> >>
> >> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
> >> >> <michael.wech...@wyona.com> wrote:
> >> >> >
> >> >> > What exactly do you consider reasonable?
> >> >>
> >> >> Let's begin a real discussion by being HONEST about the current
> >> >> status. Please put political correctness or your own company's
> >> >> wishes aside; we know it's not in a good state.
> >> >>
> >> >> The current status is that the one guy who wrote the code can set a
> >> >> multi-gigabyte RAM buffer and index a small dataset with 1024
> >> >> dimensions in HOURS (I didn't ask what hardware).
> >> >>
> >> >> My concern is everyone else except the one guy; I want it to be
> >> >> usable. Increasing dimensions just means an even bigger
> >> >> multi-gigabyte RAM buffer and a bigger heap to avoid OOM on merge.
> >> >> It is also a permanent backwards-compatibility decision: we have to
> >> >> support it once we do this, and we can't just say "oops" and flip
> >> >> it back.
> >> >>
> >> >> It is unclear to me whether the multi-gigabyte RAM buffer is really
> >> >> there to avoid merges because they are so slow and it would take
> >> >> DAYS otherwise, or to avoid merges so it doesn't hit OOM. Also,
> >> >> from personal experience, it takes trial and error (meaning
> >> >> experiencing OOM on merge!!!) before you get those heap values
> >> >> correct for your dataset. This usually means starting over, which
> >> >> is frustrating and wastes more time.
> >> >>
> >> >> Jim mentioned some ideas about the memory usage in IndexWriter;
> >> >> that seems like a good idea to me. Maybe the multi-gigabyte RAM
> >> >> buffer can be avoided in this way and performance improved by
> >> >> writing bigger segments with Lucene's defaults. But this doesn't
> >> >> mean we can simply ignore the horrors of what happens on merge.
> >> >> Merging needs to scale so that indexing really scales.
> >> >>
> >> >> At least it shouldn't spike RAM on trivial data amounts and cause
> >> >> OOM, and it definitely shouldn't burn hours and hours of CPU in
> >> >> O(n^2) fashion when indexing.
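Purely to make the knobs in that exchange concrete, here is a minimal sketch of the writer settings involved, assuming Lucene 9.x behavior and defaults as I understand them; the concrete values are illustrative examples of the trade-off, not recommendations:

import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class VectorMergeTuning {

  static IndexWriterConfig sketchConfig() {
    IndexWriterConfig iwc = new IndexWriterConfig();

    // Lucene's default flush threshold (DEFAULT_RAM_BUFFER_SIZE_MB = 16) means
    // many small segments and therefore many HNSW graph merges; the workaround
    // described above is a multi-gigabyte setRAMBufferSizeMB(...) instead, at
    // the cost of sizing the heap by trial and error.
    System.out.println("default RAM buffer MB = " + IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB);

    // If the small default is kept, the merge cost can at least be bounded:
    // cap how large a merged segment may grow and how many merges run at once,
    // which limits how much graph data has to be rebuilt and held in memory
    // for any single merge.
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setMaxMergedSegmentMB(8 * 1024);
    iwc.setMergePolicy(tmp);

    ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
    cms.setMaxMergesAndThreads(2, 1); // at most 1 merge thread, 2 queued merges
    iwc.setMergeScheduler(cms);

    return iwc;
  }
}

None of this changes the underlying problem being raised here (merges still rebuild the per-segment graphs); it only changes when and how much of that cost is paid at once.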