Well, it's a final variable. But you could maybe extend KnnVectorField to get around this limit? I think that's the only place it's currently enforced.
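To make that suggestion concrete, here is a rough, untested sketch against the Lucene 9.x API, assuming the 1024-dimension check is only applied when KnnVectorField's convenience constructors build their FieldType (an assumption, and version-dependent). On the reflection question: it is unlikely to help, because a `static final int` initialized to a constant is inlined by javac wherever it is used, so changing the field at run time would not change the already-compiled checks (and recent JDKs refuse to modify static final fields via reflection anyway).

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.VectorEncoding;
import org.apache.lucene.index.VectorSimilarityFunction;

public class HighDimKnnField {

  // Build a FieldType that reports a dimension above 1024 by overriding the
  // accessors directly instead of going through the setter that throws.
  // Assumption: the max-dimension check lives only in that setter / in the
  // FieldType that KnnVectorField creates for you, not in the codec itself.
  static FieldType vectorType(int dims, VectorSimilarityFunction similarity) {
    FieldType type =
        new FieldType() {
          @Override
          public int vectorDimension() {
            return dims;
          }

          @Override
          public VectorEncoding vectorEncoding() {
            return VectorEncoding.FLOAT32;
          }

          @Override
          public VectorSimilarityFunction vectorSimilarityFunction() {
            return similarity;
          }
        };
    type.freeze();
    return type;
  }

  public static void main(String[] args) {
    float[] embedding = new float[1536]; // an illustrative size above the 1024 cap
    Document doc = new Document();
    // Expert constructor: supply our own FieldType rather than letting
    // KnnVectorField derive one (which is where the limit is enforced).
    doc.add(
        new KnnVectorField(
            "embedding",
            embedding,
            vectorType(embedding.length, VectorSimilarityFunction.COSINE)));
  }
}
```

Whether this actually gets a >1024-dimension vector all the way through indexing depends on the Lucene version; if the field constructor or the codec re-validates the dimension, the sketch would need adjusting.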
On Sat, Apr 8, 2023 at 3:54 PM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>
> Can the limit be raised using Java reflection at run time? Or is there more to it that needs to be changed?
>
> On Sun, 9 Apr, 2023, 12:58 am Alessandro Benedetti, <a.benede...@sease.io> wrote:
>>
>> I am very attentive to listening to opinions, but I am unconvinced here, and I am not sure that a single person's opinion should be allowed to be detrimental to such an important project.
>>
>> The limit, as far as I know, is literally just raising an exception. Removing it won't alter in any way the current performance for users in low-dimensional space. Removing it will just enable more users to use Lucene.
>>
>> If new users in certain situations are unhappy with the performance, they may contribute improvements. This is how you make progress.
>>
>> If it's a reputation thing, trust me that not allowing users to play with high-dimensional space will damage it equally.
>>
>> To me it's really a no-brainer. Removing the limit and enabling people to use high-dimensional vectors will take minutes. Improving the HNSW implementation can take months. Pick one to begin with...
>>
>> And there's no one paying me here, no company interest whatsoever; actually, I pay people to contribute. I am just convinced it's a good idea.
>>
>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
>>>
>>> I disagree with your categorization. I put in plenty of work and experienced plenty of pain myself, writing tests and fighting these issues, after I saw that, two releases in a row, vector indexing fell over and hit integer overflows etc. on small datasets:
>>>
>>> https://github.com/apache/lucene/pull/11905
>>>
>>> Attacking me isn't helping the situation.
>>>
>>> PS: when I said "the one guy who wrote the code" I didn't mean it in any kind of demeaning fashion, really. I meant to describe the current state of usability with respect to indexing a few million docs with high dimensions. You can scroll up the thread and see that at least one other committer on the project experienced similar pain as me. Then, think about users who aren't committers trying to use the functionality!
>>>
>>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> wrote:
>>> >
>>> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>> >
>>> > You complain that HNSW sucks and doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>> >
>>> > You demand that people run all kinds of tests to prove you wrong, but when they do, you don't listen, and you won't put in the work yourself or you complain that it's too hard.
>>> >
>>> > Then you complain about people not meeting you halfway. Wow.
>>> >
>>> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
>>> >>
>>> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>> >> >
>>> >> > What exactly do you consider reasonable?
>>> >>
>>> >> Let's begin a real discussion by being HONEST about the current status. Please put political correctness and your own company's wishes aside; we know it's not in a good state.
>>> >>
>>> >> The current status is that the one guy who wrote the code can set a multi-gigabyte ram buffer and index a small dataset with 1024 dimensions in HOURS (I didn't ask what hardware).
>>> >>
>>> >> My concern is everyone else except the one guy; I want it to be usable. Increasing dimensions just means an even bigger multi-gigabyte ram buffer and a bigger heap to avoid OOM on merge. It is also a permanent backwards-compatibility decision: we have to support it once we do this, and we can't just say "oops" and flip it back.
>>> >>
>>> >> It is unclear to me whether the multi-gigabyte ram buffer is really to avoid merges because they are so slow and it would be DAYS otherwise, or whether it's to avoid merges so it doesn't hit OOM. Also, from personal experience, it takes trial and error (meaning experiencing OOM on merge!!!) before you get those heap values correct for your dataset. This usually means starting over, which is frustrating and wastes more time.
>>> >>
>>> >> Jim mentioned some ideas about the memory usage in IndexWriter; that seems to me like a good idea. Maybe the multi-gigabyte ram buffer can be avoided this way and performance improved by writing bigger segments with Lucene's defaults. But this doesn't mean we can simply ignore the horrors of what happens on merge. Merging needs to scale so that indexing really scales.
>>> >>
>>> >> At least it shouldn't spike RAM on trivial data amounts and cause OOM, and it definitely shouldn't burn hours and hours of CPU in O(n^2) fashion when indexing.
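For readers who haven't tuned this before, the "multi-gigabyte ram buffer" being argued about is the IndexWriter flush buffer. A minimal sketch of the settings in question follows; the index path and the 2 GB figure are illustrative only, not recommendations from this thread.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BigBufferWriter {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

    // The knob the thread refers to: buffer this many MB of indexed documents
    // in memory before flushing a segment. Larger values mean fewer, bigger
    // initial segments and therefore fewer HNSW graph merges later, at the
    // cost of a correspondingly large heap.
    config.setRAMBufferSizeMB(2048);

    // Flush based on RAM usage only, not on a document count.
    config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);

    try (IndexWriter writer =
        new IndexWriter(FSDirectory.open(Paths.get("/tmp/vector-index")), config)) {
      // ... add documents carrying vector fields here ...
      writer.commit();
    }
  }
}
```

As the thread notes, sizing this buffer (and the heap around it) is currently trial and error for high-dimensional vector data, and an undersized heap tends to show up only as an OOM during merge.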