Well, it's a final variable. But you could maybe extend KnnVectorField to get around this limit? I think that's the only place it's currently enforced.
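To make that suggestion concrete, here is a rough, untested sketch against the Lucene 9.x API, assuming the 1024-dimension check is only applied when KnnVectorField's convenience constructors build their FieldType (an assumption, and version-dependent). On the reflection question: it is unlikely to help, because a `static final int` initialized to a constant is inlined by javac wherever it is used, so changing the field at run time would not change the already-compiled checks (and recent JDKs refuse to modify static final fields via reflection anyway).

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.index.VectorEncoding;
import org.apache.lucene.index.VectorSimilarityFunction;

public class HighDimKnnField {

  // Build a FieldType that reports a dimension above 1024 by overriding the
  // accessors directly instead of going through the setter that throws.
  // Assumption: the max-dimension check lives only in that setter / in the
  // FieldType that KnnVectorField creates for you, not in the codec itself.
  static FieldType vectorType(int dims, VectorSimilarityFunction similarity) {
    FieldType type =
        new FieldType() {
          @Override
          public int vectorDimension() {
            return dims;
          }

          @Override
          public VectorEncoding vectorEncoding() {
            return VectorEncoding.FLOAT32;
          }

          @Override
          public VectorSimilarityFunction vectorSimilarityFunction() {
            return similarity;
          }
        };
    type.freeze();
    return type;
  }

  public static void main(String[] args) {
    float[] embedding = new float[1536]; // an illustrative size above the 1024 cap
    Document doc = new Document();
    // Expert constructor: supply our own FieldType rather than letting
    // KnnVectorField derive one (which is where the limit is enforced).
    doc.add(
        new KnnVectorField(
            "embedding",
            embedding,
            vectorType(embedding.length, VectorSimilarityFunction.COSINE)));
  }
}
```

Whether this actually gets a >1024-dimension vector all the way through indexing depends on the Lucene version; if the field constructor or the codec re-validates the dimension, the sketch would need adjusting.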
On Sat, Apr 8, 2023 at 3:54 PM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:
>
> Can the limit be raised using Java reflection at run time? Or is there more to it that needs to be changed?
>
> On Sun, 9 Apr, 2023, 12:58 am Alessandro Benedetti, <a.benede...@sease.io> wrote:
>>
>> I am very attentive to listening to opinions, but I am unconvinced here, and I am not sure that a single person's opinion should be allowed to be detrimental to such an important project.
>>
>> The limit, as far as I know, is literally just raising an exception. Removing it won't alter in any way the current performance for users in low-dimensional space. Removing it will just enable more users to use Lucene.
>>
>> If new users in certain situations are unhappy with the performance, they may contribute improvements. This is how you make progress.
>>
>> If it's a reputation thing, trust me that not allowing users to play with high-dimensional space will damage it equally.
>>
>> To me it's really a no-brainer. Removing the limit and enabling people to use high-dimensional vectors will take minutes. Improving the HNSW implementation can take months. Pick one to begin with...
>>
>> And there's no one paying me here, no company interest whatsoever; actually, I pay people to contribute. I am just convinced it's a good idea.
>>
>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
>>>
>>> I disagree with your categorization. I put in plenty of work and experienced plenty of pain myself, writing tests and fighting these issues, after I saw that, two releases in a row, vector indexing fell over and hit integer overflows etc. on small datasets:
>>>
>>> https://github.com/apache/lucene/pull/11905
>>>
>>> Attacking me isn't helping the situation.
>>>
>>> PS: when I said "the one guy who wrote the code" I didn't mean it in any kind of demeaning fashion, really. I meant to describe the current state of usability with respect to indexing a few million docs with high dimensions. You can scroll up the thread and see that at least one other committer on the project experienced similar pain as me. Then, think about users who aren't committers trying to use the functionality!
>>>
>>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> wrote:
>>> >
>>> > What you said about increasing dimensions requiring a bigger ram buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
>>> >
>>> > You complain that HNSW sucks and doesn't scale, but when I show it scales linearly with dimension you just ignore that and complain about something entirely different.
>>> >
>>> > You demand that people run all kinds of tests to prove you wrong, but when they do, you don't listen, and you won't put in the work yourself or you complain that it's too hard.
>>> >
>>> > Then you complain about people not meeting you halfway. Wow.
>>> >
>>> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
>>> >>
>>> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner <michael.wech...@wyona.com> wrote:
>>> >> >
>>> >> > What exactly do you consider reasonable?
>>> >>
>>> >> Let's begin a real discussion by being HONEST about the current status. Please put political correctness and your own company's wishes aside; we know it's not in a good state.
>>> >>
>>> >> The current status is that the one guy who wrote the code can set a multi-gigabyte ram buffer and index a small dataset with 1024 dimensions in HOURS (I didn't ask what hardware).
>>> >>
>>> >> My concern is everyone else except the one guy; I want it to be usable. Increasing dimensions just means an even bigger multi-gigabyte ram buffer and a bigger heap to avoid OOM on merge. It is also a permanent backwards-compatibility decision: we have to support it once we do this, and we can't just say "oops" and flip it back.
>>> >>
>>> >> It is unclear to me whether the multi-gigabyte ram buffer is really to avoid merges because they are so slow and it would be DAYS otherwise, or whether it's to avoid merges so it doesn't hit OOM. Also, from personal experience, it takes trial and error (meaning experiencing OOM on merge!!!) before you get those heap values correct for your dataset. This usually means starting over, which is frustrating and wastes more time.
>>> >>
>>> >> Jim mentioned some ideas about the memory usage in IndexWriter; that seems to me like a good idea. Maybe the multi-gigabyte ram buffer can be avoided this way and performance improved by writing bigger segments with Lucene's defaults. But this doesn't mean we can simply ignore the horrors of what happens on merge. Merging needs to scale so that indexing really scales.
>>> >>
>>> >> At least it shouldn't spike RAM on trivial data amounts and cause OOM, and it definitely shouldn't burn hours and hours of CPU in O(n^2) fashion when indexing.
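For readers who haven't tuned this before, the "multi-gigabyte ram buffer" being argued about is the IndexWriter flush buffer. A minimal sketch of the settings in question follows; the index path and the 2 GB figure are illustrative only, not recommendations from this thread.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BigBufferWriter {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

    // The knob the thread refers to: buffer this many MB of indexed documents
    // in memory before flushing a segment. Larger values mean fewer, bigger
    // initial segments and therefore fewer HNSW graph merges later, at the
    // cost of a correspondingly large heap.
    config.setRAMBufferSizeMB(2048);

    // Flush based on RAM usage only, not on a document count.
    config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH);

    try (IndexWriter writer =
        new IndexWriter(FSDirectory.open(Paths.get("/tmp/vector-index")), config)) {
      // ... add documents carrying vector fields here ...
      writer.commit();
    }
  }
}
```

As the thread notes, sizing this buffer (and the heap around it) is currently trial and error for high-dimensional vector data, and an undersized heap tends to show up only as an OOM during merge.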