What I see so far:

   1. Much positive support for raising the limit
   2. Slightly less support for removing it or making it configurable
   3. A single veto, which argues that an (as yet undefined) performance
   standard must be met before raising the limit
   4. Hot tempers (various) making this discussion difficult

As I understand it, vetoes must have technical merit. I'm not sure this
veto rises to "technical merit", on two counts:

   1. No performance standard is given, so the veto cannot technically be
   met. Without hard criteria, it's a moving target.
   2. It appears to encode a valuation of the user's time, and that
   valuation is really up to the user. Some users may consider 2 hours
   useless and not worth it; others might happily wait 2 hours. This is not
   a technical decision, it's a business decision about the relative value
   of the time invested vs. the value of the result. If I can cure cancer
   by indexing for a year, that might be worth it... (hyperbole, of course).

Things I would consider to have technical merit, but which I don't hear
being argued:

   1. Impact on the speed of **other** indexing operations (devaluation of
   other functionality).
   2. Actual scenarios that work when the limit is low and fail when the
   limit is high (new failure on the same data with the limit raised).

One thing that might or might not have technical merit:

   1. If someone feels there is a lack of documentation of the
   cost/performance implications of using large vectors, possibly including
   reproducible benchmarks establishing the scaling behavior (there seems
   to be disagreement on O(n) vs O(n^2)); a harness along the lines
   sketched below would settle that question.
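
For concreteness, a minimal harness like this would let anyone check the
growth rate empirically. This is a sketch, not luceneutil: the class name
and parameters are mine, it assumes Lucene 9.x's KnnVectorField, and
random vectors are only a rough stand-in for real embeddings:

    import java.nio.file.Files;
    import java.util.Random;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class VectorScalingCheck {
      public static void main(String[] args) throws Exception {
        int dim = 1024;          // rerun with 512/1024; 2048 needs the bumped limit
        int batchSize = 100_000; // 10 batches = 1M docs total
        Random random = new Random(42);
        try (FSDirectory dir = FSDirectory.open(Files.createTempDirectory("vec"));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
          for (int batch = 0; batch < 10; batch++) {
            long start = System.nanoTime();
            for (int i = 0; i < batchSize; i++) {
              float[] vector = new float[dim];
              for (int j = 0; j < dim; j++) {
                vector[j] = random.nextFloat();
              }
              Document doc = new Document();
              doc.add(new KnnVectorField("vec", vector,
                  VectorSimilarityFunction.EUCLIDEAN));
              writer.addDocument(doc);
            }
            writer.commit(); // include flush cost in each measurement
            System.out.printf("batch %d: %.1fs%n",
                batch, (System.nanoTime() - start) / 1e9);
          }
          writer.forceMerge(1); // merge cost is the contested part; time it too
        }
      }
    }

Near-flat per-batch times at a fixed dimension suggest roughly O(n);
steadily growing times suggest something worse.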

The users *should* know what they are getting into, but if the cost is
worth it to them, they should be able to pay it without forking the
project. If this veto causes a fork, that's not good.

On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov <msoko...@gmail.com> wrote:

> We do have a dataset built from Wikipedia in luceneutil. It comes in 100-
> and 300-dimensional varieties, and we can easily enough generate large
> numbers of vector documents from the articles data. To go higher we could
> concatenate vectors from that, and I believe the performance numbers would
> be plausible.
>
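For what it's worth, the concatenation Mike describes is trivial to
sketch (class and method names here are mine, purely illustrative):

    import java.util.Arrays;

    public class VectorConcat {
      // Build a higher-dimensional test vector by concatenating existing
      // ones, e.g. two 300-d GloVe vectors -> one 600-d vector.
      static float[] concat(float[] a, float[] b) {
        float[] out = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
      }
    }

One caveat: concatenation preserves per-dimension value distributions but
not cross-dimension structure, so treat the resulting numbers as
indicative rather than exact.
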
> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
>
>> Can we set up a branch in which the limit is bumped to 2048, then have
>> a realistic, free data set (Wikipedia sample or something) that has,
>> say, 5 million docs and vectors created using public data (GloVe
>> pre-trained embeddings or the like)? We could then run indexing on the
>> same hardware with 512, 1024 and 2048 and see what the numbers, limits
>> and behavior actually are.
>>
>> I can help in writing this but not until after Easter.
>>
>>
>> Dawid
>>
>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <jpou...@gmail.com> wrote:
>> >
>> > As Dawid pointed out earlier on this thread, this is the rule for
>> > Apache projects: a single -1 vote on a code change is a veto and
>> > cannot be overridden. Furthermore, Robert is one of the people on this
>> > project who worked the most on debugging subtle bugs, making Lucene
>> > more robust and improving our test framework, so I'm listening when he
>> > voices quality concerns.
>> >
>> > The argument against removing/raising the limit that resonates with me
>> > the most is that it is a one-way door. As MikeS highlighted earlier on
>> > this thread, implementations may want to take advantage of the fact
>> > that there is a limit at some point too. This is why I don't want to
>> > remove the limit and would prefer a slight increase, such as 2048 as
>> > suggested in the original issue, which would enable most of the things
>> > that users who have been asking about raising the limit would like to
>> > do.
>> >
>> > I agree that the merge-time memory usage and slow indexing rate are
>> > not great. But it's still possible to index multi-million vector
>> > datasets with a 4GB heap without hitting OOMEs regardless of the
>> > number of dimensions, and the feedback I'm seeing is that many users
>> > are still interested in indexing multi-million vector datasets despite
>> > the slow indexing rate. I wish we could do better, and vector indexing
>> > is certainly more of an expert feature than text indexing, but it is
>> > still usable in my opinion. I understand how giving Lucene more information about
>> > vectors prior to indexing (e.g. clustering information as Jim pointed
>> > out) could help make merging faster and more memory-efficient, but I
>> > would really like to avoid making it a requirement for indexing
>> > vectors as it also makes this feature much harder to use.
>> >
>> > On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
>> > <a.benede...@sease.io> wrote:
>> > >
>> > > I listen very attentively to opinions, but I am unconvinced here,
>> > > and I am not sure that a single person's opinion should be allowed
>> > > to be detrimental to such an important project.
>> > >
>> > > The limit, as far as I know, is literally just raising an exception.
>> > > Removing it won't alter the current performance in any way for users
>> > > in low-dimensional spaces.
>> > > Removing it will just enable more users to use Lucene.
>> > >
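That matches my reading. Conceptually, the guard is just the following
(a sketch; the real constant and check live in Lucene's vector
field/codec code, and the names here are illustrative):

    public class DimensionGuard {
      static final int MAX_DIMENSIONS = 1024; // the limit under discussion

      // Shape of the check being debated; nothing else changes for
      // vectors that stay under the limit.
      static void checkDimension(int dimension) {
        if (dimension > MAX_DIMENSIONS) {
          throw new IllegalArgumentException(
              "vector dimension must be <= " + MAX_DIMENSIONS
                  + ", got " + dimension);
        }
      }
    }

So removing or raising it is mechanically simple; the debate is about
the consequences.
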
>> > > If new users are unhappy with the performance in certain
>> > > situations, they may contribute improvements.
>> > > This is how you make progress.
>> > >
>> > > If it's a reputation thing, trust me that not allowing users to
>> > > play with high-dimensional spaces will damage it equally.
>> > >
>> > > To me it's really a no-brainer.
>> > > Removing the limit and enabling people to use high-dimensional
>> > > vectors will take minutes.
>> > > Improving the HNSW implementation can take months.
>> > > Pick one to begin with...
>> > >
>> > > And there's no one paying me here, no company interest whatsoever;
>> > > actually, I pay people to contribute. I am just convinced it's a
>> > > good idea.
>> > >
>> > >
>> > > On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
>> > >>
>> > >> I disagree with your categorization. I put in plenty of work and
>> > >> experienced plenty of pain myself, writing tests and fighting these
>> > >> issues, after I saw that, two releases in a row, vector indexing fell
>> > >> over and hit integer overflows etc. on small datasets:
>> > >>
>> > >> https://github.com/apache/lucene/pull/11905
>> > >>
>> > >> Attacking me isn't helping the situation.
>> > >>
>> > >> PS: when I said the "one guy who wrote the code", I didn't mean it in
>> > >> any kind of demeaning fashion, really. I meant to describe the current
>> > >> state of usability with respect to indexing a few million docs with
>> > >> high dimensions. You can scroll up the thread and see that at least
>> > >> one other committer on the project experienced similar pain as me.
>> > >> Then, think about users who aren't committers trying to use the
>> > >> functionality!
>> > >>
>> > >> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com>
>> > >> wrote:
>> > >> >
>> > >> > What you said about increasing dimensions requiring a bigger RAM
>> > >> > buffer on merge is wrong. That's the point I was trying to make.
>> > >> > Your concerns about merge costs are not wrong, but your conclusion
>> > >> > that we need to limit dimensions is not justified.
>> > >> >
>> > >> > You complain that HNSW sucks, that it doesn't scale, but when I
>> > >> > show it scales linearly with dimension, you just ignore that and
>> > >> > complain about something entirely different.
>> > >> >
>> > >> > You demand that people run all kinds of tests to prove you wrong,
>> > >> > but when they do, you don't listen; you won't put in the work
>> > >> > yourself, or you complain that it's too hard.
>> > >> >
>> > >> > Then you complain about people not meeting you halfway. Wow.
>> > >> >
>> > >> > On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com>
>> > >> > wrote:
>> > >> >>
>> > >> >> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
>> > >> >> <michael.wech...@wyona.com> wrote:
>> > >> >> >
>> > >> >> > What exactly do you consider reasonable?
>> > >> >>
>> > >> >> Let's begin a real discussion by being HONEST about the current
>> > >> >> status. Please put political correctness and your own company's
>> > >> >> wishes aside; we know it's not in a good state.
>> > >> >>
>> > >> >> The current status is that the one guy who wrote the code can
>> > >> >> set a multi-gigabyte RAM buffer and index a small dataset with
>> > >> >> 1024 dimensions in HOURS (I didn't ask what hardware).
>> > >> >>
>> > >> >> My concern is everyone else except the one guy; I want it to be
>> > >> >> usable. Increasing dimensions just means an even bigger
>> > >> >> multi-gigabyte RAM buffer and a bigger heap to avoid OOM on
>> > >> >> merge.
>> > >> >> It is also a permanent backwards-compatibility decision: we have
>> > >> >> to support it once we do this, and we can't just say "oops" and
>> > >> >> flip it back.
>> > >> >>
>> > >> >> It is unclear to me whether the multi-gigabyte RAM buffer is
>> > >> >> really to avoid merges because they are so slow (it would be
>> > >> >> DAYS otherwise), or to avoid merges so it doesn't hit OOM.
>> > >> >> Also, from personal experience, it takes trial and error (meaning
>> > >> >> experiencing OOM on merge!!!) before you get those heap values
>> > >> >> correct for your dataset. This usually means starting over, which
>> > >> >> is frustrating and wastes more time.
>> > >> >>
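For readers following along, the knob in question is the IndexWriter RAM
buffer, and the trial and error described above is sizing it against the
heap. A sketch; the 2048 value and the path are illustrative, not
recommendations:

    import java.nio.file.Paths;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class TunedWriter {
      public static void main(String[] args) throws Exception {
        // A large RAM buffer defers flushes (fewer, bigger segments and
        // fewer merges) at the cost of heap; too small and merge activity
        // multiplies, too large and you risk OOM.
        IndexWriterConfig config = new IndexWriterConfig();
        config.setRAMBufferSizeMB(2048);
        try (FSDirectory dir = FSDirectory.open(Paths.get("vector-index"));
             IndexWriter writer = new IndexWriter(dir, config)) {
          // ... add vector documents; -Xmx must leave headroom for merges
          // on top of the buffer, which is where the trial and error is.
        }
      }
    }
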
>> > >> >> Jim mentioned some ideas about the memory usage in IndexWriter;
>> > >> >> that seems like a good idea to me. Maybe the multi-gigabyte RAM
>> > >> >> buffer can be avoided this way, and performance improved, by
>> > >> >> writing bigger segments with Lucene's defaults. But this doesn't
>> > >> >> mean we can simply ignore the horrors of what happens on merge.
>> > >> >> Merging needs to scale so that indexing really scales.
>> > >> >>
>> > >> >> At least it shouldn't spike RAM on trivial data amounts and
>> > >> >> cause OOM, and it definitely shouldn't burn hours and hours of
>> > >> >> CPU in O(n^2) fashion when indexing.
>> > >> >>
>> > >>
>> >
>> >
>> > --
>> > Adrien
>> >
>>

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
