My problem is that it impacts the default codec, which is covered by our
backwards compatibility policy for many years. We can't just let the user
determine backwards compatibility with a sysprop. How would CheckIndex work?
We have to have bounds, and we also have to allow for more performant
implementations that might have different limitations. And I'm pretty sure
we will want a faster implementation than the current one in the future,
and it will probably have different limits.

For other codecs, it is fine to have a different limit, as I already said,
since it is implementation-dependent. And honestly, the stuff in lucene/codecs
can be more "fast and loose" because it doesn't require the extensive index
back-compat guarantee.

Again, my paramount concern is that index back-compat guarantee. When it
comes to limits, the proper way is not to just keep bumping them without
technical reasons; instead, the correct approach is to fix the technical
problems and make the limits irrelevant. A great example (merged this
morning):
https://github.com/apache/lucene/commit/f53eb28af053d7612f7e4d1b2de05d33dc410645


On Tue, May 16, 2023 at 10:49 PM David Smiley <dsmi...@apache.org> wrote:

> Robert, I have not heard from you (or anyone) an argument against System
> property based configurability (as I described in Option 4 via a System
> property).  Uwe notes wisely some care must be taken to ensure it actually
> works.  Sure, of course.  What concerns do you have with this?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Tue, May 16, 2023 at 9:50 PM Robert Muir <rcm...@gmail.com> wrote:
>
>> By the way, I agree with the idea to MOVE THE LIMIT UNCHANGED to the
>> hnsw-specific code.
>>
>> This way, someone can write an alternative codec with vectors using some
>> other, completely different approach that incorporates a different, more
>> appropriate limit (maybe lower, maybe higher) depending upon its
>> tradeoffs. We should encourage this, as I think it is the "only true fix" to
>> the scalability issues: use a scalable algorithm! Also, alternative codecs
>> don't force the project into many years of index backwards compatibility,
>> which is really my paramount concern. We can lock ourselves into a truly
>> bad place and become irrelevant (especially with scalar code implementing
>> all this vector stuff, it is really senseless). In the meantime, I suggest
>> we try to reduce pain for the default codec with the current implementation
>> if possible. If that is not possible, we need a new codec that performs.
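The idea of moving the limit into codec-specific code can be sketched as a per-format hook; the class and method names below are hypothetical illustrations, not Lucene's actual API:

```java
// Hypothetical sketch: each vector format declares its own dimension
// limit, so the cap lives with the implementation rather than in the
// top-level vector API. All names here are illustrative, not Lucene's.
abstract class VectorFormatSketch {
    /** Maximum vector dimensions this implementation supports. */
    abstract int maxDimensions();
}

class HnswFormatSketch extends VectorFormatSketch {
    @Override
    int maxDimensions() {
        return 1024; // the current default limit
    }
}

class ExperimentalFormatSketch extends VectorFormatSketch {
    @Override
    int maxDimensions() {
        return 4096; // an alternative codec with different tradeoffs
    }
}
```

An alternative codec would then enforce its own bound at indexing time, without widening the guarantee the default codec has to honor for years.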
>>
>> On Tue, May 16, 2023 at 8:53 PM Robert Muir <rcm...@gmail.com> wrote:
>>
>>> Gus, I think I have explained myself multiple times on issues and in this
>>> thread. The performance is unacceptable, everyone knows it, but nobody is
>>> talking about it.
>>> I don't need to explain myself time and time again here.
>>> You don't seem to understand the technical issues (at least you sure as
>>> fuck don't know how service loading works, or you wouldn't have opened
>>> https://github.com/apache/lucene/issues/12300 😂)
>>>
>>> I'm just the only one here completely unconstrained by any of Silicon
>>> Valley's influences, free to speak my true mind without any repercussions,
>>> so I do. I don't give any fucks about ChatGPT.
>>>
>>> I'm standing by my technical veto. If you bypass it, I'll revert the
>>> offending commit.
>>>
>>> As far as fixing the technical performance, I just opened an issue with
>>> some ideas to at least improve CPU usage by a factor of N. It does not help
>>> with the crazy heap memory usage or other issues of the KNN implementation
>>> that cause shit like OOMs on merge. But it is one step:
>>> https://github.com/apache/lucene/issues/12302
>>>
>>>
>>>
>>> On Tue, May 16, 2023 at 7:45 AM Gus Heck <gus.h...@gmail.com> wrote:
>>>
>>>> Robert,
>>>>
>>>> Can you explain in clear technical terms the standard that must be met
>>>> for performance? A benchmark that must run in X time on Y hardware, for
>>>> example (and why that test is suitable)? Or some other reproducible
>>>> criterion? So far I've heard you give an *opinion* that it's unusable, but
>>>> that's not a technical criterion; others may have a different concept of
>>>> what is usable to them.
>>>>
>>>> Forgive me if I misunderstand, but the essence of your argument has
>>>> seemed to be
>>>>
>>>> "Performance isn't good enough, therefore we should force anyone who
>>>> wants to experiment with something bigger to fork the code base to do it"
>>>>
>>>> Thus, it is necessary to have a clear, unambiguous standard for "good
>>>> enough" that anyone can verify. A clear standard would also focus efforts
>>>> at improvement.
>>>>
>>>> Where are the goal posts?
>>>>
>>>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit
>>>> is fundamentally counterproductive in an open source setting, as it will
>>>> lead to *fewer people* pushing the limits. Extremely few people are
>>>> going to get into the nitty-gritty of optimizing things unless they are
>>>> staring at code that they can prove does something interesting, but doesn't
>>>> run fast enough for their purposes. If people hit a hard limit, more of
>>>> them give up and never develop the code that will motivate them to look for
>>>> optimizations.
>>>>
>>>> -Gus
>>>>
>>>> On Tue, May 16, 2023 at 6:04 AM Robert Muir <rcm...@gmail.com> wrote:
>>>>
>>>>> I still feel -1 (veto) on increasing this limit. Sending more emails
>>>>> does not change the technical facts or make the veto go away.
>>>>>
>>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>>>> a.benede...@sease.io> wrote:
>>>>>
>>>>>> Hi all,
>>>>>> we have finalized all the options proposed by the community and we
>>>>>> are ready to vote for the preferred one and then proceed with the
>>>>>> implementation.
>>>>>>
>>>>>> *Option 1*
>>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>>> *Motivation*:
>>>>>> We are close to improving on many fronts. Given the criticality of
>>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>>> most active stewards of the project, I think we should keep working
>>>>>> toward improving the feature as is, and raise the limit only after we
>>>>>> can demonstrate improvement unambiguously.
>>>>>>
>>>>>> *Option 2*
>>>>>> Make the limit configurable, for example through a system property.
>>>>>> *Motivation*:
>>>>>> The system administrator can enforce a limit that users need to
>>>>>> respect, in line with whatever the admin has decided is acceptable
>>>>>> for them.
>>>>>> The default can stay the current one.
>>>>>> This should open the doors for Apache Solr, Elasticsearch,
>>>>>> OpenSearch, and any sort of plugin development.
>>>>>>
>>>>>> *Option 3*
>>>>>> Move the max dimension limit down to an HNSW-specific
>>>>>> implementation. Once there, this limit would not bind any other potential
>>>>>> vector engine alternative/evolution.
>>>>>> *Motivation:* There seem to be contradictory performance
>>>>>> interpretations of the current HNSW implementation. Some consider its
>>>>>> performance OK, some do not, and it depends on the target data set and
>>>>>> use case. Increasing the max dimension limit where it currently lives (in
>>>>>> the top-level FloatVectorValues) would not allow potential alternatives
>>>>>> (e.g. for other use cases) to be based on a lower limit.
>>>>>>
>>>>>> *Option 4*
>>>>>> Make it configurable and move it to an appropriate place.
>>>>>> In particular, a
>>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>>> enough.
>>>>>> *Motivation*:
>>>>>> Both are good and not mutually exclusive and could happen in any
>>>>>> order.
>>>>>> Someone suggested perfecting what the _default_ limit should be, but
>>>>>> I've not seen an argument _against_ configurability, especially in this
>>>>>> form: a toggle that doesn't bind Lucene's APIs in any way.
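The Integer.getInteger mechanism described in Option 4 can be sketched as follows; the property name comes from the proposal above, while the class name is a hypothetical illustration:

```java
// Sketch of Option 4: read the limit once from a system property,
// falling back to the current default of 1024. Integer.getInteger
// returns the fallback when the property is unset or unparsable.
public class MaxDimensionsSketch {
    static final int DEFAULT_MAX_DIMENSIONS = 1024;

    static int maxDimensions() {
        return Integer.getInteger("lucene.hnsw.maxDimensions", DEFAULT_MAX_DIMENSIONS);
    }

    public static void main(String[] args) {
        // Prints 1024 unless -Dlucene.hnsw.maxDimensions=... is passed.
        System.out.println(maxDimensions());
    }
}
```

Because the property is read through a single accessor, bounds checking (e.g. clamping to a hard ceiling, as raised earlier in the thread) could be added in one place without touching Lucene's public APIs.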
>>>>>>
>>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>>> implementation.
>>>>>> --------------------------
>>>>>> *Alessandro Benedetti*
>>>>>> Director @ Sease Ltd.
>>>>>> *Apache Lucene/Solr Committer*
>>>>>> *Apache Solr PMC Member*
>>>>>>
>>>>>> e-mail: a.benede...@sease.io
>>>>>>
>>>>>>
>>>>>> *Sease* - Information Retrieval Applied
>>>>>> Consulting | Training | Open Source
>>>>>>
>>>>>> Website: Sease.io <http://sease.io/>
>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>>> <https://github.com/seaseltd>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> http://www.needhamsoftware.com (work)
>>>> http://www.the111shift.com (play)
>>>>
>>>
