As a reminder, this isn't the Disney Plus channel, and I'll use strong
language if I fucking want to.



On Wed, May 17, 2023, 4:45 AM Alessandro Benedetti <a.benede...@sease.io>
wrote:

> Robert,
> A gentle reminder of the Code of Conduct:
> https://www.apache.org/foundation/policies/conduct.html.
> I've read many e-mails about this topic that ended up in a tone that is not
> up to the standard of a healthy community.
> To be specific and pragmatic: the way you addressed Gus here, the way you
> mocked the rest of our community as "ChatGPT minions" of sorts, and the use
> of profanity (the f-word) are neither sensible nor acceptable here.
> Even if you feel heated, I recommend separating such emotions from what
> you write and always being respectful of other people with different ideas.
> You are an intelligent person; don't ruin your time (and others' time) on a
> wonderful project such as Lucene by letting excessive emotion blind you.
> Please remember that the vast majority of us participate in this community
> purely on a volunteer basis.
> So when I spend time on this, I like to see respect, thoughtful discussions,
> and intellectual challenges; the time we spend together must be peaceful and
> positive.
>
> The community comes first, and here we are collecting what the community
> would like for this feature.
> Your vote and opinion are extremely valuable, but at this stage we are here
> to listen to the community rather than to impose a personal idea.
> Once we observe the dominant need, we'll proceed with a contribution.
> If you disagree with such a contribution and bring technical evidence that
> supports a convincing veto, we (the Lucene community) will listen and
> improve/change the contribution.
> If you disagree with such a contribution and bring an unconvincing veto,
> we (the Lucene community) will proceed with steps that are appropriate for
> the situation.
> Let's also remember that the project and the community come first; Lucene is
> an Apache project, not mine or yours, for that matter.
>
> Cheers
>
> --------------------------
> *Alessandro Benedetti*
> Director @ Sease Ltd.
> *Apache Lucene/Solr Committer*
> *Apache Solr PMC Member*
>
> e-mail: a.benede...@sease.io
>
>
> *Sease* - Information Retrieval Applied
> Consulting | Training | Open Source
>
> Website: Sease.io <http://sease.io/>
> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
> <https://twitter.com/seaseltd> | Youtube
> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
> <https://github.com/seaseltd>
>
>
> On Wed, 17 May 2023 at 01:54, Robert Muir <rcm...@gmail.com> wrote:
>
>> Gus, I think I explained myself multiple times on issues and in this
>> thread. The performance is unacceptable, everyone knows it, but nobody is
>> talking about it.
>> I don't need to explain myself time and time again here.
>> You don't seem to understand the technical issues (at least you sure as
>> fuck don't know how service loading works, or you wouldn't have opened
>> https://github.com/apache/lucene/issues/12300 😂)
>>
>> I'm just the only one here completely unconstrained by any of Silicon
>> Valley's influences, free to speak my true mind without any repercussions,
>> so I do it. I don't give any fucks about ChatGPT.
>>
>> I'm standing by my technical veto. If you bypass it, I'll revert the
>> offending commit.
>>
>> As far as fixing the technical performance goes, I just opened an issue
>> with some ideas to at least improve CPU usage by a factor of N. It does not
>> help with the crazy heap memory usage or the other issues of the KNN
>> implementation causing shit like OOM on merge. But it is one step:
>> https://github.com/apache/lucene/issues/12302
>>
>>
>>
>> On Tue, May 16, 2023 at 7:45 AM Gus Heck <gus.h...@gmail.com> wrote:
>>
>>> Robert,
>>>
>>> Can you explain in clear technical terms the standard that must be met
>>> for performance? A benchmark that must run in X time on Y hardware, for
>>> example (and why that test is suitable)? Or some other reproducible
>>> criterion? So far I've heard you give an *opinion* that it's unusable, but
>>> that's not a technical criterion; others may have a different concept of
>>> what is usable to them.
>>>
>>> Forgive me if I misunderstand, but the essence of your argument has
>>> seemed to be
>>>
>>> "Performance isn't good enough, therefore we should force anyone who
>>> wants to experiment with something bigger to fork the code base to do it"
>>>
>>> Thus, it is necessary to have a clear, unambiguous standard for "good
>>> enough" that anyone can verify. A clear standard would also focus efforts
>>> at improvement.
>>>
>>> Where are the goal posts?
>>>
>>> FWIW I'm +1 on any of 2-4 since I believe the existence of a hard limit
>>> is fundamentally counterproductive in an open source setting, as it will
>>> lead to *fewer people* pushing the limits. Extremely few people are
>>> going to get into the nitty-gritty of optimizing things unless they are
>>> staring at code that they can prove does something interesting, but doesn't
>>> run fast enough for their purposes. If people hit a hard limit, more of
>>> them give up and never develop the code that will motivate them to look for
>>> optimizations.
>>>
>>> -Gus
>>>
>>> On Tue, May 16, 2023 at 6:04 AM Robert Muir <rcm...@gmail.com> wrote:
>>>
>>>> I still feel -1 (veto) on increasing this limit. Sending more emails
>>>> does not change the technical facts or make the veto go away.
>>>>
>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>>>> a.benede...@sease.io> wrote:
>>>>
>>>>> Hi all,
>>>>> we have finalized all the options proposed by the community, and we are
>>>>> ready to vote for the preferred one and then proceed with the
>>>>> implementation.
>>>>>
>>>>> *Option 1*
>>>>> Keep it as it is (dimension limit hardcoded to 1024)
>>>>> *Motivation*:
>>>>> We are close to improving on many fronts. Given the criticality of
>>>>> Lucene in computing infrastructure and the concerns raised by one of the
>>>>> most active stewards of the project, I think we should keep working toward
>>>>> improving the feature as is and only raise the limit once we can
>>>>> demonstrate the improvement unambiguously.
>>>>>
>>>>> *Option 2*
>>>>> Make the limit configurable, for example through a system property.
>>>>> *Motivation*:
>>>>> The system administrator can enforce a limit that their users need to
>>>>> respect, in line with whatever the admin decides is acceptable for them.
>>>>> The default can stay the current one.
>>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>>>> and any sort of plugin development.
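>>>>>
>>>>> To make this concrete, here is a minimal Java sketch of a
>>>>> system-property-based limit (the property and class names are only
>>>>> illustrative assumptions, not an agreed API):
>>>>>
>>>>>   // Hypothetical holder for a configurable vector-dimension limit.
>>>>>   // The property name "lucene.vector.maxDimensions" is an assumption
>>>>>   // for illustration; the default keeps today's hardcoded 1024.
>>>>>   public final class VectorLimits {
>>>>>     public static final int MAX_DIMENSIONS =
>>>>>         Integer.getInteger("lucene.vector.maxDimensions", 1024);
>>>>>
>>>>>     private VectorLimits() {}
>>>>>   }
>>>>>
>>>>> A deployment could then raise the limit at JVM startup (e.g.
>>>>> -Dlucene.vector.maxDimensions=2048) without any code or API change, while
>>>>> the default behaviour stays exactly as it is today.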
>>>>>
>>>>> *Option 3*
>>>>> Move the max dimension limit to a lower level, into an HNSW-specific
>>>>> implementation. Once there, this limit would not bind any other potential
>>>>> vector engine alternative or evolution.
>>>>> *Motivation:* There seem to be contradictory interpretations of the
>>>>> current HNSW implementation's performance. Some consider it OK, some do
>>>>> not, and it depends on the target data set and use case. Increasing the
>>>>> max dimension limit where it currently lives (in the top-level
>>>>> FloatVectorValues) would not allow potential alternatives (e.g. for other
>>>>> use cases) to be based on a lower limit.
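>>>>>
>>>>> As a hedged sketch of where such a lower-level limit could live (all
>>>>> names here are hypothetical, not the actual Lucene classes), each vector
>>>>> format would declare and enforce its own ceiling:
>>>>>
>>>>>   // Illustrative only: the limit belongs to the format implementation,
>>>>>   // so an alternative (non-HNSW) engine is free to choose a different one.
>>>>>   public abstract class VectorsFormatSketch {
>>>>>
>>>>>     /** Per-format ceiling, e.g. 1024 for HNSW, higher elsewhere. */
>>>>>     protected abstract int maxDimensions(String fieldName);
>>>>>
>>>>>     /** Checked before a vector field is written. */
>>>>>     public final void checkDimensions(String fieldName, int dimensions) {
>>>>>       int limit = maxDimensions(fieldName);
>>>>>       if (dimensions <= 0 || dimensions > limit) {
>>>>>         throw new IllegalArgumentException(
>>>>>             "field \"" + fieldName + "\" has " + dimensions
>>>>>                 + " dimensions, above this format's limit of " + limit);
>>>>>       }
>>>>>     }
>>>>>   }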
>>>>>
>>>>> *Option 4*
>>>>> Make it configurable and move it to an appropriate place.
>>>>> In particular, a
>>>>> simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be
>>>>> enough.
>>>>> *Motivation*:
>>>>> Both changes are good, not mutually exclusive, and could happen in any
>>>>> order.
>>>>> Someone suggested perfecting what the _default_ limit should be, but I've
>>>>> not seen an argument _against_ configurability, especially in this way: a
>>>>> toggle that doesn't bind Lucene's APIs in any way.
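>>>>>
>>>>> A minimal sketch of how Options 2 and 3 could combine here (again with
>>>>> purely illustrative names, not a final patch):
>>>>>
>>>>>   // Sketch only: the HNSW-specific code supplies its own limit from a
>>>>>   // system property, so the top-level vector API carries no hardcoded
>>>>>   // ceiling at all. Defaults to today's 1024 when the property is unset.
>>>>>   public class HnswFormatSketch {
>>>>>     public int maxDimensions(String fieldName) {
>>>>>       return Integer.getInteger("lucene.hnsw.maxDimensions", 1024);
>>>>>     }
>>>>>   }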
>>>>>
>>>>> I'll keep this [VOTE] open for a week and then proceed to the
>>>>> implementation.
>>>>> --------------------------
>>>>> *Alessandro Benedetti*
>>>>> Director @ Sease Ltd.
>>>>> *Apache Lucene/Solr Committer*
>>>>> *Apache Solr PMC Member*
>>>>>
>>>>> e-mail: a.benede...@sease.io
>>>>>
>>>>>
>>>>> *Sease* - Information Retrieval Applied
>>>>> Consulting | Training | Open Source
>>>>>
>>>>> Website: Sease.io <http://sease.io/>
>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>>>> <https://twitter.com/seaseltd> | Youtube
>>>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>>>> <https://github.com/seaseltd>
>>>>>
>>>>
>>>
>>> --
>>> http://www.needhamsoftware.com (work)
>>> http://www.the111shift.com (play)
>>>
>>
