We agree backwards compatibility with the index should be maintained and that checkIndex should work. And we agree on a number of other things, but I want to focus on configurability. As long as the index contains the number of dimensions actually used in a specific segment & field, why couldn't checkIndex work if the dimension *limit* is configurable? It's not checkIndex's job to enforce the limit, only to check that the data appears consistent/valid, irrespective of how the number of dimensions came to be specified originally.
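To make that concrete, here is a rough sketch of the separation I mean (hypothetical names, not actual Lucene code; the Integer.getInteger idiom is the one proposed in Option 4 below):

    // Sketch only: write-time limit enforcement vs. read-time consistency checking.
    public final class VectorDimensionPolicy {

      // Option-4-style configurability; the default mirrors today's hardcoded 1024.
      private static final int MAX_DIMENSIONS =
          Integer.getInteger("lucene.hnsw.maxDimensions", 1024);

      // Enforced only when new vectors are indexed.
      public static void enforceWriteLimit(int dims) {
        if (dims <= 0 || dims > MAX_DIMENSIONS) {
          throw new IllegalArgumentException(
              "vector dimension " + dims + " outside configured limit " + MAX_DIMENSIONS);
        }
      }

      // What a checkIndex-style validation needs: compare the stored vector
      // against the dimension count recorded in the segment's field metadata.
      // Note that MAX_DIMENSIONS is never consulted here.
      public static void checkStoredVector(int dimsInFieldMetadata, float[] stored) {
        if (stored.length != dimsInFieldMetadata) {
          throw new IllegalStateException(
              "stored vector length " + stored.length
                  + " != field metadata dimension " + dimsInFieldMetadata);
        }
      }
    }

An index written under a higher configured limit still validates cleanly, because the checker trusts only what the segment itself declares.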
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Tue, May 16, 2023 at 10:58 PM Robert Muir <rcm...@gmail.com> wrote:

> My problem is that it impacts the default codec, which is supported by our backwards compatibility policy for many years. We can't just let the user determine backwards compatibility with a sysprop. How will checkIndex work? We have to have bounds, and we also have to allow for more performant implementations that might have different limitations. And I'm pretty sure we want a faster implementation than what we have in the future, and it will probably have different limits.
>
> For other codecs, it is fine to have a different limit, as I already said, since it is implementation dependent. And honestly the stuff in lucene/codecs can be more "fast and loose" because it doesn't require the extensive index back-compat guarantee.
>
> Again, my paramount concern is that index back-compat guarantee. When it comes to limits, the proper way is not to just keep bumping them without technical reasons; instead, the correct approach is to fix the technical problems and make them irrelevant. Great example here (merged this morning):
> https://github.com/apache/lucene/commit/f53eb28af053d7612f7e4d1b2de05d33dc410645
>
> On Tue, May 16, 2023 at 10:49 PM David Smiley <dsmi...@apache.org> wrote:
>
>> Robert, I have not heard from you (or anyone) an argument against System-property-based configurability (as I described in Option 4, via a System property). Uwe wisely notes that some care must be taken to ensure it actually works. Sure, of course. What concerns do you have with this?
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>> On Tue, May 16, 2023 at 9:50 PM Robert Muir <rcm...@gmail.com> wrote:
>>
>>> By the way, I agree with the idea to MOVE THE LIMIT UNCHANGED to the HNSW-specific code.
>>>
>>> This way, someone can write an alternative codec with vectors using some other, completely different approach that incorporates a different, more appropriate limit (maybe lower, maybe higher) depending upon their tradeoffs. We should encourage this, as I think it is the "only true fix" for the scalability issues: use a scalable algorithm! Also, alternative codecs don't force the project into many years of index backwards compatibility, which is really my paramount concern. We can lock ourselves into a truly bad place and become irrelevant (especially with scalar code implementing all this vector stuff, it is really senseless). In the meantime, I suggest we try to reduce pain for the default codec with the current implementation if possible. If that is not possible, we need a new codec that performs.
>>>
>>> On Tue, May 16, 2023 at 8:53 PM Robert Muir <rcm...@gmail.com> wrote:
>>>
>>>> Gus, I think I explained myself multiple times on issues and in this thread. The performance is unacceptable, everyone knows it, but nobody is talking about it. I don't need to explain myself time and time again here. You don't seem to understand the technical issues (at least you sure as fuck don't know how service loading works, or you wouldn't have opened https://github.com/apache/lucene/issues/12300 😂)
>>>>
>>>> I'm just the only one here completely unconstrained by any of Silicon Valley's influences to speak my true mind, without any repercussions, so I do it. I don't give any fucks about ChatGPT.
>>>> I'm standing by my technical veto. If you bypass it, I'll revert the offending commit.
>>>>
>>>> As far as fixing the technical performance goes, I just opened an issue with some ideas to at least improve CPU usage by a factor of N. It does not help with the crazy heap memory usage or the other issues of the KNN implementation causing shit like OOM on merge. But it is one step:
>>>> https://github.com/apache/lucene/issues/12302
>>>>
>>>> On Tue, May 16, 2023 at 7:45 AM Gus Heck <gus.h...@gmail.com> wrote:
>>>>
>>>>> Robert,
>>>>>
>>>>> Can you explain in clear technical terms the standard that must be met for performance? A benchmark that must run in X time on Y hardware, for example (and why that test is suitable)? Or some other reproducible criterion? So far I've heard you give an *opinion* that it's unusable, but that's not a technical criterion; others may have a different concept of what is usable to them.
>>>>>
>>>>> Forgive me if I misunderstand, but the essence of your argument has seemed to be:
>>>>>
>>>>> "Performance isn't good enough, therefore we should force anyone who wants to experiment with something bigger to fork the code base to do it."
>>>>>
>>>>> Thus, it is necessary to have a clear, unambiguous standard that anyone can verify for "good enough". A clear standard would also focus efforts at improvement.
>>>>>
>>>>> Where are the goal posts?
>>>>>
>>>>> FWIW, I'm +1 on any of 2-4, since I believe the existence of a hard limit is fundamentally counterproductive in an open source setting: it will lead to *fewer people* pushing the limits. Extremely few people are going to get into the nitty-gritty of optimizing things unless they are staring at code that they can prove does something interesting but doesn't run fast enough for their purposes. If people hit a hard limit, more of them give up and never develop the code that would motivate them to look for optimizations.
>>>>>
>>>>> -Gus
>>>>>
>>>>> On Tue, May 16, 2023 at 6:04 AM Robert Muir <rcm...@gmail.com> wrote:
>>>>>
>>>>>> I still feel -1 (veto) on increasing this limit. Sending more emails does not change the technical facts or make the veto go away.
>>>>>>
>>>>>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> We have finalized all the options proposed by the community and are ready to vote for the preferred one and then proceed with the implementation.
>>>>>>>
>>>>>>> *Option 1*
>>>>>>> Keep it as it is (dimension limit hardcoded to 1024).
>>>>>>> *Motivation*:
>>>>>>> We are close to improving on many fronts. Given the criticality of Lucene in computing infrastructure and the concerns raised by one of the most active stewards of the project, I think we should keep working toward improving the feature as is and move to raise the limit after we can demonstrate improvement unambiguously.
>>>>>>>
>>>>>>> *Option 2*
>>>>>>> Make the limit configurable, for example through a system property.
>>>>>>> *Motivation*:
>>>>>>> The system administrator can enforce a limit that their users need to respect, in line with whatever the admin has decided is acceptable for them. The default can stay the current one.
>>>>>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, and any sort of plugin development.
>>>>>>>
>>>>>>> *Option 3*
>>>>>>> Move the max dimension limit to a lower level, into an HNSW-specific implementation. Once there, this limit would not bind any other potential vector engine alternative/evolution.
>>>>>>> *Motivation*:
>>>>>>> There seem to be contradictory performance interpretations of the current HNSW implementation. Some consider its performance OK, some not; it depends on the target data set and use case. Increasing the max dimension limit where it currently sits (in the top-level FloatVectorValues) would not allow potential alternatives (e.g. for other use cases) to be based on a lower limit.
>>>>>>>
>>>>>>> *Option 4*
>>>>>>> Make it configurable and move it to an appropriate place.
>>>>>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024) should be enough.
>>>>>>> *Motivation*:
>>>>>>> Both are good, are not mutually exclusive, and could happen in any order.
>>>>>>> Someone suggested perfecting what the _default_ limit should be, but I've not seen an argument _against_ configurability. Especially in this way: a toggle that doesn't bind Lucene's APIs in any way.
>>>>>>>
>>>>>>> I'll keep this [VOTE] open for a week and then proceed to the implementation.
>>>>>>> --------------------------
>>>>>>> *Alessandro Benedetti*
>>>>>>> Director @ Sease Ltd.
>>>>>>> *Apache Lucene/Solr Committer*
>>>>>>> *Apache Solr PMC Member*
>>>>>>>
>>>>>>> e-mail: a.benede...@sease.io
>>>>>>>
>>>>>>> *Sease* - Information Retrieval Applied
>>>>>>> Consulting | Training | Open Source
>>>>>>>
>>>>>>> Website: Sease.io <http://sease.io/>
>>>>>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter <https://twitter.com/seaseltd> | Youtube <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github <https://github.com/seaseltd>
>>>>>
>>>>> --
>>>>> http://www.needhamsoftware.com (work)
>>>>> http://www.the111shift.com (play)
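For reference, Options 3 and 4 combined amount to something like the following sketch: the limit becomes private to the HNSW-specific format, so alternative vector codecs can declare their own bound, and it is read once from a system property. Class and method names here are illustrative only, not Lucene's actual API:

    // Illustrative sketch of Options 3 + 4; names are hypothetical, not the real Lucene API.
    public final class SketchHnswVectorsFormat {

      // The current limit, unchanged, as the default.
      private static final int DEFAULT_MAX_DIMENSIONS = 1024;

      // Option 4: overridable at JVM startup, e.g. -Dlucene.hnsw.maxDimensions=2048;
      // Integer.getInteger falls back to the default when the property is unset.
      private static final int MAX_DIMENSIONS =
          Integer.getInteger("lucene.hnsw.maxDimensions", DEFAULT_MAX_DIMENSIONS);

      // Option 3: the bound is owned by this HNSW implementation alone; a
      // different vectors format remains free to pick a lower (or higher) limit.
      public int getMaxDimensions() {
        return MAX_DIMENSIONS;
      }
    }

A deployment needing larger vectors would start the JVM with -Dlucene.hnsw.maxDimensions=2048 (for example) and change no Lucene APIs; the default behavior stays exactly as it is today.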