see https://markmail.org/message/kf4nzoqyhwacb7ri
On Wed, May 17, 2023 at 10:09 AM David Smiley <dsmi...@apache.org> wrote:

>
> easily be circumvented by a user
>
> This is a revelation to me and others, if true. Michael, please then
> point to a test or code snippet that shows the Lucene user community what
> they want to see so they are unblocked from their explorations of vector
> search.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, May 17, 2023 at 7:51 AM Michael Sokolov <msoko...@gmail.com>
> wrote:
>
>> I think I've said before on this list we don't actually enforce the limit
>> in any way that can't easily be circumvented by a user. The codec already
>> supports any size vector - it doesn't impose any limit. The way the API is
>> written you can *already today* create an index with max-int sized vectors
>> and we are committed to supporting that going forward by our backwards
>> compatibility policy as Robert points out. This wasn't intentional, I
>> think, but it is the facts.
>>
>> Given that, I think this whole discussion is not really necessary.
>>
>> On Tue, May 16, 2023 at 4:50 AM Alessandro Benedetti <
>> a.benede...@sease.io> wrote:
>>
>>> Hi all,
>>> we have finalized all the options proposed by the community and we are
>>> ready to vote for the preferred one and then proceed with the
>>> implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts. Given the criticality of
>>> Lucene in computing infrastructure and the concerns raised by one of the
>>> most active stewards of the project, I think we should keep working toward
>>> improving the feature as is and move to up the limit after we can
>>> demonstrate improvement unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example through a system property
>>> *Motivation*:
>>> The system administrator can enforce a limit its users need to respect
>>> that it's in line with whatever the admin decided to be acceptable for
>>> them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>> and any sort of plugin development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to a HNSW specific
>>> implementation. Once there, this limit would not bind any other potential
>>> vector engine alternative/evolution.
>>> *Motivation:* There seem to be contradictory performance
>>> interpretations about the current HNSW implementation. Some consider its
>>> performance ok, some not, and it depends on the target data set and use
>>> case. Increasing the max dimension limit where it is currently (in top
>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>> other use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an appropriate place.
>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Both are good and not mutually exclusive and could happen in any order.
>>> Someone suggested to perfect what the _default_ limit should be, but
>>> I've not seen an argument _against_ configurability. Especially in this
>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>
>>> I'll keep this [VOTE] open for a week and then proceed to the
>>> implementation.
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benede...@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>>
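
For readers following the vote: Option 4 in the quoted message proposes reading the limit with Integer.getInteger("lucene.hnsw.maxDimensions", 1024). The following is only a rough sketch of what such a check could look like, not Lucene code. The property name and the default of 1024 come from the quoted proposal; the class and method names (MaxDimensionsSketch, maxDimensions, checkDimension) are hypothetical and exist only for illustration.

```java
// Illustrative sketch only: these names are hypothetical and are not part of Lucene.
final class MaxDimensionsSketch {

    // Property name and default value as quoted in Option 4 of the vote.
    static final String MAX_DIMS_PROPERTY = "lucene.hnsw.maxDimensions";
    static final int DEFAULT_MAX_DIMS = 1024;

    // Reads the limit from the JVM system properties, falling back to the
    // default when the property is unset or is not a parseable integer.
    static int maxDimensions() {
        return Integer.getInteger(MAX_DIMS_PROPERTY, DEFAULT_MAX_DIMS);
    }

    // Rejects vector dimensions that are non-positive or above the configured limit.
    static void checkDimension(int dimension) {
        int max = maxDimensions();
        if (dimension <= 0) {
            throw new IllegalArgumentException(
                "vector dimension must be positive; got " + dimension);
        }
        if (dimension > max) {
            throw new IllegalArgumentException(
                "vector dimension " + dimension + " exceeds the configured maximum " + max);
        }
    }

    public static void main(String[] args) {
        System.out.println("configured max dimensions: " + maxDimensions());
        checkDimension(768); // within the default limit of 1024
        try {
            checkDimension(Integer.MAX_VALUE);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under this sketch an administrator would raise the limit at JVM startup, for example with -Dlucene.hnsw.maxDimensions=2048. The sketch re-reads the property on every call to keep the example short; an actual implementation might read it once into a constant instead.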