+1 on the combination of #3 and #4. Also good things to make sure of Uwe, thanks for calling those out. (Especially about the limit only being used on write, not on read).
- Houston On Tue, May 16, 2023 at 9:57 AM Uwe Schindler <u...@thetaphi.de> wrote: > I agree with Dawid, > > I am +1 for those two options in combination: > > - option 3 (make limit an HNSW specific thing). New formats may use > other limits (lower or higher). > - option 4 (make a system property with HNSW prefix). Adding the > system property must be done in same way like new properties for MMAP > directory (including access controller) so it can be denied by system admin > to be set in code (see > > https://github.com/apache/lucene/blob/f53eb28af053d7612f7e4d1b2de05d33dc410645/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L327-L346 > for example). Care has to be taken that the static initializers won't fail > is system properties cannot be read/set (system adminitrator enforces > default -> see mmap code). It also has to be made sure that an index > written with raised limit can still be read without the limit, so the limit > should not be glued into the file format. Otherwise I disagree with option > 4. > > In short: I am fine with making it configurable only for HNSW if the limit > is not glued into index format. The default should only be there to by > default prevent people from doing wrong things, but changing default should > not break reading/modifiying those indexes. > > Uwe > Am 16.05.2023 um 15:37 schrieb Dawid Weiss: > > > I'm for option 3 (limit at algorithm level), with the default there > tunable via property (option 4). > > I understand Robert's concerns and I'd love to contribute a faster > implementation but the reality is - I can't do it at the moment. I feel > like experiments are good though and we shouldn't just ban people from > trying - if somebody changes the (sane) default and gets burned by > performance, perhaps it'll be an itch to work on speeding things up (much > like it's already happening with Jonathan's patch). > > Dawid > > On Tue, May 16, 2023 at 10:50 AM Alessandro Benedetti < > a.benede...@sease.io> wrote: > >> Hi all, >> we have finalized all the options proposed by the community and we are >> ready to vote for the preferred one and then proceed with the >> implementation. >> >> *Option 1* >> Keep it as it is (dimension limit hardcoded to 1024) >> *Motivation*: >> We are close to improving on many fronts. Given the criticality of Lucene >> in computing infrastructure and the concerns raised by one of the most >> active stewards of the project, I think we should keep working toward >> improving the feature as is and move to up the limit after we can >> demonstrate improvement unambiguously. >> >> *Option 2* >> make the limit configurable, for example through a system property >> *Motivation*: >> The system administrator can enforce a limit its users need to respect >> that it's in line with whatever the admin decided to be acceptable for >> them. >> The default can stay the current one. >> This should open the doors for Apache Solr, Elasticsearch, OpenSearch, >> and any sort of plugin development >> >> *Option 3* >> Move the max dimension limit lower level to a HNSW specific >> implementation. Once there, this limit would not bind any other potential >> vector engine alternative/evolution. >> *Motivation:* There seem to be contradictory performance interpretations >> about the current HNSW implementation. Some consider its performance ok, >> some not, and it depends on the target data set and use case. Increasing >> the max dimension limit where it is currently (in top level >> FloatVectorValues) would not allow potential alternatives (e.g. for other >> use-cases) to be based on a lower limit. >> >> *Option 4* >> Make it configurable and move it to an appropriate place. >> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions", >> 1024) should be enough. >> *Motivation*: >> Both are good and not mutually exclusive and could happen in any order. >> Someone suggested to perfect what the _default_ limit should be, but I've >> not seen an argument _against_ configurability. Especially in this way -- >> a toggle that doesn't bind Lucene's APIs in any way. >> >> I'll keep this [VOTE] open for a week and then proceed to the >> implementation. >> -------------------------- >> *Alessandro Benedetti* >> Director @ Sease Ltd. >> *Apache Lucene/Solr Committer* >> *Apache Solr PMC Member* >> >> e-mail: a.benede...@sease.io >> >> >> *Sease* - Information Retrieval Applied >> Consulting | Training | Open Source >> >> Website: Sease.io <http://sease.io/> >> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter >> <https://twitter.com/seaseltd> | Youtube >> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github >> <https://github.com/seaseltd> >> > -- > Uwe Schindler > Achterdiek 19, D-28357 Bremenhttps://www.thetaphi.de > eMail: u...@thetaphi.de > >