I agree with Dawid,

I am +1 for those two options in combination:

 * option 3 (make the limit an HNSW-specific thing). New formats may
   use other limits (lower or higher).
 * option 4 (make it a system property with an HNSW prefix). Adding the
   system property must be done in the same way as the new properties
   for the MMap directory (including the access controller), so that a
   system administrator can deny setting it in code (see
   https://github.com/apache/lucene/blob/f53eb28af053d7612f7e4d1b2de05d33dc410645/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L327-L346
   for an example). Care has to be taken that the static initializers
   won't fail if system properties cannot be read or set (the system
   administrator enforces the default -> see the MMap code); a sketch
   of that pattern is below this list. It also has to be made sure
   that an index written with a raised limit can still be read without
   the limit, so the limit should not be glued into the file format.
   Otherwise I disagree with option 4.
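
To illustrate the pattern, here is a minimal sketch of such a guarded
property read, modeled on the MMapDirectory code linked above; the
class name, constant, and property name ("lucene.hnsw.maxDimensions")
are placeholders, not a settled API:

    // Sketch only: class, constant, and property name are placeholders.
    // Following the MMapDirectory pattern: read the property inside
    // doPrivileged so only this code's own permissions are checked, and
    // never let the static initializer fail if the security policy
    // denies reading system properties.
    final class HnswMaxDimensions {
      private static final int DEFAULT_MAX_DIMENSIONS = 1024;

      static final int MAX_DIMENSIONS = readMaxDimensions();

      private static int readMaxDimensions() {
        try {
          return java.security.AccessController.doPrivileged(
              (java.security.PrivilegedAction<Integer>)
                  () -> Integer.getInteger("lucene.hnsw.maxDimensions",
                                           DEFAULT_MAX_DIMENSIONS));
        } catch (SecurityException se) {
          // The system administrator denied reading system properties:
          // fall back to the default instead of failing class
          // initialization.
          return DEFAULT_MAX_DIMENSIONS;
        }
      }
    }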

In short: I am fine with making it configurable only for HNSW if the limit is not glued into the index format. The default should only be there to prevent people from doing the wrong thing by default, but changing that default must not break reading or modifying those indexes.

Uwe

On 16.05.2023 at 15:37, Dawid Weiss wrote:

I'm for option 3 (limit at algorithm level), with the default there tunable via property (option 4).

I understand Robert's concerns and I'd love to contribute a faster implementation, but the reality is that I can't do it at the moment. I feel like experiments are good though, and we shouldn't just ban people from trying: if somebody changes the (sane) default and gets burned by performance, perhaps it'll be an itch to work on speeding things up (much like it's already happening with Jonathan's patch).

Dawid

On Tue, May 16, 2023 at 10:50 AM Alessandro Benedetti <a.benede...@sease.io> wrote:

    Hi all,
    we have finalized all the options proposed by the community and we
    are ready to vote for the preferred one and then proceed with the
    implementation.

    *Option 1*
    Keep it as it is (dimension limit hardcoded to 1024)
    *Motivation*:
    We are close to improving on many fronts. Given the criticality of
    Lucene in computing infrastructure and the concerns raised by one
    of the most active stewards of the project, I think we should keep
    working toward improving the feature as it is and move to raise
    the limit only after we can demonstrate improvement unambiguously.

    *Option 2*
    Make the limit configurable, for example through a system property.
    *Motivation*:
    The system administrator can enforce a limit that their users need
    to respect and that is in line with whatever the administrator has
    decided is acceptable for them.
    The default can stay the current one.
    This should open the doors for Apache Solr, Elasticsearch,
    OpenSearch, and any sort of plugin development.

    *Option 3*
    Move the max dimension limit to a lower level, into an
    HNSW-specific implementation. Once there, this limit would not
    bind any other potential vector engine alternative/evolution.
    *Motivation:*
    There seem to be contradictory performance interpretations of the
    current HNSW implementation. Some consider its performance OK,
    some do not, and it depends on the target data set and use case.
    Increasing the max dimension limit where it currently lives (in
    the top-level FloatVectorValues) would not allow potential
    alternatives (e.g. for other use cases) to be based on a lower
    limit.
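
    To make the idea concrete, here is a minimal sketch of a
    format-local limit; the class and method names are purely
    illustrative, not actual Lucene API:

        // Hypothetical sketch of option 3: the cap is owned by the
        // HNSW-specific code rather than by the top-level
        // FloatVectorValues, so another vectors format could pick a
        // different (lower or higher) value.
        final class HnswLimits {
          static final int MAX_DIMENSIONS = 1024; // HNSW-specific cap

          static void checkDimension(int dimension) {
            if (dimension > MAX_DIMENSIONS) {
              throw new IllegalArgumentException(
                  "vector dimension " + dimension
                      + " exceeds the HNSW limit of " + MAX_DIMENSIONS);
            }
          }
        }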

    *Option 4*
    Make it configurable and move it to an appropriate place.
    In particular, a
    simple Integer.getInteger("lucene.hnsw.maxDimensions", 1024)
    should be enough.
    *Motivation*:
    Both changes are good, they are not mutually exclusive, and they
    could happen in any order.
    Someone suggested perfecting what the _default_ limit should be,
    but I've not seen an argument _against_ configurability.
    Especially in this way -- a toggle that doesn't bind Lucene's APIs
    in any way.
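
    As a rough illustration (the class name and placement are
    assumptions; only the property read itself comes from the proposal
    above), the toggle could be as small as:

        // Hypothetical sketch of option 4: the limit is read once from
        // a system property, falling back to 1024 when it is not set.
        final class HnswConfig {
          static final int MAX_DIMENSIONS =
              Integer.getInteger("lucene.hnsw.maxDimensions", 1024);
        }

    Raising it would then just be a matter of passing e.g.
    -Dlucene.hnsw.maxDimensions=2048 on the JVM command line, and
    indexes written with a raised limit would remain readable with the
    default as long as the value is not baked into the file format.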

    I'll keep this [VOTE] open for a week and then proceed to the
    implementation.
    --------------------------
    *Alessandro Benedetti*
    Director @ Sease Ltd.
    /Apache Lucene/Solr Committer/
    /Apache Solr PMC Member/

    e-mail: a.benede...@sease.io

    *Sease* - Information Retrieval Applied
    Consulting | Training | Open Source

    Website: Sease.io <http://sease.io/>
    LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
    <https://twitter.com/seaseltd> | Youtube
    <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
    Github <https://github.com/seaseltd>

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de
