+1 on the combination of #3 and #4.

Also good things to make sure of Uwe, thanks for calling those out.
(Especially about the limit only being used on write, not on read).

- Houston

On Tue, May 16, 2023 at 9:57 AM Uwe Schindler <u...@thetaphi.de> wrote:

> I agree with Dawid,
>
> I am +1 for those two options in combination:
>
>    - option 3 (make limit an HNSW specific thing). New formats may use
>    other limits (lower or higher).
>    - option 4 (make a system property with HNSW prefix). Adding the
>    system property must be done in same way like new properties for MMAP
>    directory (including access controller) so it can be denied by system admin
>    to be set in code (see
>    
> https://github.com/apache/lucene/blob/f53eb28af053d7612f7e4d1b2de05d33dc410645/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L327-L346
>    for example). Care has to be taken that the static initializers won't fail
>    is system properties cannot be read/set (system adminitrator enforces
>    default -> see mmap code). It also has to be made sure that an index
>    written with raised limit can still be read without the limit, so the limit
>    should not be glued into the file format. Otherwise I disagree with option
>    4.
>
> In short: I am fine with making it configurable only for HNSW if the limit
> is not glued into index format. The default should only be there to by
> default prevent people from doing wrong things, but changing default should
> not break reading/modifiying those indexes.
>
> Uwe
> Am 16.05.2023 um 15:37 schrieb Dawid Weiss:
>
>
> I'm for option 3 (limit at algorithm level), with the default there
> tunable via property (option 4).
>
> I understand Robert's concerns and I'd love to contribute a faster
> implementation but the reality is - I can't do it at the moment. I feel
> like experiments are good though and we shouldn't just ban people from
> trying - if somebody changes the (sane) default and gets burned by
> performance, perhaps it'll be an itch to work on speeding things up (much
> like it's already happening with Jonathan's patch).
>
> Dawid
>
> On Tue, May 16, 2023 at 10:50 AM Alessandro Benedetti <
> a.benede...@sease.io> wrote:
>
>> Hi all,
>> we have finalized all the options proposed by the community and we are
>> ready to vote for the preferred one and then proceed with the
>> implementation.
>>
>> *Option 1*
>> Keep it as it is (dimension limit hardcoded to 1024)
>> *Motivation*:
>> We are close to improving on many fronts. Given the criticality of Lucene
>> in computing infrastructure and the concerns raised by one of the most
>> active stewards of the project, I think we should keep working toward
>> improving the feature as is and move to up the limit after we can
>> demonstrate improvement unambiguously.
>>
>> *Option 2*
>> make the limit configurable, for example through a system property
>> *Motivation*:
>> The system administrator can enforce a limit its users need to respect
>> that it's in line with whatever the admin decided to be acceptable for
>> them.
>> The default can stay the current one.
>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>> and any sort of plugin development
>>
>> *Option 3*
>> Move the max dimension limit lower level to a HNSW specific
>> implementation. Once there, this limit would not bind any other potential
>> vector engine alternative/evolution.
>> *Motivation:* There seem to be contradictory performance interpretations
>> about the current HNSW implementation. Some consider its performance ok,
>> some not, and it depends on the target data set and use case. Increasing
>> the max dimension limit where it is currently (in top level
>> FloatVectorValues) would not allow potential alternatives (e.g. for other
>> use-cases) to be based on a lower limit.
>>
>> *Option 4*
>> Make it configurable and move it to an appropriate place.
>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>> 1024) should be enough.
>> *Motivation*:
>> Both are good and not mutually exclusive and could happen in any order.
>> Someone suggested to perfect what the _default_ limit should be, but I've
>> not seen an argument _against_ configurability.  Especially in this way --
>> a toggle that doesn't bind Lucene's APIs in any way.
>>
>> I'll keep this [VOTE] open for a week and then proceed to the
>> implementation.
>> --------------------------
>> *Alessandro Benedetti*
>> Director @ Sease Ltd.
>> *Apache Lucene/Solr Committer*
>> *Apache Solr PMC Member*
>>
>> e-mail: a.benede...@sease.io
>>
>>
>> *Sease* - Information Retrieval Applied
>> Consulting | Training | Open Source
>>
>> Website: Sease.io <http://sease.io/>
>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>> <https://twitter.com/seaseltd> | Youtube
>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>> <https://github.com/seaseltd>
>>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremenhttps://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>

Reply via email to