Re: [VOTE] Dimension Limit for KNN Vectors

Marcus Eagan Tue, 16 May 2023 07:27:31 -0700

Given that Robert has put in his veto, aren’t we clear on what we need to
do for him to change his mind? He’s been pretty clear and the rules of veto
are cut and dry.


Most of the people that have contributed to kNN vectors recently are not
even on the thread. I think improving the feature should be the focus of
the Lucene community at this juncture.

On Tue, May 16, 2023 at 7:09 AM Houston Putman <hous...@apache.org> wrote:

> +1 on the combination of #3 and #4.
>
> Also good things to make sure of Uwe, thanks for calling those out.
> (Especially about the limit only being used on write, not on read).
>
> - Houston
>
> On Tue, May 16, 2023 at 9:57 AM Uwe Schindler <u...@thetaphi.de> wrote:
>
>> I agree with Dawid,
>>
>> I am +1 for those two options in combination:
>>
>>    - option 3 (make limit an HNSW specific thing). New formats may use
>>    other limits (lower or higher).
>>    - option 4 (make a system property with HNSW prefix). Adding the
>>    system property must be done in same way like new properties for MMAP
>>    directory (including access controller) so it can be denied by system 
>> admin
>>    to be set in code (see
>>    
>> https://github.com/apache/lucene/blob/f53eb28af053d7612f7e4d1b2de05d33dc410645/lucene/core/src/java/org/apache/lucene/store/MMapDirectory.java#L327-L346
>>    for example). Care has to be taken that the static initializers won't fail
>>    is system properties cannot be read/set (system adminitrator enforces
>>    default -> see mmap code). It also has to be made sure that an index
>>    written with raised limit can still be read without the limit, so the 
>> limit
>>    should not be glued into the file format. Otherwise I disagree with option
>>    4.
>>
>> In short: I am fine with making it configurable only for HNSW if the
>> limit is not glued into index format. The default should only be there to
>> by default prevent people from doing wrong things, but changing default
>> should not break reading/modifiying those indexes.
>>
>> Uwe
>> Am 16.05.2023 um 15:37 schrieb Dawid Weiss:
>>
>>
>> I'm for option 3 (limit at algorithm level), with the default there
>> tunable via property (option 4).
>>
>> I understand Robert's concerns and I'd love to contribute a faster
>> implementation but the reality is - I can't do it at the moment. I feel
>> like experiments are good though and we shouldn't just ban people from
>> trying - if somebody changes the (sane) default and gets burned by
>> performance, perhaps it'll be an itch to work on speeding things up (much
>> like it's already happening with Jonathan's patch).
>>
>> Dawid
>>
>> On Tue, May 16, 2023 at 10:50 AM Alessandro Benedetti <
>> a.benede...@sease.io> wrote:
>>
>>> Hi all,
>>> we have finalized all the options proposed by the community and we are
>>> ready to vote for the preferred one and then proceed with the
>>> implementation.
>>>
>>> *Option 1*
>>> Keep it as it is (dimension limit hardcoded to 1024)
>>> *Motivation*:
>>> We are close to improving on many fronts. Given the criticality of
>>> Lucene in computing infrastructure and the concerns raised by one of the
>>> most active stewards of the project, I think we should keep working toward
>>> improving the feature as is and move to up the limit after we can
>>> demonstrate improvement unambiguously.
>>>
>>> *Option 2*
>>> make the limit configurable, for example through a system property
>>> *Motivation*:
>>> The system administrator can enforce a limit its users need to respect
>>> that it's in line with whatever the admin decided to be acceptable for
>>> them.
>>> The default can stay the current one.
>>> This should open the doors for Apache Solr, Elasticsearch, OpenSearch,
>>> and any sort of plugin development
>>>
>>> *Option 3*
>>> Move the max dimension limit lower level to a HNSW specific
>>> implementation. Once there, this limit would not bind any other potential
>>> vector engine alternative/evolution.
>>> *Motivation:* There seem to be contradictory performance
>>> interpretations about the current HNSW implementation. Some consider its
>>> performance ok, some not, and it depends on the target data set and use
>>> case. Increasing the max dimension limit where it is currently (in top
>>> level FloatVectorValues) would not allow potential alternatives (e.g. for
>>> other use-cases) to be based on a lower limit.
>>>
>>> *Option 4*
>>> Make it configurable and move it to an appropriate place.
>>> In particular, a simple Integer.getInteger("lucene.hnsw.maxDimensions",
>>> 1024) should be enough.
>>> *Motivation*:
>>> Both are good and not mutually exclusive and could happen in any order.
>>> Someone suggested to perfect what the _default_ limit should be, but
>>> I've not seen an argument _against_ configurability.  Especially in this
>>> way -- a toggle that doesn't bind Lucene's APIs in any way.
>>>
>>> I'll keep this [VOTE] open for a week and then proceed to the
>>> implementation.
>>> --------------------------
>>> *Alessandro Benedetti*
>>> Director @ Sease Ltd.
>>> *Apache Lucene/Solr Committer*
>>> *Apache Solr PMC Member*
>>>
>>> e-mail: a.benede...@sease.io
>>>
>>>
>>> *Sease* - Information Retrieval Applied
>>> Consulting | Training | Open Source
>>>
>>> Website: Sease.io <http://sease.io/>
>>> LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
>>> <https://twitter.com/seaseltd> | Youtube
>>> <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> | Github
>>> <https://github.com/seaseltd>
>>>
>> --
>> Uwe SchindlerAchterdiek 19, D-28357 Bremen 
>> <https://www.google.com/maps/search/Achterdiek+19,+D-28357+Bremen?entry=gmail&source=g>https://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> --
Marcus Eagan

Re: [VOTE] Dimension Limit for KNN Vectors

Reply via email to