Re: [Proposal] Remove max number of dimensions for KNN vectors

Michael Wechner Wed, 05 Apr 2023 05:11:36 -0700


Am 05.04.23 um 12:34 schrieb Alessandro Benedetti:

Thanks Mike for the insight!

What would be the next steps then?
I see agreement but also the necessity of identifying a candidate MAX.
Should create a VOTE thread, where we propose some values with ajustification and we vote?



+1

Thanks

Michael


In this way we can create a pull request and merge relatively soon.

Cheers

On Tue, 4 Apr 2023, 14:47 Michael Wechner, <michael.wech...@wyona.com>wrote:


    IIUC we all agree that the limit could be raised, but we need some
    solid reasoning what limit makes sense, resp. why do we set this
    particular limit (e.g. 2048), right?

    Thanks

    Michael


    Am 04.04.23 um 15:32 schrieb Michael McCandless:

    > I am not in favor of just doubling it as suggested by some
    people, I would ideally prefer a solution that remains there to a
    decent extent, rather than having to modifying it anytime someone
    requires a higher limit.

    The problem with this approach is it is a one-way door, once
    released.  We would not be able to lower the limit again in the
    future without possibly breaking some applications.

    > For example, we don't limit the number of docs per index to an
    arbitrary maximum of N, you push how many docs you like and if
    they are too much for your system, you get terrible
    performance/crashes/whatever.

    Correction: we do check this limit and throw a specific exception
    now: https://github.com/apache/lucene/issues/6905

    +1 to raise the limit, but not remove it.

    Mike McCandless

    http://blog.mikemccandless.com


    On Mon, Apr 3, 2023 at 9:51 AM Alessandro Benedetti
    <a.benede...@sease.io> wrote:

        ... and what would be the next limit?
        I guess we'll need to motivate it better than the 1024 one.
        I appreciate the fact that a limit is pretty much wanted by
        everyone but I suspect we'll need some solid foundation for
        deciding the amount (and it should be high enough to avoid
        continuous changes)

        Cheers

        On Sun, 2 Apr 2023, 07:29 Michael Wechner,
        <michael.wech...@wyona.com> wrote:

            btw, what was the reasoning to set the current limit to 1024?

            Thanks

            Michael

            Am 01.04.23 um 14:47 schrieb Michael Sokolov:

            I'm also in favor of raising this limit. We do see some
            datasets with higher than 1024 dims. I also think we
            need to keep a limit. For example we currently need to
            keep all the vectors in RAM while indexing and we want
            to be able to support reasonable numbers of vectors in
            an index segment. Also we don't know what innovations
            might come down the road. Maybe someday we want to do
            product quantization and enforce that (k, m) both fit in
            a byte -- we wouldn't be able to do that if a vector's
            dimension were to exceed 32K.

            On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti
            <a.benede...@sease.io> wrote:

                I am also curious what would be the worst-case
                scenario if we remove the constant at all (so
                automatically the limit becomes the Java
                Integer.MAX_VALUE).
                i.e.
                right now if you exceed the limit you get:

                    if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
                    throw new IllegalArgumentException(
                    "cannot index vectors with dimension greater
                    than " + ByteVectorValues.MAX_DIMENSIONS);
                    }


                in relation to:

                    These limits allow us to
                    better tune our data structures, prevent
                    overflows, help ensure we
                    have good test coverage, etc.

                I agree 100% especially for typing stuff properly
                and avoiding resource waste here and there, but I am
                not entirely sure this is the case for the current
                implementation i.e. do we have optimizations in
                place that assume the max dimension to be 1024?
                If I missed that (and I likely have), I of course
                suggest the contribution should not just blindly
                remove the limit, but do it appropriately.
                I am not in favor of just doubling it as suggested
                by some people, I would ideally prefer a solution
                that remains there to a decent extent, rather than
                having to modifying it anytime someone requires a
                higher limit.

                Cheers
                --------------------------
                *Alessandro Benedetti*
                Director @ Sease Ltd.
                /Apache Lucene/Solr Committer/
                /Apache Solr PMC Member/

                e-mail: a.benede...@sease.io/
                /

                *Sease* - Information Retrieval Applied
                Consulting | Training | Open Source

                Website: Sease.io <http://sease.io/>
                LinkedIn <https://linkedin.com/company/sease-ltd> |
                Twitter <https://twitter.com/seaseltd> | Youtube
                <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
                Github <https://github.com/seaseltd>


                On Fri, 31 Mar 2023 at 16:12, Michael Wechner
                <michael.wech...@wyona.com> wrote:

                    OpenAI reduced their size to 1536 dimensions

                    https://openai.com/blog/new-and-improved-embedding-model

                    so 2048 would work :-)

                    but other services do provide also higher
                    dimensions with sometimes
                    slightly better accuracy

                    Thanks

                    Michael


                    Am 31.03.23 um 14:45 schrieb Adrien Grand:
                    > I'm supportive of bumping the limit on the
                    maximum dimension for
                    > vectors to something that is above what the
                    majority of users need,
                    > but I'd like to keep a limit. We have limits
                    for other things like the
                    > max number of docs per index, the max term
                    length, the max number of
                    > dimensions of points, etc. and there are a few
                    things that we don't
                    > have limits on that I wish we had limits on.
                    These limits allow us to
                    > better tune our data structures, prevent
                    overflows, help ensure we
                    > have good test coverage, etc.
                    >
                    > That said, these other limits we have in place
                    are quite high. E.g.
                    > the 32kB term limit, nobody would ever type a
                    32kB term in a text box.
                    > Likewise for the max of 8 dimensions for
                    points: a segment cannot
                    > possibly have 2 splits per dimension on
                    average if it doesn't have
                    > 512*2^(8*2)=34M docs, a sizable dataset
                    already, so more dimensions
                    > than 8 would likely defeat the point of
                    indexing. In contrast, our
                    > limit on the number of dimensions of vectors
                    seems to be under what
                    > some users would like, and while I understand
                    the performance argument
                    > against bumping the limit, it doesn't feel to
                    me like something that
                    > would be so bad that we need to prevent users
                    from using numbers of
                    > dimensions in the low thousands, e.g. top-k
                    KNN searches would still
                    > look at a very small subset of the full dataset.
                    >
                    > So overall, my vote would be to bump the limit
                    to 2048 as suggested by
                    > Mayya on the issue that you linked.
                    >
                    > On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner
                    > <michael.wech...@wyona.com> wrote:
                    >> Thanks Alessandro for summarizing the
                    discussion below!
                    >>
                    >> I understand that there is no clear reasoning
                    re what is the best embedding size, whereas I
                    think heuristic approaches like described by the
                    following link can be helpful
                    >>
                    >>
                    
https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
                    >>
                    >> Having said this, we see various embedding
                    services providing higher dimensions than 1024,
                    like for example OpenAI, Cohere and Aleph Alpha.
                    >>
                    >> And it would be great if we could run
                    benchmarks without having to recompile Lucene
                    ourselves.
                    >>
                    >> Therefore I would to suggest to either
                    increase the limit or even better to remove the
                    limit and add a disclaimer, that people should
                    be aware of possible crashes etc.
                    >>
                    >> Thanks
                    >>
                    >> Michael
                    >>
                    >>
                    >>
                    >>
                    >> Am 31.03.23 um 11:43 schrieb Alessandro
                    Benedetti:
                    >>
                    >>
                    >> I've been monitoring various discussions on
                    Pull Requests about changing the max number of
                    dimensions allowed for Lucene HNSW vectors:
                    >>
                    >> https://github.com/apache/lucene/pull/12191
                    >>
                    >> https://github.com/apache/lucene/issues/11507
                    >>
                    >>
                    >> I would like to set up a discussion and
                    potentially a vote about this.
                    >>
                    >> I have seen some strong opposition from a few
                    people but a majority of favor in this direction.
                    >>
                    >>
                    >> Motivation
                    >>
                    >> We were discussing in the Solr slack channel
                    with Ishan Chattopadhyaya, Marcus Eagan, and
                    David Smiley about some neural search
                    integrations in Solr:
                    https://github.com/openai/chatgpt-retrieval-plugin
                    >>
                    >>
                    >> Proposal
                    >>
                    >> No hard limit at all.
                    >>
                    >> As for many other Lucene areas, users will be
                    allowed to push the system to the limit of their
                    resources and get terrible performances or
                    crashes if they want.
                    >>
                    >>
                    >> What we are NOT discussing
                    >>
                    >> - Quality and scalability of the HNSW algorithm
                    >>
                    >> - dimensionality reduction
                    >>
                    >> - strategies to fit in an arbitrary
                    self-imposed limit
                    >>
                    >>
                    >> Benefits
                    >>
                    >> - users can use the models they want to
                    generate vectors
                    >>
                    >> - removal of an arbitrary limit that blocks
                    some integrations
                    >>
                    >>
                    >> Cons
                    >>
                    >>   - if you go for vectors with high
                    dimensions, there's no guarantee you get
                    acceptable performance for your use case
                    >>
                    >>
                    >>
                    >> I want to keep it simple, right now in many
                    Lucene areas, you can push the system to not
                    acceptable performance/ crashes.
                    >>
                    >> For example, we don't limit the number of
                    docs per index to an arbitrary maximum of N, you
                    push how many docs you like and if they are too
                    much for your system, you get terrible
                    performance/crashes/whatever.
                    >>
                    >>
                    >> Limits caused by primitive java types will
                    stay there behind the scene, and that's
                    acceptable, but I would prefer to not have
                    arbitrary hard-coded ones that may limit the
                    software usability and integration which is
                    extremely important for a library.
                    >>
                    >>
                    >> I strongly encourage people to add benefits
                    and cons, that I missed (I am sure I missed some
                    of them, but wanted to keep it simple)
                    >>
                    >>
                    >> Cheers
                    >>
                    >> --------------------------
                    >> Alessandro Benedetti
                    >> Director @ Sease Ltd.
                    >> Apache Lucene/Solr Committer
                    >> Apache Solr PMC Member
                    >>
                    >> e-mail: a.benede...@sease.io
                    >>
                    >>
                    >> Sease - Information Retrieval Applied
                    >> Consulting | Training | Open Source
                    >>
                    >> Website: Sease.io
                    >> LinkedIn | Twitter | Youtube | Github
                    >>
                    >>
                    >


                    
---------------------------------------------------------------------
                    To unsubscribe, e-mail:
                    dev-unsubscr...@lucene.apache.org
                    For additional commands, e-mail:
                    dev-h...@lucene.apache.org

Re: [Proposal] Remove max number of dimensions for KNN vectors

Reply via email to