Btw, what was the reasoning behind setting the current limit to 1024?

Thanks

Michael

Am 01.04.23 um 14:47 schrieb Michael Sokolov:
I'm also in favor of raising this limit. We do see some datasets with higher than 1024 dims. I also think we need to keep a limit. For example we currently need to keep all the vectors in RAM while indexing and we want to be able to support reasonable numbers of vectors in an index segment. Also we don't know what innovations might come down the road. Maybe someday we want to do product quantization and enforce that (k, m) both fit in a byte -- we wouldn't be able to do that if a vector's dimension were to exceed 32K.
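To make the RAM point concrete, here is a rough, purely illustrative back-of-the-envelope sketch (assuming 4 bytes per float and ignoring HNSW graph links and any other per-document overhead; the vector counts are made up):

    public class VectorRamEstimate {
      public static void main(String[] args) {
        // Illustrative only: suppose a single segment holds 10M float vectors.
        long numVectors = 10_000_000L;
        for (int dims : new int[] {768, 1024, 1536, 2048, 4096}) {
          // Raw vector data: 4 bytes per dimension, per vector.
          long bytes = numVectors * dims * 4L;
          System.out.printf("dims=%4d -> ~%.1f GB of raw vector data%n",
              dims, bytes / (1024.0 * 1024.0 * 1024.0));
        }
      }
    }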

On Fri, Mar 31, 2023 at 11:57 AM Alessandro Benedetti <a.benede...@sease.io> wrote:

    I am also curious what the worst-case scenario would be if we
    removed the constant altogether (so the limit automatically becomes
    Java's Integer.MAX_VALUE).
    i.e.
    right now if you exceed the limit you get:

        if (dimension > ByteVectorValues.MAX_DIMENSIONS) {
          throw new IllegalArgumentException(
              "cannot index vectors with dimension greater than "
                  + ByteVectorValues.MAX_DIMENSIONS);
        }
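
    For reference, a minimal sketch of where a user currently runs into
    that check, assuming the Lucene 9.x field API (KnnFloatVectorField);
    the exact entry point may differ between versions:

        import org.apache.lucene.document.KnnFloatVectorField;
        import org.apache.lucene.index.VectorSimilarityFunction;

        public class DimensionLimitDemo {
          public static void main(String[] args) {
            float[] vector = new float[1536]; // e.g. an OpenAI-sized embedding
            // With the current 1024 limit this throws IllegalArgumentException
            // ("cannot index vectors with dimension greater than 1024")
            // before the document ever reaches an IndexWriter.
            new KnnFloatVectorField("embedding", vector, VectorSimilarityFunction.COSINE);
          }
        }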


    in relation to:

        These limits allow us to
        better tune our data structures, prevent overflows, help ensure we
        have good test coverage, etc.

    I agree 100%, especially about typing things properly and avoiding
    resource waste here and there, but I am not entirely sure this is
    the case for the current implementation, i.e. do we have
    optimizations in place that assume the max dimension is 1024?
    If I missed that (and I likely have), I of course suggest the
    contribution should not just blindly remove the limit, but do it
    appropriately.
    I am not in favor of just doubling it as suggested by some people;
    I would ideally prefer a solution that remains valid for a decent
    amount of time, rather than having to modify it any time someone
    requires a higher limit.

    Cheers
    --------------------------
    *Alessandro Benedetti*
    Director @ Sease Ltd.
    Apache Lucene/Solr Committer
    Apache Solr PMC Member

    e-mail: a.benede...@sease.io

    *Sease* - Information Retrieval Applied
    Consulting | Training | Open Source

    Website: Sease.io <http://sease.io/>
    LinkedIn <https://linkedin.com/company/sease-ltd> | Twitter
    <https://twitter.com/seaseltd> | Youtube
    <https://www.youtube.com/channel/UCDx86ZKLYNpI3gzMercM7BQ> |
    Github <https://github.com/seaseltd>


    On Fri, 31 Mar 2023 at 16:12, Michael Wechner
    <michael.wech...@wyona.com> wrote:

        OpenAI reduced their embedding size to 1536 dimensions

        https://openai.com/blog/new-and-improved-embedding-model

        so 2048 would work :-)

        but other services also provide higher dimensions, sometimes
        with slightly better accuracy

        Thanks

        Michael


        Am 31.03.23 um 14:45 schrieb Adrien Grand:
        > I'm supportive of bumping the limit on the maximum dimension for
        > vectors to something that is above what the majority of
        users need,
        > but I'd like to keep a limit. We have limits for other
        things like the
        > max number of docs per index, the max term length, the max
        number of
        > dimensions of points, etc. and there are a few things that
        we don't
        > have limits on that I wish we had limits on. These limits
        allow us to
        > better tune our data structures, prevent overflows, help
        ensure we
        > have good test coverage, etc.
        >
        > That said, these other limits we have in place are quite
        high. E.g.
        > the 32kB term limit, nobody would ever type a 32kB term in a
        text box.
        > Likewise for the max of 8 dimensions for points: a segment
        cannot
        > possibly have 2 splits per dimension on average if it
        doesn't have
        > 512*2^(8*2)=34M docs, a sizable dataset already, so more
        dimensions
        > than 8 would likely defeat the point of indexing. In
        contrast, our
        > limit on the number of dimensions of vectors seems to be
        under what
        > some users would like, and while I understand the
        performance argument
        > against bumping the limit, it doesn't feel to me like
        something that
        > would be so bad that we need to prevent users from using
        numbers of
        > dimensions in the low thousands, e.g. top-k KNN searches
        would still
        > look at a very small subset of the full dataset.
        >
        > So overall, my vote would be to bump the limit to 2048 as
        suggested by
        > Mayya on the issue that you linked.
        >
        > On Fri, Mar 31, 2023 at 2:38 PM Michael Wechner
        > <michael.wech...@wyona.com> wrote:
        >> Thanks Alessandro for summarizing the discussion below!
        >>
        >> I understand that there is no clear reasoning about what
        the best embedding size is, but I think heuristic approaches
        like the one described at the following link can be helpful
        >>
        >>
        https://datascience.stackexchange.com/questions/51404/word2vec-how-to-choose-the-embedding-size-parameter
        >>
        >> Having said this, we see various embedding services
        providing dimensions higher than 1024, for example
        OpenAI, Cohere and Aleph Alpha.
        >>
        >> And it would be great if we could run benchmarks without
        having to recompile Lucene ourselves.
        >>
        >> Therefore I would suggest either increasing the limit or,
        even better, removing the limit and adding a disclaimer that
        people should be aware of possible crashes etc.
        >>
        >> Thanks
        >>
        >> Michael
        >>
        >>
        >>
        >>
        >> Am 31.03.23 um 11:43 schrieb Alessandro Benedetti:
        >>
        >>
        >> I've been monitoring various discussions on Pull Requests
        about changing the max number of dimensions allowed for Lucene
        HNSW vectors:
        >>
        >> https://github.com/apache/lucene/pull/12191
        >>
        >> https://github.com/apache/lucene/issues/11507
        >>
        >>
        >> I would like to set up a discussion and potentially a vote
        about this.
        >>
        >> I have seen some strong opposition from a few people, but a
        majority in favor of this direction.
        >>
        >>
        >> Motivation
        >>
        >> We were discussing in the Solr slack channel with Ishan
        Chattopadhyaya, Marcus Eagan, and David Smiley about some
        neural search integrations in Solr:
        https://github.com/openai/chatgpt-retrieval-plugin
        >>
        >>
        >> Proposal
        >>
        >> No hard limit at all.
        >>
        >> As in many other Lucene areas, users will be allowed to
        push the system to the limit of their resources and get
        terrible performance or crashes if they want.
        >>
        >>
        >> What we are NOT discussing
        >>
        >> - Quality and scalability of the HNSW algorithm
        >>
        >> - dimensionality reduction
        >>
        >> - strategies to fit in an arbitrary self-imposed limit
        >>
        >>
        >> Benefits
        >>
        >> - users can use the models they want to generate vectors
        >>
        >> - removal of an arbitrary limit that blocks some integrations
        >>
        >>
        >> Cons
        >>
        >>   - if you go for vectors with high dimensions, there's no
        guarantee you get acceptable performance for your use case
        >>
        >>
        >>
        >> I want to keep it simple: right now, in many Lucene areas,
        you can push the system to unacceptable performance or crashes.
        >>
        >> For example, we don't limit the number of docs per index to
        an arbitrary maximum of N; you push as many docs as you like, and
        if they are too many for your system, you get terrible
        performance/crashes/whatever.
        >>
        >>
        >> Limits caused by primitive Java types will stay there
        behind the scenes, and that's acceptable, but I would prefer
        not to have arbitrary hard-coded ones that may limit the
        software's usability and integration, which is extremely
        important for a library.
        >>
        >>
        >> I strongly encourage people to add benefits and cons that
        I missed (I am sure I missed some of them, but wanted to keep
        it simple).
        >>
        >>
        >> Cheers
        >>
        >> --------------------------
        >> Alessandro Benedetti
        >> Director @ Sease Ltd.
        >> Apache Lucene/Solr Committer
        >> Apache Solr PMC Member
        >>
        >> e-mail: a.benede...@sease.io
        >>
        >>
        >> Sease - Information Retrieval Applied
        >> Consulting | Training | Open Source
        >>
        >> Website: Sease.io
        >> LinkedIn | Twitter | Youtube | Github
        >>
        >>
        >


        ---------------------------------------------------------------------
        To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
        For additional commands, e-mail: dev-h...@lucene.apache.org
