Hi Kent

Great, thank you very much!

Will download it later today :-)

All the best

Michael

On 13.04.23 at 01:35, Kent Fitch wrote:
Hi Michael (and anyone else who wants just over 240K "real world" ada-002 vectors of dimension 1536),
you are welcome to retrieve a tar.gz file which contains:
- 47K embeddings of Canberra Times news article text from 1994
- 38K embeddings of the first paragraphs of wikipedia articles about organisations
- 156.6K embeddings of the first paragraphs of wikipedia articles about people

https://drive.google.com/file/d/13JP_5u7E8oZO6vRg0ekaTgBDQOaj-W00/view?usp=sharing

The file is about 1.7GB and will expand to about 4.4GB. It will be accessible for at least a week, and I hope you don't hit any google drive download limits trying to retrieve it.

The embeddings were generated using my openAI account and you are welcome to use them for any purpose you like.

best wishes,

Kent Fitch

On Wed, Apr 12, 2023 at 4:37 PM Michael Wechner <michael.wech...@wyona.com> wrote:

    thank you very much for your feedback!

    In a previous post (April 7) you wrote that you could make available
    the 47K ada-002 vectors, which would be great!

    Would it make sense to set up a public GitHub repo, so that others
    could use or also contribute vectors?

    Thanks

    Michael Wechner


    On 12.04.23 at 04:51, Kent Fitch wrote:
    I only know some characteristics of the openAI ada-002 vectors,
    although they are very popular as embeddings/text-characterisations
    because they allow more accurate/"human meaningful" semantic search
    results with fewer dimensions than their predecessors. I've
    evaluated a few different embedding models, including some BERT
    variants, CLIP ViT-L-14 (with 768 dims, which was quite good),
    openAI's ada-001 (1024 dims) and babbage-001 (2048 dims), and
    ada-002 is qualitatively the best, although that will certainly
    change!

    In any case, ada-002 vectors have interesting characteristics
    that I think mean you could confidently create synthetic vectors
    which would be hard to distinguish from "real" vectors.  I found
    this from looking at 47K ada-002 vectors generated across a full
    year (1994) of newspaper articles from the Canberra Times and
    200K wikipedia articles:
    - there is no discernible/significant correlation between values
    in any pair of dimensions
    - all but 5 of the 1536 dimensions have an almost identical
    distribution of values, shown in the central blob on these graphs
    (the graphs only plot a few of the 1531 clumped dimensions and the
    5 "outlier" dimensions, but all 1531 non-outlier dims fall in that
    blob, which makes for some easy quantisation from float to byte,
    as in the sketch after these links, if you don't want to go the
    full kmeans/clustering/Lloyd's-algorithm approach):
    https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
    https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
    https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
    - the variance of the value of each dimension is characteristic:
    https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
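
    A minimal hypothetical sketch of such a simple linear float-to-byte
    quantisation (method names are illustrative), assuming a known
    per-dimension [min, max] range taken from distributions like those
    in the spreadsheets above:

    static byte quantize(float value, float min, float max) {
        // clamp to the observed range of this dimension (assumes max > min)
        float clamped = Math.max(min, Math.min(max, value));
        // map [min, max] linearly onto the 256 byte values [-128, 127]
        return (byte) (Math.round((clamped - min) / (max - min) * 255f) - 128);
    }

    static float dequantize(byte q, float min, float max) {
        // inverse mapping back to an approximate float value
        return (q + 128) / 255f * (max - min) + min;
    }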

    This probably represents something significant about how the
    ada-002 embeddings are created, but I think it also means that
    creating "realistic" values is possible.  I did not use this
    information when testing recall & performance of Lucene's HNSW
    implementation on 192m documents.  Instead, I slightly dithered the
    values of a "real" set of 47K docs, stored other fields in each doc
    referencing the "base" document the dithers were made from, and
    used different dithering magnitudes so that I could test recall
    with different neighbour sizes ("M"), construction beam-widths and
    search beam-widths.
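
    A minimal hypothetical sketch of that dithering idea (method name
    and parameters are illustrative): perturb each value of a "real"
    vector by a small random amount and re-normalise, so the synthetic
    vector stays close to, but distinct from, its base vector.

    static float[] dither(float[] base, float magnitude, java.util.Random rnd) {
        float[] out = new float[base.length];
        double norm = 0;
        for (int i = 0; i < base.length; i++) {
            // add uniform noise in [-magnitude, +magnitude]
            out[i] = base[i] + (float) ((rnd.nextDouble() * 2 - 1) * magnitude);
            norm += out[i] * out[i];
        }
        // re-normalise to unit length (assumes the result is non-zero)
        norm = Math.sqrt(norm);
        for (int i = 0; i < out.length; i++) {
            out[i] /= (float) norm;
        }
        return out;
    }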

    best regards

    Kent Fitch




    On Wed, Apr 12, 2023 at 5:08 AM Michael Wechner
    <michael.wech...@wyona.com> wrote:

        I understand what you mean, that it seems to be artificial,
        but I don't understand why this matters for testing performance
        and scalability of the indexing.

        Let's assume Lucene's limit were 4 instead of 1024 and there
        were only open source models generating vectors with 4
        dimensions, for example

        0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814

        0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844

        -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106

        -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114

        and now I concatenate them into vectors with 8 dimensions

        0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844

        -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114

        and normalize them to length 1.

        Why should this be any different from a model which acts like a
        black box generating vectors with 8 dimensions?
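
        For illustration, a minimal hypothetical sketch of that
        construction (the method name is illustrative): concatenate two
        4-dimensional vectors and normalize the result to length 1.

        static float[] concatAndNormalize(float[] a, float[] b) {
            // concatenate a and b into one longer vector
            float[] out = new float[a.length + b.length];
            System.arraycopy(a, 0, out, 0, a.length);
            System.arraycopy(b, 0, out, a.length, b.length);
            // normalize to length 1 (assumes the result is non-zero)
            double norm = 0;
            for (float v : out) {
                norm += v * v;
            }
            norm = Math.sqrt(norm);
            for (int i = 0; i < out.length; i++) {
                out[i] /= (float) norm;
            }
            return out;
        }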




        On 11.04.23 at 19:05, Michael Sokolov wrote:
        >> What exactly do you consider real vector data? Vector data
        which is based on texts written by humans?
        > We have plenty of text; the problem is coming up with a
        realistic
        > vector model that requires as many dimensions as people
        seem to be
        > demanding. As I said above, after surveying huggingface I
        couldn't
        > find any text-based model using more than 768 dimensions.
        So far we
        > have some ideas of generating higher-dimensional data by
        dithering or
        > concatenating existing data, but it seems artificial.
        >
        > On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner
        > <michael.wech...@wyona.com> wrote:
        >> What exactly do you consider real vector data? Vector data
        which is based on texts written by humans?
        >>
        >> I am asking, because I recently attended the following
        presentation by Anastassia Shaitarova (UZH Institute for
        Computational Linguistics,
        https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
        >>
        >> ----
        >>
        >> Can we Identify Machine-Generated Text? An Overview of
        Current Approaches
        >> by Anastassia Shaitarova (UZH Institute for Computational
        Linguistics)
        >>
        >> The detection of machine-generated text has become
        increasingly important due to the prevalence of automated
        content generation and its potential for misuse. In this
        talk, we will discuss the motivation for automatic detection
        of generated text. We will present the currently available
        methods, including feature-based classification as a “first
        line-of-defense.” We will provide an overview of the
        detection tools that have been made available so far and
        discuss their limitations. Finally, we will reflect on some
        open problems associated with the automatic discrimination of
        generated texts.
        >>
        >> ----
        >>
        >> and her conclusion was that it has become basically
        impossible to differentiate between text generated by humans
        and text generated by for example ChatGPT.
        >>
        >> Whereas others have a slightly different opinion, see for
        example
        >>
        >>
        https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
        >>
        >> But I would argue that real world and synthetic have
        become close enough that testing performance and scalability
        of indexing should be possible with synthetic data.
        >>
        >> I completely agree that we have to base our discussions
        and decisions on scientific methods and that we have to make
        sure that Lucene performs and scales well and that we
        understand the limits and what is going on under the hood.
        >>
        >> Thanks
        >>
        >> Michael W
        >>
        >>
        >>
        >>
        >>
        >> On 11.04.23 at 14:29, Michael McCandless wrote:
        >>
        >> +1 to test on real vector data -- if you test on synthetic
        data you draw synthetic conclusions.
        >>
        >> Can someone post the theoretical performance (CPU and RAM
        required) of HNSW construction?  Do we know/believe our HNSW
        implementation has achieved that theoretical big-O
        performance?  Maybe we have some silly performance bug that's
        causing it not to?
        >>
        >> As I understand it, HNSW makes the tradeoff of costly
        construction for faster searching, which is typically the
        right tradeoff for search use cases.  We do this in other
        parts of the Lucene index too.
        >>
        >> Lucene will do a logarithmic number of merges over time,
        i.e. each doc will be merged O(log(N)) times in its lifetime
        in the index.  We need to multiply that by the cost of
        re-building the whole HNSW graph on each merge.  BTW, other
        things in Lucene, like BKD/dimensional points, also rebuild
        the whole data structure on each merge, I think?  But, as Rob
        pointed out, stored fields merging does indeed do some sneaky
        tricks to avoid excessive block decompress/recompress on each
        merge.
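
        As a rough, hypothetical back-of-envelope sketch of the total
        cost implied above (parameter names are illustrative; it assumes
        each doc is re-inserted into a rebuilt HNSW graph on each of its
        O(log(N)) merges):

        static double estimatedTotalGraphBuildCost(long numDocs, int mergeFactor,
                                                   double costPerDocInsert) {
            // each doc takes part in roughly log_mergeFactor(numDocs) merges
            double mergesPerDoc = Math.log((double) numDocs) / Math.log(mergeFactor);
            // on each of those merges the doc is re-inserted into the rebuilt graph
            return numDocs * mergesPerDoc * costPerDocInsert;
        }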
        >>
        >>> As I understand it, vetoes must have technical merit. I'm
        not sure that this veto rises to "technical merit" on 2 counts:
        >> Actually I think Robert's veto stands on its technical
        merit already.  Robert's take on technical matters very much
        resonates with me, even if he is sometimes prickly in how he
        expresses it ;)
        >>
        >> His point is that we, as a dev community, are not paying
        enough attention to the indexing performance of our KNN algo
        (HNSW) and implementation, and that it is reckless to
        increase / remove limits in that state.  It is indeed a
        one-way door decision and one must confront such decisions
        with caution, especially for such a widely used base
        infrastructure as Lucene.  We don't even advertise today in
        our javadocs that you need XXX heap if you index vectors with
        dimension Y, fanout X, levels Z, etc.
        >>
        >> RAM used during merging is unaffected by dimensionality,
        but is affected by fanout, because the HNSW graph (not the
        raw vectors) is memory resident, I think?  Maybe we could
        move it off-heap and let the OS manage the memory (and still
        document the RAM requirements)?  Maybe merge RAM costs should
        be accounted for in IW's RAM buffer accounting?  It is not
        today, and there are some other things that use non-trivial
        RAM, e.g. the doc mapping (to compress docid space when
        deletions are reclaimed).
        >>
        >> When we added KNN vector testing to Lucene's nightly
        benchmarks, the indexing time massively increased -- see
        annotations DH and DP here:
        https://home.apache.org/~mikemccand/lucenebench/indexing.html.
        Nightly benchmarks now start at 6 PM and don't finish until
        ~14.5 hours later.  Of course, that is using a single thread
        for indexing (on a box that has 128 cores!) so we produce a
        deterministic index every night ...
        >>
        >> Stepping out (meta) a bit ... this discussion is precisely
        one of the awesome benefits of the (informed) veto.  It means
        risky changes to the software, as determined by any single
        informed developer on the project, can force a healthy
        discussion about the problem at hand.  Robert is legitimately
        concerned about a real issue and so we should use our
        creative energies to characterize our HNSW implementation's
        performance, document it clearly for users, and uncover ways
        to improve it.
        >>
        >> Mike McCandless
        >>
        >> http://blog.mikemccandless.com
        >>
        >>
        >> On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti
        <a.benede...@sease.io> wrote:
        >>> I think Gus points are on target.
        >>>
        >>> I recommend we move this forward in this way:
        >>> We stop any discussion and everyone interested proposes
        an option with a motivation, then we aggregate the options
        and we create a Vote maybe?
        >>>
        >>> I am also on the same page on the fact that a veto should
        come with a clear and reasonable technical merit, which also
        in my opinion has not come yet.
        >>>
        >>> I also apologise if any of my words sounded harsh or like
        personal attacks; I never meant them that way.
        >>>
        >>> My proposed option:
        >>>
        >>> 1) remove the limit and potentially make it configurable,
        >>> Motivation:
        >>> The system administrator can enforce a limit that their users
        need to respect, in line with whatever the admin decided to be
        acceptable for them.
        >>> Default can stay the current one.
        >>>
        >>> That's my favourite at the moment, but I agree that this may
        need to change in the future, as we may optimise the data
        structures for certain dimensions. I am a big fan of YAGNI (you
        aren't going to need it), so I am OK with facing a different
        discussion if that happens in the future.
        >>>
        >>>
        >>>
        >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.h...@gmail.com>
        wrote:
        >>>> What I see so far:
        >>>>
        >>>> Much positive support for raising the limit
        >>>> Slightly less support for removing it or making it
        configurable
        >>>> A single veto which argues that an (as yet undefined)
        performance standard must be met before raising the limit
        >>>> Hot tempers (various) making this discussion difficult
        >>>>
        >>>> As I understand it, vetoes must have technical merit.
        I'm not sure that this veto rises to "technical merit" on 2
        counts:
        >>>>
        >>>> No standard for the performance is given so it cannot be
        technically met. Without hard criteria it's a moving target.
        >>>> It appears to encode a valuation of the user's time, and
        that valuation is really up to the user. Some users may
        consider 2 hours useless and not worth it, and others might
        happily wait 2 hours. This is not a technical decision, it's
        a business decision regarding the relative value of the time
        invested vs the value of the result. If I can cure cancer by
        indexing for a year, that might be worth it... (hyperbole of
        course).
        >>>>
        >>>> Things I would consider to have technical merit that I
        don't hear:
        >>>>
        >>>> Impact on the speed of **other** indexing operations.
        (devaluation of other functionality)
        >>>> Actual scenarios that work when the limit is low and
        fail when the limit is high (new failure on the same data
        with the limit raised).
        >>>>
        >>>> One thing that might or might not have technical merit
        >>>>
        >>>> If someone feels there is a lack of documentation of the
        costs/performance implications of using large vectors,
        possibly including reproducible benchmarks establishing the
        scaling behavior (there seems to be disagreement on O(n) vs
        O(n^2)).
        >>>>
        >>>> The users *should* know what they are getting into, but
        if the cost is worth it to them, they should be able to pay
        it without forking the project. If this veto causes a fork
        that's not good.
        >>>>
        >>>> On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov
        <msoko...@gmail.com> wrote:
        >>>>> We do have a dataset built from Wikipedia in
        luceneutil. It comes in 100 and 300 dimensional varieties and
        can easily enough generate large numbers of vector documents
        from the articles data. To go higher we could concatenate
        vectors from that and I believe the performance numbers would
        be plausible.
        >>>>>
        >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss
        <dawid.we...@gmail.com> wrote:
        >>>>>> Can we set up a branch in which the limit is bumped to
        2048, then have
        >>>>>> a realistic, free data set (wikipedia sample or
        something) that has,
        >>>>>> say, 5 million docs and vectors created using public
        data (glove
        >>>>>> pre-trained embeddings or the like)? We then could run
        indexing on the
        >>>>>> same hardware with 512, 1024 and 2048 and see what the
        numbers, limits
        >>>>>> and behavior actually are.
        >>>>>>
        >>>>>> I can help in writing this but not until after Easter.
        >>>>>>
        >>>>>>
        >>>>>> Dawid
        >>>>>>
        >>>>>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand
        <jpou...@gmail.com> wrote:
        >>>>>>> As Dawid pointed out earlier on this thread, this is
        the rule for
        >>>>>>> Apache projects: a single -1 vote on a code change is
        a veto and
        >>>>>>> cannot be overridden. Furthermore, Robert is one of
        the people on this
        >>>>>>> project who worked the most on debugging subtle bugs,
        making Lucene
        >>>>>>> more robust and improving our test framework, so I'm
        listening when he
        >>>>>>> voices quality concerns.
        >>>>>>>
        >>>>>>> The argument against removing/raising the limit that
        resonates with me
        >>>>>>> the most is that it is a one-way door. As MikeS
        highlighted earlier on
        >>>>>>> this thread, implementations may want to take
        advantage of the fact
        >>>>>>> that there is a limit at some point too. This is why
        I don't want to
        >>>>>>> remove the limit and would prefer a slight increase,
        such as 2048 as
        >>>>>>> suggested in the original issue, which would enable
        most of the things
        >>>>>>> that users who have been asking about raising the
        limit would like to
        >>>>>>> do.
        >>>>>>>
        >>>>>>> I agree that the merge-time memory usage and slow
        indexing rate are
        >>>>>>> not great. But it's still possible to index
        multi-million vector
        >>>>>>> datasets with a 4GB heap without hitting OOMEs
        regardless of the
        >>>>>>> number of dimensions, and the feedback I'm seeing is
        that many users
        >>>>>>> are still interested in indexing multi-million vector
        datasets despite
        >>>>>>> the slow indexing rate. I wish we could do better,
        and vector indexing
        >>>>>>> is certainly more expert than text indexing, but it
        still is usable in
        >>>>>>> my opinion. I understand how giving Lucene more
        information about
        >>>>>>> vectors prior to indexing (e.g. clustering
        information as Jim pointed
        >>>>>>> out) could help make merging faster and more
        memory-efficient, but I
        >>>>>>> would really like to avoid making it a requirement
        for indexing
        >>>>>>> vectors as it also makes this feature much harder to use.
        >>>>>>>
        >>>>>>> On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
        >>>>>>> <a.benede...@sease.io> wrote:
        >>>>>>>> I am very attentive to listening to opinions, but I am
        unconvinced here, and I am not sure that a single person's
        opinion should be allowed to be detrimental to such an
        important project.
        >>>>>>>>
        >>>>>>>> The limit as far as I know is literally just raising
        an exception.
        >>>>>>>> Removing it won't alter in any way the current
        performance for users in low dimensional space.
        >>>>>>>> Removing it will just enable more users to use Lucene.
        >>>>>>>>
        >>>>>>>> If new users in certain situations will be unhappy
        with the performance, they may contribute improvements.
        >>>>>>>> This is how you make progress.
        >>>>>>>>
        >>>>>>>> If it's a reputation thing, trust me that not
        allowing users to play with high dimensional space will
        equally damage it.
        >>>>>>>>
        >>>>>>>> To me it's really a no brainer.
        >>>>>>>> Removing the limit and enabling people to use high
        dimensional vectors will take minutes.
        >>>>>>>> Improving the hnsw implementation can take months.
        >>>>>>>> Pick one to begin with...
        >>>>>>>>
        >>>>>>>> And there's no-one paying me here, no company
        interest whatsoever, actually I pay people to contribute, I
        am just convinced it's a good idea.
        >>>>>>>>
        >>>>>>>>
        >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir,
        <rcm...@gmail.com> wrote:
        >>>>>>>>> I disagree with your categorization. I put in
        plenty of work and
        >>>>>>>>> experienced plenty of pain myself, writing tests
        and fighting these
        >>>>>>>>> issues, after I saw that, two releases in a row,
        vector indexing fell
        >>>>>>>>> over and hit integer overflows etc on small datasets:
        >>>>>>>>>
        >>>>>>>>> https://github.com/apache/lucene/pull/11905
        >>>>>>>>>
        >>>>>>>>> Attacking me isn't helping the situation.
        >>>>>>>>>
        >>>>>>>>> PS: when I said the "one guy who wrote the code" I
        didn't mean it in
        >>>>>>>>> any kind of demeaning fashion really. I meant to
        describe the current
        >>>>>>>>> state of usability with respect to indexing a few
        million docs with
        >>>>>>>>> high dimensions. You can scroll up the thread and
        see that at least
        >>>>>>>>> one other committer on the project experienced
        similar pain as me.
        >>>>>>>>> Then, think about users who aren't committers
        trying to use the
        >>>>>>>>> functionality!
        >>>>>>>>>
        >>>>>>>>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov
        <msoko...@gmail.com> wrote:
        >>>>>>>>>> What you said about increasing dimensions
        requiring a bigger ram buffer on merge is wrong. That's the
        point I was trying to make. Your concerns about merge costs
        are not wrong, but your conclusion that we need to limit
        dimensions is not justified.
        >>>>>>>>>>
        >>>>>>>>>> You complain that hnsw sucks and doesn't scale, but
        when I show it scales linearly with dimension you just ignore
        that and complain about something entirely different.
        >>>>>>>>>>
        >>>>>>>>>> You demand that people run all kinds of tests to
        prove you wrong, but when they do, you don't listen; you won't
        put in the work yourself, or you complain that it's too hard.
        >>>>>>>>>>
        >>>>>>>>>> Then you complain about people not meeting you
        half way. Wow
        >>>>>>>>>>
        >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir
        <rcm...@gmail.com> wrote:
        >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
        >>>>>>>>>>> <michael.wech...@wyona.com> wrote:
        >>>>>>>>>>>> What exactly do you consider reasonable?
        >>>>>>>>>>> Let's begin a real discussion by being HONEST
        about the current
        >>>>>>>>>>> status. Please put political correctness or your
        own company's wishes
        >>>>>>>>>>> aside; we know it's not in a good state.
        >>>>>>>>>>>
        >>>>>>>>>>> Current status is the one guy who wrote the code
        can set a
        >>>>>>>>>>> multi-gigabyte ram buffer and index a small
        dataset with 1024
        >>>>>>>>>>> dimensions in HOURS (I didn't ask what hardware).
        >>>>>>>>>>>
        >>>>>>>>>>> My concern is everyone else except the one guy;
        I want it to be
        >>>>>>>>>>> usable. Increasing dimensions just means an even
        bigger multi-gigabyte
        >>>>>>>>>>> ram buffer and a bigger heap to avoid OOM on merge.
        >>>>>>>>>>> It is also a permanent backwards compatibility
        decision, we have to
        >>>>>>>>>>> support it once we do this and we can't just say
        "oops" and flip it
        >>>>>>>>>>> back.
        >>>>>>>>>>>
        >>>>>>>>>>> It is unclear to me if the multi-gigabyte ram
        buffer is really to
        >>>>>>>>>>> avoid merges because they are so slow and it
        would be DAYS otherwise,
        >>>>>>>>>>> or if it's to avoid merges so it doesn't hit OOM.
        >>>>>>>>>>> Also from personal experience, it takes trial and
        error (meaning
        >>>>>>>>>>> experiencing OOM on merge!!!) before you get
        those heap values correct
        >>>>>>>>>>> for your dataset. This usually means starting
        over, which is
        >>>>>>>>>>> frustrating and wastes more time.
        >>>>>>>>>>>
        >>>>>>>>>>> Jim mentioned some ideas about the memory usage
        in IndexWriter, which seem
        >>>>>>>>>>> like a good idea to me. Maybe the
        multi-gigabyte ram buffer can be
        >>>>>>>>>>> avoided in this way and performance improved by
        writing bigger
        >>>>>>>>>>> segments with lucene's defaults. But this doesn't
        mean we can simply
        >>>>>>>>>>> ignore the horrors of what happens on merge.
        Merging needs to scale so
        >>>>>>>>>>> that indexing really scales.
        >>>>>>>>>>>
        >>>>>>>>>>> At least it shouldn't spike RAM on trivial data
        amounts and cause OOM,
        >>>>>>>>>>> and definitely it shouldn't burn hours and hours
        of CPU in O(n^2)
        >>>>>>>>>>> fashion when indexing.
        >>>>>>>>>>>
        >>>>>>>>>>>
        >>>>>>>>>>>
        >>>>>>>>>
        >>>>>>>>>
        >>>>>>>
        >>>>>>> --
        >>>>>>> Adrien
        >>>>>>>
        >>>>>>>
        >>>>>>>
        >>>>>>
        >>>>>>
        >>>>
        >>>> --
        >>>> http://www.needhamsoftware.com (work)
        >>>> http://www.the111shift.com (play)
        >>
        >
        >



