Thank you very much for your feedback!

In a previous post (April 7) you wrote that you could make the 47K ada-002 vectors available, which would be great!

Would it make sense to set up a public GitHub repo, so that others could use the vectors or also contribute additional ones?

Thanks

Michael Wechner


On 12.04.23 at 04:51, Kent Fitch wrote:
I only know some characteristics of the OpenAI ada-002 vectors, although they are very popular as embeddings/text-characterisations, as they allow more accurate/"human meaningful" semantic search results with fewer dimensions than their predecessors. I've evaluated a few different embedding models, including some BERT variants, CLIP ViT-L-14 (with 768 dims, which was quite good), OpenAI's ada-001 (1024 dims) and babbage-001 (2048 dims), and ada-002 is qualitatively the best, although that will certainly change!

In any case, ada-002 vectors have interesting characteristics that I think mean you could confidently create synthetic vectors which would be hard to distinguish from "real" vectors.  I found this from looking at 47K ada-002 vectors generated across a full year (1994) of newspaper articles from the Canberra Times and 200K Wikipedia articles:

- there is no discernible/significant correlation between values in any pair of dimensions

- all but 5 of the 1536 dimensions have an almost identical distribution of values, shown in the central blob on these graphs (they just show a few of these 1531 dimensions with clumped values and the 5 "outlier" dimensions, but all 1531 non-outlier dims are in there). This makes for some easy quantisation from float to byte if you don't want to go the full kmeans/clustering/Lloyds-algorithm approach; a rough sketch of such a scalar quantiser follows after the links below:
https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
- the variance of the value of each dimension is characteristic:
https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
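To make the quantisation point concrete, here is a minimal, illustrative sketch (not the code used for the experiments above): because the ~1531 well-behaved dimensions share a near-identical value distribution, a single symmetric linear scale can map their floats to signed bytes. The clip range of +/-0.15 and the class/method names are assumptions for illustration only; in practice you would derive the range from the observed distributions and handle the 5 outlier dimensions separately.

final class ScalarQuantizer {
  // Assumed symmetric clip range for the well-behaved dimensions; derive it
  // from the observed per-dimension distributions rather than hard-coding it.
  private final float clip;

  ScalarQuantizer(float clip) {
    this.clip = clip;   // e.g. 0.15f, an assumed value for illustration
  }

  byte[] quantize(float[] vector) {
    byte[] out = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      float v = Math.max(-clip, Math.min(clip, vector[i]));  // clamp stray values
      out[i] = (byte) Math.round(v / clip * 127f);           // [-clip, clip] -> [-127, 127]
    }
    return out;
  }

  float[] dequantize(byte[] bytes) {
    float[] out = new float[bytes.length];
    for (int i = 0; i < bytes.length; i++) {
      out[i] = bytes[i] / 127f * clip;
    }
    return out;
  }
}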

This probably represents something significant about how the ada-002 embeddings are created, but I think it also means creating "realistic" values is possible.  I did not use this information when testing recall & performance of Lucene's HNSW implementation on 192m documents; instead, I slightly dithered the values of a "real" set of 47K docs, stored other fields in each doc that referenced the "base" document the dithers were made from, and used different dithering magnitudes so that I could test recall with different neighbour sizes ("M"), construction beam-width and search beam-width.
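As an aside, here is a minimal sketch of that kind of dithering (illustrative only, not the code actually used for the 192M-document test; the Gaussian noise model, the noiseScale parameter and the class name are assumptions):

import java.util.Random;

final class VectorDither {
  private final Random random = new Random(42);  // fixed seed for reproducible synthetic sets

  // Derive a synthetic vector from a "base" vector by adding small Gaussian
  // noise, then renormalising to unit length (ada-002 vectors are unit length).
  // Larger noiseScale values push the synthetic vector further from its base.
  float[] dither(float[] base, float noiseScale) {
    float[] out = new float[base.length];
    double norm = 0;
    for (int i = 0; i < base.length; i++) {
      out[i] = base[i] + (float) (random.nextGaussian() * noiseScale);
      norm += (double) out[i] * out[i];
    }
    float inv = (float) (1.0 / Math.sqrt(norm));
    for (int i = 0; i < out.length; i++) {
      out[i] *= inv;
    }
    return out;
  }
}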

best regards

Kent Fitch




On Wed, Apr 12, 2023 at 5:08 AM Michael Wechner <michael.wech...@wyona.com> wrote:

    I understand what you mean, that it seems to be artificial, but I don't
    understand why this matters for testing the performance and scalability
    of the indexing?

    Let's assume the limit of Lucene were 4 instead of 1024 and there were
    only open source models generating vectors with 4 dimensions, for
    example:

    0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
    0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
    -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
    -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114

    and now I concatenate them into vectors with 8 dimensions:

    0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
    -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114

    and normalize them to length 1.
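    For illustration, here is a minimal sketch of that concatenate-and-normalise step (the class and method names are just placeholders):

    final class VectorConcat {
      // Concatenate two vectors and scale the result to unit length, as described above.
      static float[] concatAndNormalize(float[] a, float[] b) {
        float[] out = new float[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        double norm = 0;
        for (float v : out) {
          norm += (double) v * v;
        }
        float inv = (float) (1.0 / Math.sqrt(norm));
        for (int i = 0; i < out.length; i++) {
          out[i] *= inv;  // the resulting 8-dim vector now has length 1
        }
        return out;
      }
    }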

    Why should this be any different from a model which acts like a black
    box generating vectors with 8 dimensions?




    On 11.04.23 at 19:05, Michael Sokolov wrote:
    >> What exactly do you consider real vector data? Vector data which is based on texts written by humans?
    > We have plenty of text; the problem is coming up with a realistic
    > vector model that requires as many dimensions as people seem to be
    > demanding. As I said above, after surveying huggingface I couldn't
    > find any text-based model using more than 768 dimensions. So far we
    > have some ideas of generating higher-dimensional data by dithering or
    > concatenating existing data, but it seems artificial.
    >
    > On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner
    > <michael.wech...@wyona.com> wrote:
    >> What exactly do you consider real vector data? Vector data
    which is based on texts written by humans?
    >>
    >> I am asking, because I recently attended the following
    presentation by Anastassia Shaitarova (UZH Institute for
    Computational Linguistics,
    https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
    >>
    >> ----
    >>
    >> Can we Identify Machine-Generated Text? An Overview of Current
    Approaches
    >> by Anastassia Shaitarova (UZH Institute for Computational
    Linguistics)
    >>
    >> The detection of machine-generated text has become increasingly
    important due to the prevalence of automated content generation
    and its potential for misuse. In this talk, we will discuss the
    motivation for automatic detection of generated text. We will
    present the currently available methods, including feature-based
    classification as a “first line-of-defense.” We will provide an
    overview of the detection tools that have been made available so
    far and discuss their limitations. Finally, we will reflect on
    some open problems associated with the automatic discrimination of
    generated texts.
    >>
    >> ----
    >>
    >> and her conclusion was that it has become basically impossible to differentiate between text generated by humans and text generated by, for example, ChatGPT.
    >>
    >> Whereas others have a slightly different opinion, see for example
    >>
    >> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
    >>
    >> But I would argue that real-world and synthetic data have become close enough that testing performance and scalability of indexing should be possible with synthetic data.
    >>
    >> I completely agree that we have to base our discussions and
    decisions on scientific methods and that we have to make sure that
    Lucene performs and scales well and that we understand the limits
    and what is going on under the hood.
    >>
    >> Thanks
    >>
    >> Michael W
    >>
    >>
    >>
    >>
    >>
    >> On 11.04.23 at 14:29, Michael McCandless wrote:
    >>
    >> +1 to test on real vector data -- if you test on synthetic data
    you draw synthetic conclusions.
    >>
    >> Can someone post the theoretical performance (CPU and RAM
    required) of HNSW construction?  Do we know/believe our HNSW
    implementation has achieved that theoretical big-O performance? 
    Maybe we have some silly performance bug that's causing it not to?
    >>
    >> As I understand it, HNSW makes the tradeoff of costly
    construction for faster searching, which is typically the right
    tradeoff for search use cases.  We do this in other parts of the
    Lucene index too.
    >>
    >> Lucene will do a logarithmic number of merges over time, i.e.
    each doc will be merged O(log(N)) times in its lifetime in the
    index.  We need to multiply that by the cost of re-building the
    whole HNSW graph on each merge.  BTW, other things in Lucene, like
    BKD/dimensional points, also rebuild the whole data structure on
    each merge, I think?  But, as Rob pointed out, stored fields merging does indeed do some sneaky tricks to avoid excessive block decompress/recompress on each merge.
    >>
    >>> As I understand it, vetoes must have technical merit. I'm not
    sure that this veto rises to "technical merit" on 2 counts:
    >> Actually I think Robert's veto stands on its technical merit
    already.  Robert's takes on technical matters very much resonate with me, even if he is sometimes prickly in how he expresses them ;)
    >>
    >> His point is that we, as a dev community, are not paying enough
    attention to the indexing performance of our KNN algo (HNSW) and
    implementation, and that it is reckless to increase / remove
    limits in that state.  It is indeed a one-way door decision and
    one must confront such decisions with caution, especially for such
    a widely used base infrastructure as Lucene.  We don't even
    advertise today in our javadocs that you need XXX heap if you
    index vectors with dimension Y, fanout X, levels Z, etc.
    >>
    >> RAM used during merging is unaffected by dimensionality, but is
    affected by fanout, because the HNSW graph (not the raw vectors)
    is memory resident, I think? Maybe we could move it off-heap and
    let the OS manage the memory (and still document the RAM
    requirements)?  Maybe merge RAM costs should be accounted for in
    IW's RAM buffer accounting?  It is not today, and there are some
    other things that use non-trivial RAM, e.g. the doc mapping (to
    compress docid space when deletions are reclaimed).
    >>
    >> When we added KNN vector testing to Lucene's nightly
    benchmarks, the indexing time massively increased -- see
    annotations DH and DP here:
    https://home.apache.org/~mikemccand/lucenebench/indexing.html.
    Nightly benchmarks now start at 6 PM and don't finish until ~14.5
    hours later.  Of course, that is using a single thread for
    indexing (on a box that has 128 cores!) so we produce a
    deterministic index every night ...
    >>
    >> Stepping out (meta) a bit ... this discussion is precisely one
    of the awesome benefits of the (informed) veto. It means risky
    changes to the software, as determined by any single informed
    developer on the project, can force a healthy discussion about the
    problem at hand.  Robert is legitimately concerned about a real
    issue and so we should use our creative energies to characterize
    our HNSW implementation's performance, document it clearly for
    users, and uncover ways to improve it.
    >>
    >> Mike McCandless
    >>
    >> http://blog.mikemccandless.com
    >>
    >>
    >> On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti
    <a.benede...@sease.io> wrote:
    >>> I think Gus's points are on target.
    >>>
    >>> I recommend we move this forward in this way:
    >>> We stop any discussion and everyone interested proposes an
    option with a motivation, then we aggregate the options and we
    create a Vote maybe?
    >>>
    >>> I am also on the same page on the fact that a veto should come
    with a clear and reasonable technical merit, which also in my
    opinion has not come yet.
    >>>
    >>> I also apologise if any of my words sounded harsh or like personal attacks; I never meant to do so.
    >>>
    >>> My proposed option:
    >>>
    >>> 1) remove the limit and potentially make it configurable,
    >>> Motivation:
    >>> The system administrator can enforce a limit that its users need to respect, in line with whatever the admin decided is acceptable for them.
    >>> Default can stay the current one.
    >>>
    >>> That's my favourite at the moment, but I agree that potentially in the future this may need to change, as we may optimise the data structures for certain dimensions. I am a big fan of YAGNI (you aren't going to need it), so I am OK with facing a different discussion if that happens in the future.
    >>>
    >>>
    >>>
    >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.h...@gmail.com> wrote:
    >>>> What I see so far:
    >>>>
    >>>> Much positive support for raising the limit
    >>>> Slightly less support for removing it or making it configurable
    >>>> A single veto which argues that an (as yet undefined) performance standard must be met before raising the limit
    >>>> Hot tempers (various) making this discussion difficult
    >>>>
    >>>> As I understand it, vetoes must have technical merit. I'm not
    sure that this veto rises to "technical merit" on 2 counts:
    >>>>
    >>>> No standard for the performance is given so it cannot be
    technically met. Without hard criteria it's a moving target.
    >>>> It appears to encode a valuation of the user's time, and that
    valuation is really up to the user. Some users may consider 2 hours
    useless and not worth it, and others might happily wait 2 hours.
    This is not a technical decision, it's a business decision
    regarding the relative value of the time invested vs the value of
    the result. If I can cure cancer by indexing for a year, that
    might be worth it... (hyperbole of course).
    >>>>
    >>>> Things I would consider to have technical merit that I don't
    hear:
    >>>>
    >>>> Impact on the speed of **other** indexing operations.
    (devaluation of other functionality)
    >>>> Actual scenarios that work when the limit is low and fail
    when the limit is high (new failure on the same data with the
    limit raised).
    >>>>
    >>>> One thing that might or might not have technical merit
    >>>>
    >>>> If someone feels there is a lack of documentation of the
    costs/performance implications of using large vectors, possibly
    including reproducible benchmarks establishing the scaling
    behavior (there seems to be disagreement on O(n) vs O(n^2)).
    >>>>
    >>>> The users *should* know what they are getting into, but if
    the cost is worth it to them, they should be able to pay it
    without forking the project. If this veto causes a fork that's not
    good.
    >>>>
    >>>> On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov
    <msoko...@gmail.com> wrote:
    >>>>> We do have a dataset built from Wikipedia in luceneutil. It
    comes in 100 and 300 dimensional varieties and can easily enough
    generate large numbers of vector documents from the articles data.
    To go higher we could concatenate vectors from that and I believe
    the performance numbers would be plausible.
    >>>>>
    >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss
    <dawid.we...@gmail.com> wrote:
    >>>>>> Can we set up a branch in which the limit is bumped to
    2048, then have
    >>>>>> a realistic, free data set (wikipedia sample or something)
    that has,
    >>>>>> say, 5 million docs and vectors created using public data
    (glove
    >>>>>> pre-trained embeddings or the like)? We then could run
    indexing on the
    >>>>>> same hardware with 512, 1024 and 2048 and see what the
    numbers, limits
    >>>>>> and behavior actually are.
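    For concreteness, here is a rough sketch of what one run of such a benchmark could look like. This is illustrative only: it assumes Lucene 9.5+'s KnnFloatVectorField, unit-length input vectors (recommended for DOT_PRODUCT), and, for the 2048-dim run, a branch with the max-dimension limit raised; loadVectors and the field/path names are placeholders, not real APIs.

    import java.nio.file.Path;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public final class IndexVectorsBench {
      public static void main(String[] args) throws Exception {
        int dims = Integer.parseInt(args[0]);   // e.g. 512, 1024 or 2048
        // Placeholder: read the pre-computed (e.g. GloVe-derived), unit-length
        // vectors for the sample corpus; this is not a Lucene API.
        List<float[]> vectors = loadVectors(dims);
        IndexWriterConfig config = new IndexWriterConfig();
        config.setRAMBufferSizeMB(2000);        // assumed buffer size for the test
        long start = System.nanoTime();
        try (FSDirectory dir = FSDirectory.open(Path.of("bench-" + dims));
             IndexWriter writer = new IndexWriter(dir, config)) {
          for (float[] vector : vectors) {
            Document doc = new Document();
            doc.add(new KnnFloatVectorField("vector", vector, VectorSimilarityFunction.DOT_PRODUCT));
            writer.addDocument(doc);
          }
          writer.forceMerge(1);                 // include merge cost in the measurement
        }
        System.out.println(dims + " dims took " + (System.nanoTime() - start) / 1_000_000_000L + " s");
      }

      private static List<float[]> loadVectors(int dims) {
        throw new UnsupportedOperationException("supply your own vector source");
      }
    }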
    >>>>>>
    >>>>>> I can help in writing this but not until after Easter.
    >>>>>>
    >>>>>>
    >>>>>> Dawid
    >>>>>>
    >>>>>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand
    <jpou...@gmail.com> wrote:
    >>>>>>> As Dawid pointed out earlier on this thread, this is the
    rule for
    >>>>>>> Apache projects: a single -1 vote on a code change is a
    veto and
    >>>>>>> cannot be overridden. Furthermore, Robert is one of the
    people on this
    >>>>>>> project who worked the most on debugging subtle bugs,
    making Lucene
    >>>>>>> more robust and improving our test framework, so I'm
    listening when he
    >>>>>>> voices quality concerns.
    >>>>>>>
    >>>>>>> The argument against removing/raising the limit that
    resonates with me
    >>>>>>> the most is that it is a one-way door. As MikeS
    highlighted earlier on
    >>>>>>> this thread, implementations may want to take advantage of
    the fact
    >>>>>>> that there is a limit at some point too. This is why I
    don't want to
    >>>>>>> remove the limit and would prefer a slight increase, such
    as 2048 as
    >>>>>>> suggested in the original issue, which would enable most
    of the things
    >>>>>>> that users who have been asking about raising the limit
    would like to
    >>>>>>> do.
    >>>>>>>
    >>>>>>> I agree that the merge-time memory usage and slow indexing
    rate are
    >>>>>>> not great. But it's still possible to index multi-million
    vector
    >>>>>>> datasets with a 4GB heap without hitting OOMEs regardless
    of the
    >>>>>>> number of dimensions, and the feedback I'm seeing is that
    many users
    >>>>>>> are still interested in indexing multi-million vector
    datasets despite
    >>>>>>> the slow indexing rate. I wish we could do better, and
    vector indexing
    >>>>>>> is certainly more expert than text indexing, but it still
    is usable in
    >>>>>>> my opinion. I understand how giving Lucene more
    information about
    >>>>>>> vectors prior to indexing (e.g. clustering information as
    Jim pointed
    >>>>>>> out) could help make merging faster and more
    memory-efficient, but I
    >>>>>>> would really like to avoid making it a requirement for
    indexing
    >>>>>>> vectors as it also makes this feature much harder to use.
    >>>>>>>
    >>>>>>> On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
    >>>>>>> <a.benede...@sease.io> wrote:
    >>>>>>>> I am very attentive to listening to opinions, but I am unconvinced here, and I am not sure that a single person's opinion should be allowed to be detrimental to such an important project.
    >>>>>>>>
    >>>>>>>> The limit as far as I know is literally just raising an
    exception.
    >>>>>>>> Removing it won't alter in any way the current
    performance for users in low dimensional space.
    >>>>>>>> Removing it will just enable more users to use Lucene.
    >>>>>>>>
    >>>>>>>> If new users in certain situations will be unhappy with
    the performance, they may contribute improvements.
    >>>>>>>> This is how you make progress.
    >>>>>>>>
    >>>>>>>> If it's a reputation thing, trust me that not allowing
    users to play with high dimensional space will equally damage it.
    >>>>>>>>
    >>>>>>>> To me it's really a no brainer.
    >>>>>>>> Removing the limit and enabling people to use high-dimensional vectors will take minutes.
    >>>>>>>> Improving the HNSW implementation can take months.
    >>>>>>>> Pick one to begin with...
    >>>>>>>>
    >>>>>>>> And there's no one paying me here, no company interest whatsoever; actually, I pay people to contribute. I am just convinced it's a good idea.
    >>>>>>>>
    >>>>>>>>
    >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com>
    wrote:
    >>>>>>>>> I disagree with your categorization. I put in plenty of work and experienced plenty of pain myself, writing tests and fighting these issues, after I saw that, two releases in a row, vector indexing fell over and hit integer overflows etc. on small datasets:
    >>>>>>>>>
    >>>>>>>>> https://github.com/apache/lucene/pull/11905
    >>>>>>>>>
    >>>>>>>>> Attacking me isn't helping the situation.
    >>>>>>>>>
    >>>>>>>>> PS: when I said the "one guy who wrote the code" I
    didn't mean it in
    >>>>>>>>> any kind of demeaning fashion really. I meant to
    describe the current
    >>>>>>>>> state of usability with respect to indexing a few
    million docs with
    >>>>>>>>> high dimensions. You can scroll up the thread and see
    that at least
    >>>>>>>>> one other committer on the project experienced similar
    pain as me.
    >>>>>>>>> Then, think about users who aren't committers trying to
    use the
    >>>>>>>>> functionality!
    >>>>>>>>>
    >>>>>>>>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov
    <msoko...@gmail.com> wrote:
    >>>>>>>>>> What you said about increasing dimensions requiring a bigger RAM buffer on merge is wrong. That's the point I was trying to make. Your concerns about merge costs are not wrong, but your conclusion that we need to limit dimensions is not justified.
    >>>>>>>>>>
    >>>>>>>>>> You complain that HNSW sucks and doesn't scale, but when I show that it scales linearly with dimension you just ignore that and complain about something entirely different.
    >>>>>>>>>>
    >>>>>>>>>> You demand that people run all kinds of tests to prove you wrong, but when they do, you don't listen; you won't put in the work yourself, or you complain that it's too hard.
    >>>>>>>>>>
    >>>>>>>>>> Then you complain about people not meeting you half
    way. Wow
    >>>>>>>>>>
    >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir
    <rcm...@gmail.com> wrote:
    >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
    >>>>>>>>>>> <michael.wech...@wyona.com> wrote:
    >>>>>>>>>>>> What exactly do you consider reasonable?
    >>>>>>>>>>> Let's begin a real discussion by being HONEST about the current status. Please put political correctness or your own company's wishes aside; we know it's not in a good state.
    >>>>>>>>>>>
    >>>>>>>>>>> The current status is that the one guy who wrote the code can set a multi-gigabyte RAM buffer and index a small dataset with 1024 dimensions in HOURS (I didn't ask what hardware).
    >>>>>>>>>>>
    >>>>>>>>>>> My concern is everyone else except the one guy; I want it to be usable. Increasing dimensions just means an even bigger multi-gigabyte RAM buffer and a bigger heap to avoid OOM on merge.
    >>>>>>>>>>> It is also a permanent backwards compatibility
    decision, we have to
    >>>>>>>>>>> support it once we do this and we can't just say
    "oops" and flip it
    >>>>>>>>>>> back.
    >>>>>>>>>>>
    >>>>>>>>>>> It is unclear to me if the multi-gigabyte RAM buffer is really to avoid merges because they are so slow and it would take DAYS otherwise, or if it's to avoid merges so it doesn't hit OOM.
    >>>>>>>>>>> Also, from personal experience, it takes trial and error (meaning experiencing OOM on merge!!!) before you get those heap values correct for your dataset. This usually means starting over, which is frustrating and wastes more time.
    >>>>>>>>>>>
    >>>>>>>>>>> Jim mentioned some ideas about the memory usage in IndexWriter; that seems to me like a good idea. Maybe the multi-gigabyte RAM buffer can be avoided in this way and performance improved by writing bigger segments with Lucene's defaults. But this doesn't mean we can simply ignore the horrors of what happens on merge. Merging needs to scale so that indexing really scales.
    >>>>>>>>>>>
    >>>>>>>>>>> At least it shouldn't spike RAM on trivial data amounts and cause OOM, and it definitely shouldn't burn hours and hours of CPU in O(n^2) fashion when indexing.
    >>>>>>>>>>>
    >>>>>>>>>>>
    >>>>>>>>>>>
    >>>>>>>>>
    >>>>>>>>>
    >>>>>>>
    >>>>>>> --
    >>>>>>> Adrien
    >>>>>>>
    >>>>>>>
    >>>>>>>
    >>>>>>
    >>>>>>
    >>>>
    >>>> --
    >>>> http://www.needhamsoftware.com (work)
    >>>> http://www.the111shift.com (play)
    >>
    >
    >


