I only know some characteristics of the OpenAI ada-002 vectors, although they are very popular as embeddings/text characterisations because they allow more accurate/"human meaningful" semantic search results with fewer dimensions than their predecessors. I've evaluated a few different embedding models, including some BERT variants, CLIP ViT-L-14 (768 dims, which was quite good), OpenAI's ada-001 (1024 dims) and babbage-001 (2048 dims), and ada-002 is qualitatively the best, although that will certainly change!
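For context on how these embedding vectors get used for semantic search, here is a minimal sketch (Python/numpy, with tiny made-up 4-dim vectors standing in for the 1536-dim ada-002 output); the ranking is just cosine similarity, which for unit-length vectors (ada-002 vectors appear to already be unit length) is a dot product:

    import numpy as np

    # tiny made-up stand-ins for embedding vectors (real ada-002 vectors have 1536 dims)
    doc_vectors = np.array([
        [0.1, 0.3, -0.2, 0.9],
        [0.8, -0.1, 0.4, 0.2],
        [0.0, 0.7, 0.6, -0.1],
    ])
    query_vector = np.array([0.2, 0.4, -0.1, 0.8])

    # normalise so that the dot product equals cosine similarity
    doc_unit = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    query_unit = query_vector / np.linalg.norm(query_vector)

    scores = doc_unit @ query_unit        # one cosine similarity per document
    best_first = np.argsort(-scores)      # document indices, most similar first
    print(best_first, scores[best_first])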
In any case, ada-002 vectors have interesting characteristics that I think mean you could confidently create synthetic vectors which would be hard to distinguish from "real" vectors. I found this from looking at 47K ada-002 vectors generated across a full year (1994) of newspaper articles from the Canberra Times and 200K Wikipedia articles:

- there is no discernible/significant correlation between values in any pair of dimensions
- all but 5 of the 1536 dimensions have an almost identical distribution of values, shown in the central blob on these graphs (they only plot a few of the 1531 dimensions with clumped values plus the 5 "outlier" dimensions, but all 1531 non-outlier dims behave like this, which makes for some easy quantisation from float to byte if you don't want to go the full k-means/clustering/Lloyd's-algorithm approach):
  https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
  https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
  https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
- the variance of the value of each dimension is characteristic:
  https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228

This probably represents something significant about how the ada-002 embeddings are created, but I think it also means creating "realistic" values is possible. I did not use this information when testing recall & performance of Lucene's HNSW implementation on 192m documents; instead, I slightly dithered the values of a "real" set of 47K docs, stored other fields in each doc referencing the "base" document the dithers were made from, and used different dithering magnitudes so that I could test recall with different neighbour counts ("M"), construction beam widths and search beam widths.

best regards

Kent Fitch

On Wed, Apr 12, 2023 at 5:08 AM Michael Wechner <michael.wech...@wyona.com> wrote: > I understand what you mean that it seems to be artificial, but I don't > understand why this matters to test performance and scalability of the > indexing? > > Let's assume the limit of Lucene would be 4 instead of 1024 and there > are only open source models generating vectors with 4 dimensions, for > example > > > 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814 > > > 0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844 > > > -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106 > > > -0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114 > > and now I concatenate them to vectors with 8 dimensions > > > > 0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844 > > > -0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114 > > and normalize them to length 1. > > Why should this be any different to a model which is acting like a black > box generating vectors with 8 dimensions? > > > > > Am 11.04.23 um 19:05 schrieb Michael Sokolov: > >> What exactly do you consider real vector data? Vector data which is > based on texts written by humans?
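(To illustrate the dithering and float-to-byte quantisation I described above, here's a rough numpy sketch. The input file name and the dither magnitude are made-up placeholders; in practice the per-dimension ranges came from the observed distributions in the spreadsheets linked above, and the 5 "outlier" dimensions would need their own ranges.)

    import numpy as np

    rng = np.random.default_rng(42)
    base = np.load("ada002_vectors_47k.npy")   # hypothetical file: (47000, 1536) float32, unit length

    # synthetic variants: add a small gaussian dither, then re-normalise to unit length
    dither_magnitude = 0.002                   # made-up; vary this to control distance from the base doc
    synthetic = base + rng.normal(0.0, dither_magnitude, size=base.shape)
    synthetic /= np.linalg.norm(synthetic, axis=1, keepdims=True)

    # crude per-dimension linear quantisation to a signed byte (instead of k-means/Lloyd's);
    # new vectors falling outside the observed range would need clipping first
    lo = base.min(axis=0)
    hi = base.max(axis=0)
    scaled = (base - lo) / (hi - lo)           # -> [0, 1] per dimension
    as_bytes = np.round(scaled * 255.0 - 128.0).astype(np.int8)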
> > We have plenty of text; the problem is coming up with a realistic > > vector model that requires as many dimensions as people seem to be > > demanding. As I said above, after surveying huggingface I couldn't > > find any text-based model using more than 768 dimensions. So far we > > have some ideas of generating higher-dimensional data by dithering or > > concatenating existing data, but it seems artificial. > > > > On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner > > <michael.wech...@wyona.com> wrote: > >> What exactly do you consider real vector data? Vector data which is > based on texts written by humans? > >> > >> I am asking, because I recently attended the following presentation by > Anastassia Shaitarova (UZH Institute for Computational Linguistics, > https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html) > >> > >> ---- > >> > >> Can we Identify Machine-Generated Text? An Overview of Current > Approaches > >> by Anastassia Shaitarova (UZH Institute for Computational Linguistics) > >> > >> The detection of machine-generated text has become increasingly > important due to the prevalence of automated content generation and its > potential for misuse. In this talk, we will discuss the motivation for > automatic detection of generated text. We will present the currently > available methods, including feature-based classification as a “first > line-of-defense.” We will provide an overview of the detection tools that > have been made available so far and discuss their limitations. Finally, we > will reflect on some open problems associated with the automatic > discrimination of generated texts. > >> > >> ---- > >> > >> and her conclusion was that it has become basically impossible to > differentiate between text generated by humans and text generated by for > example ChatGPT. > >> > >> Whereas others have a slightly different opinion, see for example > >> > >> https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/ > >> > >> But I would argue that real world and synthetic have become close > enough that testing performance and scalability of indexing should be > possible with synthetic data. > >> > >> I completely agree that we have to base our discussions and decisions > on scientific methods and that we have to make sure that Lucene performs > and scales well and that we understand the limits and what is going on > under the hood. > >> > >> Thanks > >> > >> Michael W > >> > >> > >> > >> > >> > >> Am 11.04.23 um 14:29 schrieb Michael McCandless: > >> > >> +1 to test on real vector data -- if you test on synthetic data you > draw synthetic conclusions. > >> > >> Can someone post the theoretical performance (CPU and RAM required) of > HNSW construction? Do we know/believe our HNSW implementation has achieved > that theoretical big-O performance? Maybe we have some silly performance > bug that's causing it not to? > >> > >> As I understand it, HNSW makes the tradeoff of costly construction for > faster searching, which is typically the right tradeoff for search use > cases. We do this in other parts of the Lucene index too. > >> > >> Lucene will do a logarithmic number of merges over time, i.e. each doc > will be merged O(log(N)) times in its lifetime in the index. We need to > multiply that by the cost of re-building the whole HNSW graph on each > merge. BTW, other things in Lucene, like BKD/dimensional points, also > rebuild the whole data structure on each merge, I think? 
But, as Rob > pointed out, stored fields merging do indeed do some sneaky tricks to avoid > excessive block decompress/recompress on each merge. > >> > >>> As I understand it, vetoes must have technical merit. I'm not sure > that this veto rises to "technical merit" on 2 counts: > >> Actually I think Robert's veto stands on its technical merit already. > Robert's take on technical matters very much resonate with me, even if he > is sometimes prickly in how he expresses them ;) > >> > >> His point is that we, as a dev community, are not paying enough > attention to the indexing performance of our KNN algo (HNSW) and > implementation, and that it is reckless to increase / remove limits in that > state. It is indeed a one-way door decision and one must confront such > decisions with caution, especially for such a widely used base > infrastructure as Lucene. We don't even advertise today in our javadocs > that you need XXX heap if you index vectors with dimension Y, fanout X, > levels Z, etc. > >> > >> RAM used during merging is unaffected by dimensionality, but is > affected by fanout, because the HNSW graph (not the raw vectors) is memory > resident, I think? Maybe we could move it off-heap and let the OS manage > the memory (and still document the RAM requirements)? Maybe merge RAM > costs should be accounted for in IW's RAM buffer accounting? It is not > today, and there are some other things that use non-trivial RAM, e.g. the > doc mapping (to compress docid space when deletions are reclaimed). > >> > >> When we added KNN vector testing to Lucene's nightly benchmarks, the > indexing time massively increased -- see annotations DH and DP here: > https://home.apache.org/~mikemccand/lucenebench/indexing.html. Nightly > benchmarks now start at 6 PM and don't finish until ~14.5 hours later. Of > course, that is using a single thread for indexing (on a box that has 128 > cores!) so we produce a deterministic index every night ... > >> > >> Stepping out (meta) a bit ... this discussion is precisely one of the > awesome benefits of the (informed) veto. It means risky changes to the > software, as determined by any single informed developer on the project, > can force a healthy discussion about the problem at hand. Robert is > legitimately concerned about a real issue and so we should use our creative > energies to characterize our HNSW implementation's performance, document it > clearly for users, and uncover ways to improve it. > >> > >> Mike McCandless > >> > >> http://blog.mikemccandless.com > >> > >> > >> On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti < > a.benede...@sease.io> wrote: > >>> I think Gus points are on target. > >>> > >>> I recommend we move this forward in this way: > >>> We stop any discussion and everyone interested proposes an option with > a motivation, then we aggregate the options and we create a Vote maybe? > >>> > >>> I am also on the same page on the fact that a veto should come with a > clear and reasonable technical merit, which also in my opinion has not come > yet. > >>> > >>> I also apologise if any of my words sounded harsh or personal attacks, > never meant to do so. > >>> > >>> My proposed option: > >>> > >>> 1) remove the limit and potentially make it configurable, > >>> Motivation: > >>> The system administrator can enforce a limit its users need to respect > that it's in line with whatever the admin decided to be acceptable for them. > >>> Default can stay the current one. 
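(On Mike's point above about documenting heap requirements, and about merge RAM being driven by fanout rather than dimension: here's a back-of-envelope sketch. The per-link costs and level fractions are my own assumptions for illustration, not Lucene's actual HNSW layout; the point is just that graph RAM grows with vector count and M, while raw vector storage grows with dimension.)

    def hnsw_ram_estimate(num_vectors, dims, m=16, bytes_per_link=4):
        """Very rough HNSW sizing sketch (assumed costs, not Lucene's real layout)."""
        raw_vectors = num_vectors * dims * 4          # float32 vector storage
        # layer 0 keeps up to 2*M links per node; roughly a 1/M fraction of nodes
        # is promoted to each successive upper layer, each holding up to M links
        layer0_links = num_vectors * 2 * m * bytes_per_link
        upper_links = num_vectors / (m - 1) * m * bytes_per_link
        return raw_vectors, layer0_links + upper_links

    for dims in (256, 768, 1536):
        raw, graph = hnsw_ram_estimate(10_000_000, dims, m=16)
        print(f"dims={dims}: vectors ~{raw/2**30:.1f} GiB, graph ~{graph/2**30:.1f} GiB")

Multiplying the graph-building portion by the O(log N) merges each doc participates in over its lifetime gives a feel for the total indexing work; the dimension then mostly shows up in the cost of the distance computations rather than in the resident graph size.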
> >>> > >>> That's my favourite at the moment, but I agree that potentially in the > future this may need to change, as we may optimise the data structures for > certain dimensions. I am a big fan of Yagni (you aren't going to need it) > so I am ok we'll face a different discussion if that happens in the future. > >>> > >>> > >>> > >>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.h...@gmail.com> wrote: > >>>> What I see so far: > >>>> > >>>> Much positive support for raising the limit > >>>> Slightly less support for removing it or making it configurable > >>>> A single veto which argues that a (as yet undefined) performance > standard must be met before raising the limit > >>>> Hot tempers (various) making this discussion difficult > >>>> > >>>> As I understand it, vetoes must have technical merit. I'm not sure > that this veto rises to "technical merit" on 2 counts: > >>>> > >>>> No standard for the performance is given so it cannot be technically > met. Without hard criteria it's a moving target. > >>>> It appears to encode a valuation of the user's time, and that > valuation is really up to the user. Some users may consider 2hours useless > and not worth it, and others might happily wait 2 hours. This is not a > technical decision, it's a business decision regarding the relative value > of the time invested vs the value of the result. If I can cure cancer by > indexing for a year, that might be worth it... (hyperbole of course). > >>>> > >>>> Things I would consider to have technical merit that I don't hear: > >>>> > >>>> Impact on the speed of **other** indexing operations. (devaluation of > other functionality) > >>>> Actual scenarios that work when the limit is low and fail when the > limit is high (new failure on the same data with the limit raised). > >>>> > >>>> One thing that might or might not have technical merit > >>>> > >>>> If someone feels there is a lack of documentation of the > costs/performance implications of using large vectors, possibly including > reproducible benchmarks establishing the scaling behavior (there seems to > be disagreement on O(n) vs O(n^2)). > >>>> > >>>> The users *should* know what they are getting into, but if the cost > is worth it to them, they should be able to pay it without forking the > project. If this veto causes a fork that's not good. > >>>> > >>>> On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov <msoko...@gmail.com> > wrote: > >>>>> We do have a dataset built from Wikipedia in luceneutil. It comes in > 100 and 300 dimensional varieties and can easily enough generate large > numbers of vector documents from the articles data. To go higher we could > concatenate vectors from that and I believe the performance numbers would > be plausible. > >>>>> > >>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.we...@gmail.com> > wrote: > >>>>>> Can we set up a branch in which the limit is bumped to 2048, then > have > >>>>>> a realistic, free data set (wikipedia sample or something) that has, > >>>>>> say, 5 million docs and vectors created using public data (glove > >>>>>> pre-trained embeddings or the like)? We then could run indexing on > the > >>>>>> same hardware with 512, 1024 and 2048 and see what the numbers, > limits > >>>>>> and behavior actually are. > >>>>>> > >>>>>> I can help in writing this but not until after Easter. 
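(Along the lines of what Michael and Dawid suggest above, a tiny sketch of synthesising higher-dimensional test vectors by concatenating lower-dimensional ones and re-normalising. The file name and source dimensionality are placeholders, e.g. the 300-dim luceneutil/GloVe vectors.)

    import numpy as np

    rng = np.random.default_rng(0)
    base = np.load("glove_300d_vectors.npy")        # placeholder: (num_docs, 300) float32

    def synthesize(vectors, copies, rng):
        """Concatenate `copies` independently shuffled views of the corpus, then re-normalise."""
        parts = [vectors[rng.permutation(len(vectors))] for _ in range(copies)]
        combined = np.concatenate(parts, axis=1)    # (num_docs, 300 * copies)
        return combined / np.linalg.norm(combined, axis=1, keepdims=True)

    vectors_1200d = synthesize(base, copies=4, rng=rng)   # e.g. 4 x 300 -> 1200 dims
    print(vectors_1200d.shape, np.linalg.norm(vectors_1200d[0]))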
> >>>>>> > >>>>>> > >>>>>> Dawid > >>>>>> > >>>>>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <jpou...@gmail.com> > wrote: > >>>>>>> As Dawid pointed out earlier on this thread, this is the rule for > >>>>>>> Apache projects: a single -1 vote on a code change is a veto and > >>>>>>> cannot be overridden. Furthermore, Robert is one of the people on > this > >>>>>>> project who worked the most on debugging subtle bugs, making Lucene > >>>>>>> more robust and improving our test framework, so I'm listening > when he > >>>>>>> voices quality concerns. > >>>>>>> > >>>>>>> The argument against removing/raising the limit that resonates > with me > >>>>>>> the most is that it is a one-way door. As MikeS highlighted > earlier on > >>>>>>> this thread, implementations may want to take advantage of the fact > >>>>>>> that there is a limit at some point too. This is why I don't want > to > >>>>>>> remove the limit and would prefer a slight increase, such as 2048 > as > >>>>>>> suggested in the original issue, which would enable most of the > things > >>>>>>> that users who have been asking about raising the limit would like > to > >>>>>>> do. > >>>>>>> > >>>>>>> I agree that the merge-time memory usage and slow indexing rate are > >>>>>>> not great. But it's still possible to index multi-million vector > >>>>>>> datasets with a 4GB heap without hitting OOMEs regardless of the > >>>>>>> number of dimensions, and the feedback I'm seeing is that many > users > >>>>>>> are still interested in indexing multi-million vector datasets > despite > >>>>>>> the slow indexing rate. I wish we could do better, and vector > indexing > >>>>>>> is certainly more expert than text indexing, but it still is > usable in > >>>>>>> my opinion. I understand how giving Lucene more information about > >>>>>>> vectors prior to indexing (e.g. clustering information as Jim > pointed > >>>>>>> out) could help make merging faster and more memory-efficient, but > I > >>>>>>> would really like to avoid making it a requirement for indexing > >>>>>>> vectors as it also makes this feature much harder to use. > >>>>>>> > >>>>>>> On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti > >>>>>>> <a.benede...@sease.io> wrote: > >>>>>>>> I am very attentive to listen opinions but I am un-convinced here > and I an not sure that a single person opinion should be allowed to be > detrimental for such an important project. > >>>>>>>> > >>>>>>>> The limit as far as I know is literally just raising an exception. > >>>>>>>> Removing it won't alter in any way the current performance for > users in low dimensional space. > >>>>>>>> Removing it will just enable more users to use Lucene. > >>>>>>>> > >>>>>>>> If new users in certain situations will be unhappy with the > performance, they may contribute improvements. > >>>>>>>> This is how you make progress. > >>>>>>>> > >>>>>>>> If it's a reputation thing, trust me that not allowing users to > play with high dimensional space will equally damage it. > >>>>>>>> > >>>>>>>> To me it's really a no brainer. > >>>>>>>> Removing the limit and enable people to use high dimensional > vectors will take minutes. > >>>>>>>> Improving the hnsw implementation can take months. > >>>>>>>> Pick one to begin with... > >>>>>>>> > >>>>>>>> And there's no-one paying me here, no company interest > whatsoever, actually I pay people to contribute, I am just convinced it's a > good idea. 
> >>>>>>>> > >>>>>>>> > >>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote: > >>>>>>>>> I disagree with your categorization. I put in plenty of work and > >>>>>>>>> experienced plenty of pain myself, writing tests and fighting > these > >>>>>>>>> issues, after i saw that, two releases in a row, vector indexing > fell > >>>>>>>>> over and hit integer overflows etc on small datasets: > >>>>>>>>> > >>>>>>>>> https://github.com/apache/lucene/pull/11905 > >>>>>>>>> > >>>>>>>>> Attacking me isn't helping the situation. > >>>>>>>>> > >>>>>>>>> PS: when i said the "one guy who wrote the code" I didn't mean > it in > >>>>>>>>> any kind of demeaning fashion really. I meant to describe the > current > >>>>>>>>> state of usability with respect to indexing a few million docs > with > >>>>>>>>> high dimensions. You can scroll up the thread and see that at > least > >>>>>>>>> one other committer on the project experienced similar pain as > me. > >>>>>>>>> Then, think about users who aren't committers trying to use the > >>>>>>>>> functionality! > >>>>>>>>> > >>>>>>>>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov < > msoko...@gmail.com> wrote: > >>>>>>>>>> What you said about increasing dimensions requiring a bigger > ram buffer on merge is wrong. That's the point I was trying to make. Your > concerns about merge costs are not wrong, but your conclusion that we need > to limit dimensions is not justified. > >>>>>>>>>> > >>>>>>>>>> You complain that hnsw sucks it doesn't scale, but when I show > it scales linearly with dimension you just ignore that and complain about > something entirely different. > >>>>>>>>>> > >>>>>>>>>> You demand that people run all kinds of tests to prove you > wrong but when they do, you don't listen and you won't put in the work > yourself or complain that it's too hard. > >>>>>>>>>> > >>>>>>>>>> Then you complain about people not meeting you half way. Wow > >>>>>>>>>> > >>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> > wrote: > >>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner > >>>>>>>>>>> <michael.wech...@wyona.com> wrote: > >>>>>>>>>>>> What exactly do you consider reasonable? > >>>>>>>>>>> Let's begin a real discussion by being HONEST about the current > >>>>>>>>>>> status. Please put politically correct or your own company's > wishes > >>>>>>>>>>> aside, we know it's not in a good state. > >>>>>>>>>>> > >>>>>>>>>>> Current status is the one guy who wrote the code can set a > >>>>>>>>>>> multi-gigabyte ram buffer and index a small dataset with 1024 > >>>>>>>>>>> dimensions in HOURS (i didn't ask what hardware). > >>>>>>>>>>> > >>>>>>>>>>> My concerns are everyone else except the one guy, I want it to > be > >>>>>>>>>>> usable. Increasing dimensions just means even bigger > multi-gigabyte > >>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge. > >>>>>>>>>>> It is also a permanent backwards compatibility decision, we > have to > >>>>>>>>>>> support it once we do this and we can't just say "oops" and > flip it > >>>>>>>>>>> back. > >>>>>>>>>>> > >>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram buffer is > really to > >>>>>>>>>>> avoid merges because they are so slow and it would be DAYS > otherwise, > >>>>>>>>>>> or if its to avoid merges so it doesn't hit OOM. > >>>>>>>>>>> Also from personal experience, it takes trial and error (means > >>>>>>>>>>> experiencing OOM on merge!!!) before you get those heap values > correct > >>>>>>>>>>> for your dataset. 
This usually means starting over which is > >>>>>>>>>>> frustrating and wastes more time. > >>>>>>>>>>> > >>>>>>>>>>> Jim mentioned some ideas about the memory usage in > IndexWriter, seems > >>>>>>>>>>> to me like its a good idea. maybe the multigigabyte ram buffer > can be > >>>>>>>>>>> avoided in this way and performance improved by writing bigger > >>>>>>>>>>> segments with lucene's defaults. But this doesn't mean we can > simply > >>>>>>>>>>> ignore the horrors of what happens on merge. merging needs to > scale so > >>>>>>>>>>> that indexing really scales. > >>>>>>>>>>> > >>>>>>>>>>> At least it shouldnt spike RAM on trivial data amounts and > cause OOM, > >>>>>>>>>>> and definitely it shouldnt burn hours and hours of CPU in > O(n^2) > >>>>>>>>>>> fashion when indexing. > >>>>>>> > >>>>>>> -- > >>>>>>> Adrien > >>>> > >>>> -- > >>>> http://www.needhamsoftware.com (work) > >>>> http://www.the111shift.com (play)
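(One way to turn the O(n) vs O(n^2) disagreement Gus mentions above into a measurement rather than an argument: index the same data at a few corpus sizes, time each run, and fit the slope on a log-log scale. The timings below are placeholders; real numbers would come from luceneutil or the nightly benchmark runs.)

    import numpy as np

    # placeholder measurements: (num_docs, indexing_seconds)
    measurements = [
        (1_000_000, 540.0),
        (2_000_000, 1150.0),
        (4_000_000, 2400.0),
        (8_000_000, 5100.0),
    ]

    sizes = np.log([n for n, _ in measurements])
    times = np.log([t for _, t in measurements])
    slope, _intercept = np.polyfit(sizes, times, 1)

    # a slope near 1 suggests roughly linear growth, near 2 suggests quadratic
    print(f"log-log slope: {slope:.2f}")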