I only know some characteristics of the OpenAI ada-002 vectors,
although they are very popular as
embeddings/text-characterisations because they allow more
accurate/"human meaningful" semantic search results with fewer
dimensions than their predecessors. I've evaluated a few
different embedding models, including some BERT variants, CLIP
ViT-L-14 (with 768 dims, which was quite good), OpenAI's ada-001
(1024 dims) and babbage-001 (2048 dims), and ada-002 is
qualitatively the best, although that will certainly change!
In any case, ada-002 vectors have interesting characteristics
that I think mean you could confidently create synthetic vectors
which would be hard to distinguish from "real" vectors. I found
this from looking at 47K ada-002 vectors generated across a full
year (1994) of newspaper articles from the Canberra Times and
200K wikipedia articles:
- there is no discernible/significant correlation between values
in any pair of dimensions
- all but 5 of the 1536 dimensions have an almost identical
distribution of values, shown in the central blob on these graphs
(the graphs show only a few of the 1531 dimensions with clumped
values plus the 5 "outlier" dimensions, but all 1531 non-outlier
dims fall within that blob, which makes for some easy quantisation
from float to byte if you don't want to go the full
kmeans/clustering/Lloyd's-algorithm approach):
https://docs.google.com/spreadsheets/d/1DyyBCbirETZSUAEGcMK__mfbUNzsU_L48V9E0SyJYGg/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1czEAlzYdyKa6xraRLesXjNZvEzlj27TcDGiEFS1-MPs/edit?usp=sharing
https://docs.google.com/spreadsheets/d/1RxTjV7Sj14etCNLk1GB-m44CXJVKdXaFlg2Y6yvj3z4/edit?usp=sharing
- the variance of the value of each dimension is characteristic:
https://docs.google.com/spreadsheets/d/1w5LnRUXt1cRzI9Qwm07LZ6UfszjMOgPaJot9cOGLHok/edit#gid=472178228
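The "easy quantisation from float to byte" mentioned in the list
above could be sketched roughly as follows. This is only an
illustration: the clipping range of [-0.1, 0.1] and the function
names are assumptions made up for the example, not values measured
from the ada-002 data, and the 5 outlier dimensions would presumably
need their own ranges.

```python
def quantise(vec, lo=-0.1, hi=0.1):
    """Linearly map floats in [lo, hi] to signed bytes in [-128, 127].

    lo and hi are hypothetical bounds; in practice they would be read
    off the observed per-dimension value distribution.
    """
    scale = 255.0 / (hi - lo)
    codes = []
    for x in vec:
        clipped = min(max(x, lo), hi)  # clamp values outside the assumed range
        codes.append(int(round((clipped - lo) * scale)) - 128)
    return codes


def dequantise(codes, lo=-0.1, hi=0.1):
    """Approximate inverse of quantise(): signed byte back to float."""
    step = (hi - lo) / 255.0
    return [(c + 128) * step + lo for c in codes]
```

Because almost all dimensions share one distribution, a single
(lo, hi) pair can serve nearly every dimension, which is what makes
this cheaper than learning a codebook per dimension.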
This probably represents something significant about how the
ada-002 embeddings are created, but I think it also means
creating "realistic" values is possible. I did not use this
information when testing recall & performance of Lucene's HNSW
implementation on 192 million documents. Instead, I slightly
dithered the values of a "real" set of 47K docs, stored other
fields in each doc that referenced the "base" document that the
dithers were made from, and used different dithering magnitudes so
that I could test recall with different neighbour sizes ("M"),
construction beam widths and search beam widths.
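For anyone wanting to try something similar, the dithering idea
could be sketched as below. The details are assumptions for
illustration only: uniform noise, re-normalising to unit length
afterwards, and the field names "base_id" and "magnitude" are all
hypothetical, and the 16-dim random "base" vector merely stands in
for a real embedding.

```python
import math
import random


def normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dither(base, magnitude, rng):
    """Add uniform noise in [-magnitude, magnitude] to every component,
    then re-normalise (re-normalising is an assumption of this sketch)."""
    noisy = [x + rng.uniform(-magnitude, magnitude) for x in base]
    return normalise(noisy)


def cosine(a, b):
    """Cosine similarity for two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(42)
# Stand-in for one "real" embedding (16 dims instead of 1536).
base = normalise([rng.gauss(0.0, 1.0) for _ in range(16)])

# Each synthetic doc records which base doc it was derived from, so a
# recall test can check whether the base's dithers come back as neighbours.
docs = [
    {"base_id": 0, "magnitude": m, "vector": dither(base, m, rng)}
    for m in (0.001, 0.01)
]
```

Smaller magnitudes keep the dithers closer to their base vector, so
varying the magnitude gives neighbourhoods of different tightness to
test recall against.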
best regards
Kent Fitch
On Wed, Apr 12, 2023 at 5:08 AM Michael Wechner
<michael.wech...@wyona.com> wrote:
I understand what you mean that it seems to be artificial,
but I don't
understand why this matters to test performance and
scalability of the
indexing?
Let's assume the limit of Lucene would be 4 instead of 1024
and there
are only open source models generating vectors with 4
dimensions, for
example
0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814
0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
-0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106
-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
and now I concatenate them to vectors with 8 dimensions
0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844
-0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114
and normalize them to length 1.
Why should this be any different to a model which is acting
like a black
box generating vectors with 8 dimensions?
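The concatenate-and-normalize step described above can be sketched
as follows, using the four example vectors from this message. The
function name is made up for the illustration.

```python
import math

# The four 4-dimensional example vectors from the message above.
four_dim = [
    [0.02150459587574005, 0.11223817616701126,
     -0.007903356105089188, 0.03795722872018814],
    [0.026009393855929375, 0.006306684575974941,
     0.020492585375905037, -0.029064252972602844],
    [-0.08239810913801193, -0.01947402022778988,
     0.03827739879488945, -0.020566290244460106],
    [-0.007012288551777601, -0.026665858924388885,
     0.044495150446891785, -0.038030195981264114],
]


def concat_and_normalise(a, b):
    """Concatenate two vectors and scale the result to length 1."""
    joined = a + b
    norm = math.sqrt(sum(x * x for x in joined))
    return [x / norm for x in joined]


# Pair up consecutive vectors: (0, 1) and (2, 3).
eight_dim = [
    concat_and_normalise(four_dim[i], four_dim[i + 1])
    for i in range(0, len(four_dim), 2)
]
```

Normalizing only rescales the concatenated vector, so the relative
sizes of its components are unchanged.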
On 11.04.23 at 19:05, Michael Sokolov wrote:
>> What exactly do you consider real vector data? Vector
data which is based on texts written by humans?
> We have plenty of text; the problem is coming up with a
realistic
> vector model that requires as many dimensions as people
seem to be
> demanding. As I said above, after surveying huggingface I
couldn't
> find any text-based model using more than 768 dimensions.
So far we
> have some ideas of generating higher-dimensional data by
dithering or
> concatenating existing data, but it seems artificial.
>
> On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner
> <michael.wech...@wyona.com> wrote:
>> What exactly do you consider real vector data? Vector
data which is based on texts written by humans?
>>
>> I am asking, because I recently attended the following
presentation by Anastassia Shaitarova (UZH Institute for
Computational Linguistics,
https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)
>>
>> ----
>>
>> Can we Identify Machine-Generated Text? An Overview of
Current Approaches
>> by Anastassia Shaitarova (UZH Institute for Computational
Linguistics)
>>
>> The detection of machine-generated text has become
increasingly important due to the prevalence of automated
content generation and its potential for misuse. In this
talk, we will discuss the motivation for automatic detection
of generated text. We will present the currently available
methods, including feature-based classification as a “first
line-of-defense.” We will provide an overview of the
detection tools that have been made available so far and
discuss their limitations. Finally, we will reflect on some
open problems associated with the automatic discrimination
of generated texts.
>>
>> ----
>>
>> and her conclusion was that it has become basically
impossible to differentiate between text generated by humans
and text generated by for example ChatGPT.
>>
>> Whereas others have a slightly different opinion, see for
example
>>
>>
https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/
>>
>> But I would argue that real world and synthetic have
become close enough that testing performance and scalability
of indexing should be possible with synthetic data.
>>
>> I completely agree that we have to base our discussions
and decisions on scientific methods and that we have to make
sure that Lucene performs and scales well and that we
understand the limits and what is going on under the hood.
>>
>> Thanks
>>
>> Michael W
>>
>>
>>
>>
>>
>> On 11.04.23 at 14:29, Michael McCandless wrote:
>>
>> +1 to test on real vector data -- if you test on
synthetic data you draw synthetic conclusions.
>>
>> Can someone post the theoretical performance (CPU and RAM
required) of HNSW construction? Do we know/believe our HNSW
implementation has achieved that theoretical big-O
performance? Maybe we have some silly performance bug
that's causing it not to?
>>
>> As I understand it, HNSW makes the tradeoff of costly
construction for faster searching, which is typically the
right tradeoff for search use cases. We do this in other
parts of the Lucene index too.
>>
>> Lucene will do a logarithmic number of merges over time,
i.e. each doc will be merged O(log(N)) times in its lifetime
in the index. We need to multiply that by the cost of
re-building the whole HNSW graph on each merge. BTW, other
things in Lucene, like BKD/dimensional points, also rebuild
the whole data structure on each merge, I think? But, as Rob
pointed out, stored fields merging does indeed do some sneaky
tricks to avoid excessive block decompress/recompress on
each merge.
>>
>>> As I understand it, vetoes must have technical merit.
I'm not sure that this veto rises to "technical merit" on 2
counts:
>> Actually I think Robert's veto stands on its technical
merit already. Robert's take on technical matters very much
resonates with me, even if he is sometimes prickly in how he
expresses it ;)
>>
>> His point is that we, as a dev community, are not paying
enough attention to the indexing performance of our KNN algo
(HNSW) and implementation, and that it is reckless to
increase / remove limits in that state. It is indeed a
one-way door decision and one must confront such decisions
with caution, especially for such a widely used base
infrastructure as Lucene. We don't even advertise today in
our javadocs that you need XXX heap if you index vectors
with dimension Y, fanout X, levels Z, etc.
>>
>> RAM used during merging is unaffected by dimensionality,
but is affected by fanout, because the HNSW graph (not the
raw vectors) is memory resident, I think? Maybe we could
move it off-heap and let the OS manage the memory (and still
document the RAM requirements)? Maybe merge RAM costs
should be accounted for in IW's RAM buffer accounting? It
is not today, and there are some other things that use
non-trivial RAM, e.g. the doc mapping (to compress docid
space when deletions are reclaimed).
>>
>> When we added KNN vector testing to Lucene's nightly
benchmarks, the indexing time massively increased -- see
annotations DH and DP here:
https://home.apache.org/~mikemccand/lucenebench/indexing.html.
Nightly benchmarks now start at 6 PM and don't finish until
~14.5 hours later. Of course, that is using a single thread
for indexing (on a box that has 128 cores!) so we produce a
deterministic index every night ...
>>
>> Stepping out (meta) a bit ... this discussion is
precisely one of the awesome benefits of the (informed)
veto. It means risky changes to the software, as determined
by any single informed developer on the project, can force a
healthy discussion about the problem at hand. Robert is
legitimately concerned about a real issue and so we should
use our creative energies to characterize our HNSW
implementation's performance, document it clearly for users,
and uncover ways to improve it.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti
<a.benede...@sease.io> wrote:
>>> I think Gus's points are on target.
>>>
>>> I recommend we move this forward in this way:
>>> We stop any discussion and everyone interested proposes
an option with a motivation, then we aggregate the options
and we create a Vote maybe?
>>>
>>> I also agree that a veto
should come with clear and reasonable technical merit,
which in my opinion has not been presented yet.
>>>
>>> I also apologise if any of my words sounded harsh or like
personal attacks; I never meant them that way.
>>>
>>> My proposed option:
>>>
>>> 1) remove the limit and potentially make it configurable,
>>> Motivation:
>>> The system administrator can enforce a limit that its users
need to respect, in line with whatever the admin
has decided is acceptable for them.
>>> Default can stay the current one.
>>>
>>> That's my favourite at the moment, but I agree that
potentially in the future this may need to change, as we may
optimise the data structures for certain dimensions. I am a
big fan of YAGNI (you aren't going to need it), so I am OK
with facing a different discussion if that happens in the future.
>>>
>>>
>>>
>>> On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.h...@gmail.com>
wrote:
>>>> What I see so far:
>>>>
>>>> Much positive support for raising the limit
>>>> Slightly less support for removing it or making it
configurable
>>>> A single veto which argues that an (as yet undefined)
performance standard must be met before raising the limit
>>>> Hot tempers (various) making this discussion difficult
>>>>
>>>> As I understand it, vetoes must have technical merit.
I'm not sure that this veto rises to "technical merit" on 2
counts:
>>>>
>>>> No standard for the performance is given so it cannot
be technically met. Without hard criteria it's a moving target.
>>>> It appears to encode a valuation of the user's time,
and that valuation is really up to the user. Some users may
consider 2 hours useless and not worth it, and others might
happily wait 2 hours. This is not a technical decision, it's
a business decision regarding the relative value of the time
invested vs the value of the result. If I can cure cancer by
indexing for a year, that might be worth it... (hyperbole of
course).
>>>>
>>>> Things I would consider to have technical merit that I
don't hear:
>>>>
>>>> Impact on the speed of **other** indexing operations.
(devaluation of other functionality)
>>>> Actual scenarios that work when the limit is low and
fail when the limit is high (new failure on the same data
with the limit raised).
>>>>
>>>> One thing that might or might not have technical merit
>>>>
>>>> If someone feels there is a lack of documentation of
the costs/performance implications of using large vectors,
possibly including reproducible benchmarks establishing the
scaling behavior (there seems to be disagreement on O(n) vs
O(n^2)).
>>>>
>>>> The users *should* know what they are getting into, but
if the cost is worth it to them, they should be able to pay
it without forking the project. If this veto causes a fork
that's not good.
>>>>
>>>> On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov
<msoko...@gmail.com> wrote:
>>>>> We do have a dataset built from Wikipedia in
luceneutil. It comes in 100 and 300 dimensional varieties
and can easily enough generate large numbers of vector
documents from the articles data. To go higher we could
concatenate vectors from that and I believe the performance
numbers would be plausible.
>>>>>
>>>>> On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss
<dawid.we...@gmail.com> wrote:
>>>>>> Can we set up a branch in which the limit is bumped
to 2048, then have
>>>>>> a realistic, free data set (wikipedia sample or
something) that has,
>>>>>> say, 5 million docs and vectors created using public
data (glove
>>>>>> pre-trained embeddings or the like)? We then could
run indexing on the
>>>>>> same hardware with 512, 1024 and 2048 and see what
the numbers, limits
>>>>>> and behavior actually are.
>>>>>>
>>>>>> I can help in writing this but not until after Easter.
>>>>>>
>>>>>>
>>>>>> Dawid
>>>>>>
>>>>>> On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand
<jpou...@gmail.com> wrote:
>>>>>>> As Dawid pointed out earlier on this thread, this is
the rule for
>>>>>>> Apache projects: a single -1 vote on a code change
is a veto and
>>>>>>> cannot be overridden. Furthermore, Robert is one of
the people on this
>>>>>>> project who worked the most on debugging subtle
bugs, making Lucene
>>>>>>> more robust and improving our test framework, so I'm
listening when he
>>>>>>> voices quality concerns.
>>>>>>>
>>>>>>> The argument against removing/raising the limit that
resonates with me
>>>>>>> the most is that it is a one-way door. As MikeS
highlighted earlier on
>>>>>>> this thread, implementations may want to take
advantage of the fact
>>>>>>> that there is a limit at some point too. This is why
I don't want to
>>>>>>> remove the limit and would prefer a slight increase,
such as 2048 as
>>>>>>> suggested in the original issue, which would enable
most of the things
>>>>>>> that users who have been asking about raising the
limit would like to
>>>>>>> do.
>>>>>>>
>>>>>>> I agree that the merge-time memory usage and slow
indexing rate are
>>>>>>> not great. But it's still possible to index
multi-million vector
>>>>>>> datasets with a 4GB heap without hitting OOMEs
regardless of the
>>>>>>> number of dimensions, and the feedback I'm seeing is
that many users
>>>>>>> are still interested in indexing multi-million
vector datasets despite
>>>>>>> the slow indexing rate. I wish we could do better,
and vector indexing
>>>>>>> is certainly more expert than text indexing, but it
still is usable in
>>>>>>> my opinion. I understand how giving Lucene more
information about
>>>>>>> vectors prior to indexing (e.g. clustering
information as Jim pointed
>>>>>>> out) could help make merging faster and more
memory-efficient, but I
>>>>>>> would really like to avoid making it a requirement
for indexing
>>>>>>> vectors as it also makes this feature much harder to
use.
>>>>>>>
>>>>>>> On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
>>>>>>> <a.benede...@sease.io> wrote:
>>>>>>>> I listen carefully to opinions, but I am
unconvinced here and I am not sure that a single person's
opinion should be allowed to be detrimental to such an
important project.
>>>>>>>>
>>>>>>>> The limit as far as I know is literally just
raising an exception.
>>>>>>>> Removing it won't alter in any way the current
performance for users in low dimensional space.
>>>>>>>> Removing it will just enable more users to use Lucene.
>>>>>>>>
>>>>>>>> If new users in certain situations will be unhappy
with the performance, they may contribute improvements.
>>>>>>>> This is how you make progress.
>>>>>>>>
>>>>>>>> If it's a reputation thing, trust me that not
allowing users to play with high dimensional space will
equally damage it.
>>>>>>>>
>>>>>>>> To me it's really a no brainer.
>>>>>>>> Removing the limit and enabling people to use high
dimensional vectors will take minutes.
>>>>>>>> Improving the hnsw implementation can take months.
>>>>>>>> Pick one to begin with...
>>>>>>>>
>>>>>>>> And there's no-one paying me here, no company
interest whatsoever, actually I pay people to contribute, I
am just convinced it's a good idea.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, 8 Apr 2023, 18:57 Robert Muir,
<rcm...@gmail.com> wrote:
>>>>>>>>> I disagree with your categorization. I put in
plenty of work and
>>>>>>>>> experienced plenty of pain myself, writing tests
and fighting these
>>>>>>>>> issues, after I saw that, two releases in a row,
vector indexing fell
>>>>>>>>> over and hit integer overflows etc on small datasets:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/lucene/pull/11905
>>>>>>>>>
>>>>>>>>> Attacking me isn't helping the situation.
>>>>>>>>>
>>>>>>>>> PS: when i said the "one guy who wrote the code" I
didn't mean it in
>>>>>>>>> any kind of demeaning fashion really. I meant to
describe the current
>>>>>>>>> state of usability with respect to indexing a few
million docs with
>>>>>>>>> high dimensions. You can scroll up the thread and
see that at least
>>>>>>>>> one other committer on the project experienced
similar pain as me.
>>>>>>>>> Then, think about users who aren't committers
trying to use the
>>>>>>>>> functionality!
>>>>>>>>>
>>>>>>>>> On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov
<msoko...@gmail.com> wrote:
>>>>>>>>>> What you said about increasing dimensions
requiring a bigger ram buffer on merge is wrong. That's the
point I was trying to make. Your concerns about merge costs
are not wrong, but your conclusion that we need to limit
dimensions is not justified.
>>>>>>>>>>
>>>>>>>>>> You complain that HNSW sucks and doesn't scale,
but when I show it scales linearly with dimension you just
ignore that and complain about something entirely different.
>>>>>>>>>>
>>>>>>>>>> You demand that people run all kinds of tests to
prove you wrong but when they do, you don't listen and you
won't put in the work yourself or complain that it's too hard.
>>>>>>>>>>
>>>>>>>>>> Then you complain about people not meeting you
half way. Wow
>>>>>>>>>>
>>>>>>>>>> On Sat, Apr 8, 2023, 12:40 PM Robert Muir
<rcm...@gmail.com> wrote:
>>>>>>>>>>> On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
>>>>>>>>>>> <michael.wech...@wyona.com> wrote:
>>>>>>>>>>>> What exactly do you consider reasonable?
>>>>>>>>>>> Let's begin a real discussion by being HONEST
about the current
>>>>>>>>>>> status. Please put political correctness or your
own company's wishes
>>>>>>>>>>> aside, we know it's not in a good state.
>>>>>>>>>>>
>>>>>>>>>>> Current status is the one guy who wrote the code
can set a
>>>>>>>>>>> multi-gigabyte ram buffer and index a small
dataset with 1024
>>>>>>>>>>> dimensions in HOURS (I didn't ask what hardware).
>>>>>>>>>>>
>>>>>>>>>>> My concern is everyone other than the one
guy; I want it to be
>>>>>>>>>>> usable. Increasing dimensions just means an even
bigger multi-gigabyte
>>>>>>>>>>> ram buffer and bigger heap to avoid OOM on merge.
>>>>>>>>>>> It is also a permanent backwards compatibility
decision, we have to
>>>>>>>>>>> support it once we do this and we can't just say
"oops" and flip it
>>>>>>>>>>> back.
>>>>>>>>>>>
>>>>>>>>>>> It is unclear to me, if the multi-gigabyte ram
buffer is really to
>>>>>>>>>>> avoid merges because they are so slow and it
would be DAYS otherwise,
>>>>>>>>>>> or if it's to avoid merges so it doesn't hit OOM.
>>>>>>>>>>> Also from personal experience, it takes trial
and error (means
>>>>>>>>>>> experiencing OOM on merge!!!) before you get
those heap values correct
>>>>>>>>>>> for your dataset. This usually means starting
over which is
>>>>>>>>>>> frustrating and wastes more time.
>>>>>>>>>>>
>>>>>>>>>>> Jim mentioned some ideas about the memory usage
in IndexWriter, seems
>>>>>>>>>>> to me like it's a good idea. Maybe the
multi-gigabyte ram buffer can be
>>>>>>>>>>> avoided in this way and performance improved by
writing bigger
>>>>>>>>>>> segments with lucene's defaults. But this
doesn't mean we can simply
>>>>>>>>>>> ignore the horrors of what happens on merge.
Merging needs to scale so
>>>>>>>>>>> that indexing really scales.
>>>>>>>>>>>
>>>>>>>>>>> At least it shouldn't spike RAM on trivial data
amounts and cause OOM,
>>>>>>>>>>> and definitely it shouldn't burn hours and hours
of CPU in O(n^2)
>>>>>>>>>>> fashion when indexing.
>>>>>>>>>>>
>>>>>>>>>>>
---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe, e-mail:
dev-unsubscr...@lucene.apache.org
>>>>>>>>>>> For additional commands, e-mail:
dev-h...@lucene.apache.org
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Adrien
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> http://www.needhamsoftware.com (work)
>>>> http://www.the111shift.com (play)
>>
>
>