I understand what you mean, that it seems artificial, but I don't understand why this matters for testing the performance and scalability of the indexing.

Let's assume Lucene's limit were 4 instead of 1024, and there were only open-source models generating vectors with 4 dimensions, for example

0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814

0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844

-0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106

-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114

and now I concatenate them into vectors with 8 dimensions


0.02150459587574005,0.11223817616701126,-0.007903356105089188,0.03795722872018814,0.026009393855929375,0.006306684575974941,0.020492585375905037,-0.029064252972602844

-0.08239810913801193,-0.01947402022778988,0.03827739879488945,-0.020566290244460106,-0.007012288551777601,-0.026665858924388885,0.044495150446891785,-0.038030195981264114

and normalize them to length 1.

Why should this be any different from a model acting as a black box that generates vectors with 8 dimensions?
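
In code, the concatenate-and-normalize step described above is nothing more than the following (a minimal plain-Java sketch, just to make the construction concrete):

    // Concatenate two vectors and normalize the result to unit length (L2 norm = 1).
    static float[] concatAndNormalize(float[] a, float[] b) {
        float[] out = new float[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        double norm = 0;
        for (float v : out) {
            norm += v * v;
        }
        norm = Math.sqrt(norm);
        for (int i = 0; i < out.length; i++) {
            out[i] = (float) (out[i] / norm);
        }
        return out;
    }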




On 11.04.23 at 19:05, Michael Sokolov wrote:
What exactly do you consider real vector data? Vector data which is based on 
texts written by humans?
We have plenty of text; the problem is coming up with a realistic
vector model that requires as many dimensions as people seem to be
demanding. As I said above, after surveying huggingface I couldn't
find any text-based model using more than 768 dimensions. So far we
have some ideas of generating higher-dimensional data by dithering or
concatenating existing data, but it seems artificial.

On Tue, Apr 11, 2023 at 9:31 AM Michael Wechner
<michael.wech...@wyona.com> wrote:
What exactly do you consider real vector data? Vector data which is based on 
texts written by humans?

I am asking, because I recently attended the following presentation by 
Anastassia Shaitarova (UZH Institute for Computational Linguistics, 
https://www.cl.uzh.ch/de/people/team/compling/shaitarova.html)

----

Can we Identify Machine-Generated Text? An Overview of Current Approaches
by Anastassia Shaitarova (UZH Institute for Computational Linguistics)

The detection of machine-generated text has become increasingly important due 
to the prevalence of automated content generation and its potential for misuse. 
In this talk, we will discuss the motivation for automatic detection of 
generated text. We will present the currently available methods, including 
feature-based classification as a “first line-of-defense.” We will provide an 
overview of the detection tools that have been made available so far and 
discuss their limitations. Finally, we will reflect on some open problems 
associated with the automatic discrimination of generated texts.

----

and her conclusion was that it has become basically impossible to differentiate 
between text generated by humans and text generated by, for example, ChatGPT.

Whereas others have a slightly different opinion, see for example

https://www.wired.com/story/how-to-spot-generative-ai-text-chatgpt/

But I would argue that real-world and synthetic data have become close enough that 
testing the performance and scalability of indexing should be possible with 
synthetic data.

I completely agree that we have to base our discussions and decisions on 
scientific methods and that we have to make sure that Lucene performs and 
scales well and that we understand the limits and what is going on under the 
hood.

Thanks

Michael W





On 11.04.23 at 14:29, Michael McCandless wrote:

+1 to test on real vector data -- if you test on synthetic data you draw 
synthetic conclusions.

Can someone post the theoretical performance (CPU and RAM required) of HNSW 
construction?  Do we know/believe our HNSW implementation has achieved that 
theoretical big-O performance?  Maybe we have some silly performance bug that's 
causing it not to?
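
For reference (these are the complexities claimed in the HNSW paper itself, not a statement about what our implementation actually achieves):

    \text{search} \approx O(\log N), \qquad
    \text{construction} \approx O(N \log N), \qquad
    \text{graph memory} \approx O(N \cdot M) \text{ links} + N \cdot d \text{ floats for the raw vectors}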

As I understand it, HNSW makes the tradeoff of costly construction for faster 
searching, which is typically the right tradeoff for search use cases.  We do 
this in other parts of the Lucene index too.

Lucene will do a logarithmic number of merges over time, i.e. each doc will be 
merged O(log(N)) times in its lifetime in the index.  We need to multiply that 
by the cost of re-building the whole HNSW graph on each merge.  BTW, other 
things in Lucene, like BKD/dimensional points, also rebuild the whole data 
structure on each merge, I think?  But, as Rob pointed out, stored fields 
merging does indeed do some sneaky tricks to avoid excessive block 
decompress/recompress on each merge.
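
As a rough back-of-envelope model (an assumption for discussion, not a measured result), with N total docs, d dimensions, M the fanout, and n_seg the size of the segment being merged:

    W_{\text{total}} \;\approx\; O(\log N) \,\cdot\, N \,\cdot\, c_{\text{insert}}(d, M, n_{\text{seg}})

where c_insert is the per-document HNSW insertion cost and every distance computation inside it costs O(d). If that model holds, the total graph-build work is linear in d, but multiplied by the extra log factor that comes from re-building the graph on every merge.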

As I understand it, vetoes must have technical merit. I'm not sure that this veto rises 
to "technical merit" on 2 counts:
Actually I think Robert's veto stands on its technical merit already.  Robert's 
takes on technical matters very much resonate with me, even if he is sometimes 
prickly in how he expresses them ;)

His point is that we, as a dev community, are not paying enough attention to 
the indexing performance of our KNN algo (HNSW) and implementation, and that it 
is reckless to increase / remove limits in that state.  It is indeed a one-way 
door decision and one must confront such decisions with caution, especially for 
such a widely used base infrastructure as Lucene.  We don't even advertise 
today in our javadocs that you need XXX heap if you index vectors with 
dimension Y, fanout X, levels Z, etc.

RAM used during merging is unaffected by dimensionality, but is affected by 
fanout, because the HNSW graph (not the raw vectors) is memory resident, I 
think?  Maybe we could move it off-heap and let the OS manage the memory (and 
still document the RAM requirements)?  Maybe merge RAM costs should be 
accounted for in IW's RAM buffer accounting?  It is not today, and there are 
some other things that use non-trivial RAM, e.g. the doc mapping (to compress 
docid space when deletions are reclaimed).
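
Purely as a hypothetical back-of-envelope estimator (the constants and layout below are assumptions for illustration, not Lucene's actual accounting), something like this could go into such documentation:

    // Hypothetical estimate of on-heap HNSW graph size during a merge.
    // Assumes 4-byte int neighbor ids, 2*maxConn neighbors per node on level 0,
    // maxConn neighbors on upper levels, geometrically fewer nodes per level,
    // and a guessed per-node object/array overhead. Raw vectors are not included.
    static long estimateGraphHeapBytes(long numDocs, int maxConn) {
        long level0 = numDocs * 2L * maxConn * Integer.BYTES;
        long upperLevels = 0;
        for (long nodes = numDocs / maxConn; nodes > 0; nodes /= maxConn) {
            upperLevels += nodes * (long) maxConn * Integer.BYTES;
        }
        long perNodeOverhead = numDocs * 64L; // rough guess
        return level0 + upperLevels + perNodeOverhead;
    }

With numDocs = 10M and maxConn = 16 this comes out to roughly 2 GB under these assumptions, which is at least in the same ballpark as the "multi-gigabyte" observations elsewhere in this thread.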

When we added KNN vector testing to Lucene's nightly benchmarks, the indexing 
time massively increased -- see annotations DH and DP here: 
https://home.apache.org/~mikemccand/lucenebench/indexing.html.  Nightly 
benchmarks now start at 6 PM and don't finish until ~14.5 hours later.  Of 
course, that is using a single thread for indexing (on a box that has 128 
cores!) so we produce a deterministic index every night ...

Stepping out (meta) a bit ... this discussion is precisely one of the awesome 
benefits of the (informed) veto.  It means risky changes to the software, as 
determined by any single informed developer on the project, can force a healthy 
discussion about the problem at hand.  Robert is legitimately concerned about a 
real issue and so we should use our creative energies to characterize our HNSW 
implementation's performance, document it clearly for users, and uncover ways 
to improve it.

Mike McCandless

http://blog.mikemccandless.com


On Mon, Apr 10, 2023 at 6:41 PM Alessandro Benedetti <a.benede...@sease.io> 
wrote:
I think Gus's points are on target.

I recommend we move this forward in this way:
We stop any further discussion, everyone interested proposes an option with a 
motivation, then we aggregate the options and maybe create a vote?

I am also on the same page that a veto should come with clear and reasonable 
technical merit, which in my opinion has not been provided yet.

I also apologise if any of my words sounded harsh or like personal attacks; I 
never meant them that way.

My proposed option:

1) Remove the limit and potentially make it configurable.
Motivation:
The system administrator can enforce a limit that their users need to respect, 
in line with whatever the admin decides is acceptable for them.
The default can stay the current one. (A rough sketch of what a configurable 
limit could look like follows below.)

That's my favourite at the moment, but I agree that potentially in the future 
this may need to change, as we may optimise the data structures for certain 
dimensions. I am a big fan of YAGNI (you aren't going to need it), so I am OK 
with facing a different discussion if that happens in the future.
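
Purely to illustrate option 1) above (this is a hypothetical sketch, not existing Lucene code; the property name and error message are invented), a configurable limit could be as small as:

    // Hypothetical sketch of a configurable max-dimension check.
    // "lucene.vector.maxDimensions" is an invented property name for illustration;
    // 1024 mirrors the current default limit being discussed in this thread.
    final class VectorLimit {
        static final int DEFAULT_MAX_DIMENSIONS = 1024;
        static final int MAX_DIMENSIONS =
            Integer.getInteger("lucene.vector.maxDimensions", DEFAULT_MAX_DIMENSIONS);

        static void checkDimension(int dimension) {
            if (dimension <= 0 || dimension > MAX_DIMENSIONS) {
                throw new IllegalArgumentException(
                    "vector dimension must be in (0, " + MAX_DIMENSIONS + "] but was " + dimension);
            }
        }
    }

The admin sets the property; everyone else gets the current default and the current behaviour.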



On Sun, 9 Apr 2023, 18:46 Gus Heck, <gus.h...@gmail.com> wrote:
What I see so far:

- Much positive support for raising the limit
- Slightly less support for removing it or making it configurable
- A single veto which argues that an (as yet undefined) performance standard must 
be met before raising the limit
- Hot tempers (various) making this discussion difficult

As I understand it, vetoes must have technical merit. I'm not sure that this veto rises 
to "technical merit" on 2 counts:

1. No standard for the performance is given, so it cannot be technically met. 
Without hard criteria it's a moving target.
2. It appears to encode a valuation of the user's time, and that valuation is 
really up to the user. Some users may consider 2 hours useless and not worth it, 
and others might happily wait 2 hours. This is not a technical decision; it's a 
business decision regarding the relative value of the time invested vs. the 
value of the result. If I can cure cancer by indexing for a year, that might be 
worth it... (hyperbole, of course).

Things I would consider to have technical merit that I don't hear:

- Impact on the speed of **other** indexing operations (devaluation of other 
functionality).
- Actual scenarios that work when the limit is low and fail when the limit is 
high (new failure on the same data with the limit raised).

One thing that might or might not have technical merit:

- If someone feels there is a lack of documentation of the costs/performance 
implications of using large vectors, possibly including reproducible benchmarks 
establishing the scaling behavior (there seems to be disagreement on O(n) vs 
O(n^2)).

The users *should* know what they are getting into, but if the cost is worth it 
to them, they should be able to pay it without forking the project. If this 
veto causes a fork, that's not good.

On Sun, Apr 9, 2023 at 7:55 AM Michael Sokolov <msoko...@gmail.com> wrote:
We do have a dataset built from Wikipedia in luceneutil. It comes in 100- and 
300-dimensional varieties and can easily enough generate large numbers of 
vector documents from the articles data. To go higher we could concatenate 
vectors from that, and I believe the performance numbers would be plausible.

On Sun, Apr 9, 2023, 1:32 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
Can we set up a branch in which the limit is bumped to 2048, then have
a realistic, free data set (wikipedia sample or something) that has,
say, 5 million docs and vectors created using public data (glove
pre-trained embeddings or the like)? We then could run indexing on the
same hardware with 512, 1024 and 2048 and see what the numbers, limits
and behavior actually are.
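
A minimal sketch of this kind of indexing run (assuming Lucene 9.x's KnnFloatVectorField; the path, RAM buffer size, and readVectors() loader are placeholders to be filled in from the chosen dataset):

    import java.nio.file.Paths;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class IndexVectorsBench {
      public static void main(String[] args) throws Exception {
        int dim = Integer.parseInt(args[0]);  // run with 512, 1024, 2048
        IndexWriterConfig iwc = new IndexWriterConfig().setRAMBufferSizeMB(2000);  // placeholder buffer size
        long start = System.nanoTime();
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/knn-bench-" + dim));
             IndexWriter writer = new IndexWriter(dir, iwc)) {
          for (float[] vector : readVectors(dim)) {  // placeholder data loader
            Document doc = new Document();
            doc.add(new KnnFloatVectorField("vec", vector, VectorSimilarityFunction.EUCLIDEAN));
            writer.addDocument(doc);
          }
          writer.forceMerge(1);  // include merge cost in the measurement
        }
        System.out.println(dim + " dims indexed in "
            + (System.nanoTime() - start) / 1_000_000_000L + "s");
      }

      // Placeholder: plug in the glove / wikipedia embeddings here.
      static Iterable<float[]> readVectors(int dim) {
        throw new UnsupportedOperationException("supply real vectors");
      }
    }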

I can help in writing this but not until after Easter.


Dawid

On Sat, Apr 8, 2023 at 11:29 PM Adrien Grand <jpou...@gmail.com> wrote:
As Dawid pointed out earlier on this thread, this is the rule for
Apache projects: a single -1 vote on a code change is a veto and
cannot be overridden. Furthermore, Robert is one of the people on this
project who worked the most on debugging subtle bugs, making Lucene
more robust and improving our test framework, so I'm listening when he
voices quality concerns.

The argument against removing/raising the limit that resonates with me
the most is that it is a one-way door. As MikeS highlighted earlier on
this thread, implementations may want to take advantage of the fact
that there is a limit at some point too. This is why I don't want to
remove the limit and would prefer a slight increase, such as 2048 as
suggested in the original issue, which would enable most of the things
that users who have been asking about raising the limit would like to
do.

I agree that the merge-time memory usage and slow indexing rate are
not great. But it's still possible to index multi-million vector
datasets with a 4GB heap without hitting OOMEs regardless of the
number of dimensions, and the feedback I'm seeing is that many users
are still interested in indexing multi-million vector datasets despite
the slow indexing rate. I wish we could do better, and vector indexing
is certainly more expert than text indexing, but it still is usable in
my opinion. I understand how giving Lucene more information about
vectors prior to indexing (e.g. clustering information as Jim pointed
out) could help make merging faster and more memory-efficient, but I
would really like to avoid making it a requirement for indexing
vectors as it also makes this feature much harder to use.

On Sat, Apr 8, 2023 at 9:28 PM Alessandro Benedetti
<a.benede...@sease.io> wrote:
I am very attentive to listening to opinions, but I am unconvinced here, and I am 
not sure that a single person's opinion should be allowed to be detrimental to 
such an important project.

The limit, as far as I know, is literally just an exception being raised.
Removing it won't alter the current performance for users in low-dimensional 
space in any way.
Removing it will just enable more users to use Lucene.

If new users in certain situations are unhappy with the performance, they may 
contribute improvements.
This is how you make progress.

If it's a reputation thing, trust me that not allowing users to play with 
high-dimensional space will equally damage it.

To me it's really a no-brainer.
Removing the limit and enabling people to use high-dimensional vectors will take 
minutes.
Improving the HNSW implementation can take months.
Pick one to begin with...

And there's no one paying me here, no company interest whatsoever; actually, I 
pay people to contribute. I am just convinced it's a good idea.


On Sat, 8 Apr 2023, 18:57 Robert Muir, <rcm...@gmail.com> wrote:
I disagree with your categorization. I put in plenty of work and
experienced plenty of pain myself, writing tests and fighting these
issues, after I saw that, two releases in a row, vector indexing fell
over and hit integer overflows etc. on small datasets:

https://github.com/apache/lucene/pull/11905

Attacking me isn't helping the situation.

PS: when I said the "one guy who wrote the code" I didn't mean it in
any kind of demeaning fashion, really. I meant to describe the current
state of usability with respect to indexing a few million docs with
high dimensions. You can scroll up the thread and see that at least
one other committer on the project experienced similar pain as me.
Then, think about users who aren't committers trying to use the
functionality!

On Sat, Apr 8, 2023 at 12:51 PM Michael Sokolov <msoko...@gmail.com> wrote:
What you said about increasing dimensions requiring a bigger RAM buffer on 
merge is wrong. That's the point I was trying to make. Your concerns about 
merge costs are not wrong, but your conclusion that we need to limit dimensions 
is not justified.

You complain that HNSW sucks and doesn't scale, but when I show that it scales 
linearly with dimension, you just ignore that and complain about something 
entirely different.

You demand that people run all kinds of tests to prove you wrong, but when they 
do, you don't listen, and you either won't put in the work yourself or complain 
that it's too hard.

Then you complain about people not meeting you halfway. Wow.

On Sat, Apr 8, 2023, 12:40 PM Robert Muir <rcm...@gmail.com> wrote:
On Sat, Apr 8, 2023 at 8:33 AM Michael Wechner
<michael.wech...@wyona.com> wrote:
What exactly do you consider reasonable?
Let's begin a real discussion by being HONEST about the current
status. Please put political correctness or your own company's wishes
aside; we know it's not in a good state.

The current status is that the one guy who wrote the code can set a
multi-gigabyte RAM buffer and index a small dataset with 1024
dimensions in HOURS (I didn't ask what hardware).

My concern is everyone else except the one guy; I want it to be
usable. Increasing dimensions just means an even bigger multi-gigabyte
RAM buffer and a bigger heap to avoid OOM on merge.
It is also a permanent backwards-compatibility decision: we have to
support it once we do this, and we can't just say "oops" and flip it
back.

It is unclear to me if the multi-gigabyte RAM buffer is really there to
avoid merges because they are so slow and it would take DAYS otherwise,
or if it's to avoid merges so it doesn't hit OOM.
Also, from personal experience, it takes trial and error (meaning
experiencing OOM on merge!!!) before you get those heap values right
for your dataset. This usually means starting over, which is
frustrating and wastes more time.

Jim mentioned some ideas about the memory usage in IndexWriter; that
seems like a good idea to me. Maybe the multi-gigabyte RAM buffer can be
avoided this way and performance improved by writing bigger
segments with Lucene's defaults. But this doesn't mean we can simply
ignore the horrors of what happens on merge. Merging needs to scale so
that indexing really scales.

At least it shouldn't spike RAM on trivial data amounts and cause OOM,
and it definitely shouldn't burn hours and hours of CPU in O(n^2)
fashion when indexing.



--
Adrien



--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
