What exactly do you consider reasonable?
I think it would help if we could specify concrete requirements regarding
performance and scalability, because then we would have a concrete goal
to work with.
Do such requirements already exist, or what would be a good starting point?
Re the 2x worse point: I think Michael Sokolov already pointed out that
indexing time grows linearly with vector dimension, which is quite
obvious for a brute-force implementation, for example. I would argue this
will be the case for any implementation.
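To illustrate the linear scaling (a minimal sketch in plain Java, not
Lucene's actual code): each comparison touches every component once, so
a brute-force scan over n vectors costs O(n * d), and doubling the
dimension doubles the work.

    // Illustrative sketch only: one comparison is a single pass over
    // the d components, so brute force over n vectors is O(n * d).
    static float dotProduct(float[] query, float[] doc) {
      float sum = 0f;
      for (int i = 0; i < query.length; i++) {
        sum += query[i] * doc[i];
      }
      return sum;
    }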
And lastly I would like to ask again, slightly differently: do we want
people to use Lucene for this, which would give us an opportunity to
learn and make progress?
Thanks
Michael
On 08.04.23 at 13:04, Robert Muir wrote:
I don't think we have. The performance needs to be reasonable in order
to bump this limit. Otherwise bumping this limit makes the worst case
2x worse than it already is!
Moreover, it's clear something needs to happen to address the
scalability/lack of performance. I'd hate for this limit to be in the
way of that. Because of backwards compatibility, it's a one-way,
permanent, irreversible change.
I'm not sold by any means in any way yet. My vote remains the same.
On Fri, Apr 7, 2023 at 10:57 PM Michael Wechner
<michael.wech...@wyona.com> wrote:
sorry to interrupt, but I think we are getting side-tracked from the original
discussion about increasing the vector dimension limit.
I think improving vector indexing performance is one thing, and making sure
Lucene does not crash when increasing the vector dimension limit is another.
I think it is great to find better ways to index vectors, but I think this
should not prevent people from being able to use models with higher vector
dimensions than 1024.
The following comparison might not be perfect, but imagine we have invented a
combustion engine that is strong enough to move a car on flat terrain, but
that fails when applied to a truck moving things over mountains, because
it is not strong enough. Would you prevent people from using the combustion
engine for a car on flat terrain?
Thanks
Michael
On 08.04.23 at 00:15, jim ferenczi wrote:
Keep in mind, there may be other ways to do it. In general if merging
something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.
Yep I agree. Personally I don't see how we can solve this without prior
knowledge of the vectors. Faiss has a nice implementation that fits naturally
with Lucene called IVF (
https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html),
but if we want to avoid running k-means on every merge we'd need to provide
the clusters for the entire index before indexing the first vector.
It's a complex issue…
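To make the constraint concrete, here is a rough sketch of the IVF idea
(hypothetical names, not the Faiss or Lucene API): it only works if the
centroids are fixed before the first vector arrives, which is exactly
the prior knowledge mentioned above. Merging then reduces to
concatenating the per-centroid lists, with no k-means at merge time.

    // Hypothetical sketch of IVF-style indexing, assuming the
    // centroids are known before indexing starts.
    import java.util.ArrayList;
    import java.util.List;

    class IvfSketch {
      final float[][] centroids;       // required up front
      final List<List<float[]>> lists; // one inverted list per centroid

      IvfSketch(float[][] centroids) {
        this.centroids = centroids;
        this.lists = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) {
          lists.add(new ArrayList<>());
        }
      }

      // Route each vector to its nearest centroid's list.
      void add(float[] vector) {
        lists.get(nearest(vector)).add(vector);
      }

      private int nearest(float[] v) {
        int best = 0;
        float bestDist = Float.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
          float dist = 0f;
          for (int i = 0; i < v.length; i++) {
            float diff = v[i] - centroids[c][i];
            dist += diff * diff;
          }
          if (dist < bestDist) {
            bestDist = dist;
            best = c;
          }
        }
        return best;
      }
    }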
On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcm...@gmail.com> wrote:
Personally I'd have to re-read the paper, but in general the merging
issue has to be addressed somehow to fix the overall indexing-time
problem. It seems it gets "dodged" with huge RAM buffers in the emails
here.
Keep in mind, there may be other ways to do it. In general if merging
something is going to be "heavyweight", we should think about it to
prevent things from going really bad overall.
As an example, I'm most familiar with adding DEFLATE compression to
stored fields. Previously, we'd basically decompress and recompress
the stored fields on merge, and LZ4 is so fast that it wasn't
obviously a problem. But with DEFLATE it got slower/heavier (more
intense compression algorithm), something had to be done or indexing
would be unacceptably slow. Hence if you look at the stored fields writer,
there is "dirtiness" logic etc. so that recompression is amortized over
time and doesn't happen on every merge.
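The shape of that amortization, as a hypothetical sketch (not Lucene's
actual stored fields code; the threshold value is made up): copy
compressed chunks through untouched on merge, and only pay for
recompression once the fraction of dirty chunks crosses a threshold.

    // Hypothetical sketch of the amortization idea: track how many
    // copied chunks are "dirty" (partially filled) and recompress only
    // when the dirty fraction is too high, not on every merge.
    class RecompressionPolicy {
      private static final double DIRTY_THRESHOLD = 0.1; // made-up value
      private int dirtyChunks;
      private int totalChunks;

      void onChunkCopied(boolean dirty) {
        totalChunks++;
        if (dirty) {
          dirtyChunks++;
        }
      }

      boolean shouldRecompress() {
        return totalChunks > 0
            && (double) dirtyChunks / totalChunks > DIRTY_THRESHOLD;
      }
    }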
On Fri, Apr 7, 2023 at 5:38 PM jim ferenczi <jim.feren...@gmail.com> wrote:
I am also not sure that diskann would solve the merging issue. The idea
describe in the paper is to run kmeans first to create multiple graphs, one per
cluster. In our case the vectors in each segment could belong to different
cluster so I don’t see how we could merge them efficiently.
On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.feren...@gmail.com> wrote:
The inference time (and cost) to generate these big vectors must be quite large
too ;).
Regarding the RAM buffer, we could drastically reduce its size by writing the
vectors to disk instead of keeping them in the heap. With 1k dimensions the
RAM buffer fills up with these vectors quite rapidly.
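A hypothetical sketch of that idea (not an existing Lucene API): spill
vectors to a temp file as they arrive, so the heap never holds the
float[]s. At 1024 float dimensions each vector is 4 KB, so the savings
add up quickly.

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    class OffHeapVectorBuffer implements AutoCloseable {
      private final DataOutputStream out;

      OffHeapVectorBuffer(Path tempFile) throws IOException {
        out = new DataOutputStream(
            new BufferedOutputStream(Files.newOutputStream(tempFile)));
      }

      void add(float[] vector) throws IOException {
        for (float v : vector) {
          out.writeFloat(v); // 4 bytes per dimension go to disk, not heap
        }
      }

      @Override
      public void close() throws IOException {
        out.close();
      }
    }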
On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcm...@gmail.com> wrote:
On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov <msoko...@gmail.com> wrote:
8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
Robert, since you're the only on-the-record veto here, does this
change your thinking at all, or if not could you share some test
results that didn't go the way you expected? Maybe we can find some
mitigation if we focus on a specific issue.
My scale concerns are both space and time. What does the execution
time look like if you don't set an insanely large IW RAM buffer? The
default is 16MB. Just concerned we're shoving some problems under the
rug :)
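For reference, the knob in question is IndexWriterConfig.setRAMBufferSizeMB
(that much is real Lucene API); the 1994 value is from the runs reported
above:

    import org.apache.lucene.index.IndexWriterConfig;

    class RamBufferSettings {
      // Default flush trigger is a 16 MB RAM buffer.
      static IndexWriterConfig defaults() {
        return new IndexWriterConfig();
      }

      // The ~2 GB buffer used in the reported runs.
      static IndexWriterConfig benchmark() {
        return new IndexWriterConfig().setRAMBufferSizeMB(1994);
      }
    }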
Even with the yuge RAM buffer, we're still talking about almost 2 hours
to index 4M documents with these 2k vectors. Whereas with typical Lucene
indexing you'd measure this in seconds; it's nothing.