I don't think we have. The performance needs to be reasonable in order to bump this limit; otherwise, bumping it makes the worst case 2x worse than it already is!
Moreover, it's clear something needs to happen to address the
scalability/lack of performance. I'd hate for this limit to be in the way of
that. Because of backwards compatibility, it's a one-way, permanent,
irreversible change. I'm not sold by any means in any way yet. My vote
remains the same.

On Fri, Apr 7, 2023 at 10:57 PM Michael Wechner <michael.wech...@wyona.com> wrote:
>
> Sorry to interrupt, but I think we are getting side-tracked from the
> original discussion about increasing the vector dimension limit.
>
> I think improving vector indexing performance is one thing, and making
> sure Lucene does not crash when increasing the vector dimension limit is
> another.
>
> I think it is great to find better ways to index vectors, but this should
> not prevent people from being able to use models with vector dimensions
> higher than 1024.
>
> The following comparison might not be perfect, but imagine we have
> invented a combustion engine which is strong enough to move a car on flat
> ground, but which will fail when applied to a truck hauling things over
> mountains, because it is not strong enough. Would you prevent people from
> using the combustion engine for a car on flat ground?
>
> Thanks
>
> Michael
>
> On 08.04.23 at 00:15, jim ferenczi wrote:
>
> > Keep in mind, there may be other ways to do it. In general if merging
> > something is going to be "heavyweight", we should think about it to
> > prevent things from going really bad overall.
>
> Yep, I agree. Personally I don't see how we can solve this without prior
> knowledge of the vectors. Faiss has a nice implementation that fits
> naturally with Lucene, called IVF
> (https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html),
> but if we want to avoid running k-means on every merge, we'd need to
> provide the clusters for the entire index before indexing the first
> vector. It's a complex issue…
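To make the IVF idea above concrete, here is a minimal toy sketch in Java.
All names are hypothetical (this is neither the Faiss nor the Lucene API):
vectors are bucketed under the nearest of k pre-trained centroids, and a
query probes only its closest bucket. It assumes unit-length vectors so that
dot product works as the similarity.

    import java.util.ArrayList;
    import java.util.List;

    /** Toy inverted-file (IVF) index; hypothetical, for illustration only. */
    class ToyIvfIndex {
      private final float[][] centroids;       // trained up front by k-means
      private final List<List<float[]>> lists; // one bucket per centroid

      ToyIvfIndex(float[][] centroids) {
        this.centroids = centroids;
        this.lists = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) {
          lists.add(new ArrayList<>());
        }
      }

      /** Index a vector by appending it to the bucket of its nearest centroid. */
      void add(float[] vector) {
        lists.get(nearest(vector)).add(vector);
      }

      /** Scan only the single nearest bucket (nprobe = 1 for brevity). */
      float[] search(float[] query) {
        float best = Float.NEGATIVE_INFINITY;
        float[] bestVector = null;
        for (float[] candidate : lists.get(nearest(query))) {
          float score = dot(query, candidate);
          if (score > best) {
            best = score;
            bestVector = candidate;
          }
        }
        return bestVector;
      }

      /** Nearest centroid by dot product (assumes normalized vectors). */
      private int nearest(float[] vector) {
        int best = 0;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
          float score = dot(centroids[i], vector);
          if (score > bestScore) {
            bestScore = score;
            best = i;
          }
        }
        return best;
      }

      private static float dot(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
          sum += a[i] * b[i];
        }
        return sum;
      }
    }

Because assignment depends only on the fixed, pre-trained centroids, two
segments built this way could in principle be merged by concatenating their
per-centroid buckets; the hard part, as noted above, is obtaining good
centroids before the first vector is indexed.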
> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcm...@gmail.com> wrote:
>>
>> Personally I'd have to re-read the paper, but in general the merging
>> issue has to be addressed somehow to fix the overall indexing time
>> problem. It seems it gets "dodged" with huge RAM buffers in the emails
>> here.
>> Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> As an example, I'm most familiar with adding DEFLATE compression to
>> stored fields. Previously, we'd basically decompress and recompress
>> the stored fields on merge, and LZ4 is so fast that it wasn't
>> obviously a problem. But with DEFLATE it got slower/heavier (a more
>> intense compression algorithm), so something had to be done or indexing
>> would be unacceptably slow. Hence, if you look at the stored fields
>> writer, there is "dirtiness" logic etc. so that recompression is
>> amortized over time and doesn't happen on every merge.
>>
>> On Fri, Apr 7, 2023 at 5:38 PM jim ferenczi <jim.feren...@gmail.com> wrote:
>> >
>> > I am also not sure that DiskANN would solve the merging issue. The idea
>> > described in the paper is to run k-means first to create multiple
>> > graphs, one per cluster. In our case the vectors in each segment could
>> > belong to different clusters, so I don't see how we could merge them
>> > efficiently.
>> >
>> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.feren...@gmail.com> wrote:
>> >>
>> >> The inference time (and cost) to generate these big vectors must be
>> >> quite large too ;).
>> >> Regarding the RAM buffer, we could drastically reduce its size by
>> >> writing the vectors to disk instead of keeping them in the heap. With
>> >> 1k dimensions the RAM buffer fills up with these vectors quite rapidly.
>> >>
>> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcm...@gmail.com> wrote:
>> >>>
>> >>> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov <msoko...@gmail.com> wrote:
>> >>> >
>> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>> >>> >
>> >>> > Robert, since you're the only on-the-record veto here, does this
>> >>> > change your thinking at all, or if not could you share some test
>> >>> > results that didn't go the way you expected? Maybe we can find some
>> >>> > mitigation if we focus on a specific issue.
>> >>> >
>> >>>
>> >>> My scale concerns are both space and time. What does the execution
>> >>> time look like if you don't set an insanely large IW RAM buffer? The
>> >>> default is 16MB. I'm just concerned we're shoving some problems under
>> >>> the rug :)
>> >>>
>> >>> Even with the huge RAM buffer, we're still talking about almost 2
>> >>> hours to index 4M documents with these 2k vectors. Whereas you'd
>> >>> measure this in seconds with typical Lucene indexing; it's nothing.
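To ground the numbers in this exchange, here is a minimal sketch of roughly
what such an indexing run configures, assuming Lucene 9.x (the directory
path and field name are invented, and 2048-dimensional vectors would
additionally require the very limit increase under discussion):

    import java.nio.file.Paths;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class VectorIndexingSketch {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig()
            // The default RAM buffer is 16 MB
            // (IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB); the runs
            // quoted above set 1994 MB, flushing roughly 125x less often.
            .setRAMBufferSizeMB(1994);

        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/vectors"));
            IndexWriter writer = new IndexWriter(dir, config)) {
          // One document with a 1024-dimensional float vector field.
          float[] vector = new float[1024]; // filled from a model in practice
          Document doc = new Document();
          doc.add(new KnnFloatVectorField(
              "vec", vector, VectorSimilarityFunction.EUCLIDEAN));
          writer.addDocument(doc);
        }
      }
    }

The point of contention reads directly off the setRAMBufferSizeMB call: with
the 16 MB default, a buffer holding these large vectors fills and flushes a
small segment very quickly, so the per-segment graphs get rebuilt over and
over during merges, which is where the concern about hiding the merge cost
behind a huge buffer comes from.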