I don't think we have. The performance needs to be reasonable in order to bump this limit; otherwise, bumping it makes the worst case 2x worse than it already is!
Moreover, it's clear something needs to happen to address the
scalability/lack of performance. I'd hate for this limit to be in the way of
that. Because of backwards compatibility, it's a one-way, permanent,
irreversible change. I'm not sold by any means in any way yet. My vote
remains the same.

On Fri, Apr 7, 2023 at 10:57 PM Michael Wechner <michael.wech...@wyona.com> wrote:
>
> Sorry to interrupt, but I think we are getting side-tracked from the
> original discussion about increasing the vector dimension limit.
>
> I think improving vector indexing performance is one thing, and making
> sure Lucene does not crash when increasing the vector dimension limit is
> another.
>
> I think it is great to find better ways to index vectors, but this should
> not prevent people from being able to use models with vector dimensions
> higher than 1024.
>
> The following comparison might not be perfect, but imagine we have
> invented a combustion engine which is strong enough to move a car on flat
> ground, but which will fail when applied to a truck hauling things over
> mountains, because it is not strong enough. Would you prevent people from
> using the combustion engine for a car on flat ground?
>
> Thanks
>
> Michael
>
> On 08.04.23 at 00:15, jim ferenczi wrote:
>
> > Keep in mind, there may be other ways to do it. In general if merging
> > something is going to be "heavyweight", we should think about it to
> > prevent things from going really bad overall.
>
> Yep, I agree. Personally I don't see how we can solve this without prior
> knowledge of the vectors. Faiss has a nice implementation that fits
> naturally with Lucene, called IVF
> (https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVF.html),
> but if we want to avoid running k-means on every merge, we'd need to
> provide the clusters for the entire index before indexing the first
> vector. It's a complex issue…
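To make the IVF idea above concrete, here is a minimal toy sketch in Java.
All names are hypothetical (this is neither the Faiss nor the Lucene API):
vectors are bucketed under the nearest of k pre-trained centroids, and a
query probes only its closest bucket. It assumes unit-length vectors so that
dot product works as the similarity.

    import java.util.ArrayList;
    import java.util.List;

    /** Toy inverted-file (IVF) index; hypothetical, for illustration only. */
    class ToyIvfIndex {
      private final float[][] centroids;       // trained up front by k-means
      private final List<List<float[]>> lists; // one bucket per centroid

      ToyIvfIndex(float[][] centroids) {
        this.centroids = centroids;
        this.lists = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) {
          lists.add(new ArrayList<>());
        }
      }

      /** Index a vector by appending it to the bucket of its nearest centroid. */
      void add(float[] vector) {
        lists.get(nearest(vector)).add(vector);
      }

      /** Scan only the single nearest bucket (nprobe = 1 for brevity). */
      float[] search(float[] query) {
        float best = Float.NEGATIVE_INFINITY;
        float[] bestVector = null;
        for (float[] candidate : lists.get(nearest(query))) {
          float score = dot(query, candidate);
          if (score > best) {
            best = score;
            bestVector = candidate;
          }
        }
        return bestVector;
      }

      /** Nearest centroid by dot product (assumes normalized vectors). */
      private int nearest(float[] vector) {
        int best = 0;
        float bestScore = Float.NEGATIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
          float score = dot(centroids[i], vector);
          if (score > bestScore) {
            bestScore = score;
            best = i;
          }
        }
        return best;
      }

      private static float dot(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
          sum += a[i] * b[i];
        }
        return sum;
      }
    }

Because assignment depends only on the fixed, pre-trained centroids, two
segments built this way could in principle be merged by concatenating their
per-centroid buckets; the hard part, as noted above, is obtaining good
centroids before the first vector is indexed.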
> On Fri, 7 Apr 2023 at 22:58, Robert Muir <rcm...@gmail.com> wrote:
>>
>> Personally I'd have to re-read the paper, but in general the merging
>> issue has to be addressed somehow to fix the overall indexing time
>> problem. It seems it gets "dodged" with huge RAM buffers in the emails
>> here.
>> Keep in mind, there may be other ways to do it. In general if merging
>> something is going to be "heavyweight", we should think about it to
>> prevent things from going really bad overall.
>>
>> As an example, I'm most familiar with adding DEFLATE compression to
>> stored fields. Previously, we'd basically decompress and recompress
>> the stored fields on merge, and LZ4 is so fast that it wasn't
>> obviously a problem. But with DEFLATE it got slower/heavier (a more
>> intense compression algorithm), so something had to be done or indexing
>> would be unacceptably slow. Hence, if you look at the stored fields
>> writer, there is "dirtiness" logic etc. so that recompression is
>> amortized over time and doesn't happen on every merge.
>>
>> On Fri, Apr 7, 2023 at 5:38 PM jim ferenczi <jim.feren...@gmail.com> wrote:
>> >
>> > I am also not sure that DiskANN would solve the merging issue. The idea
>> > described in the paper is to run k-means first to create multiple
>> > graphs, one per cluster. In our case the vectors in each segment could
>> > belong to different clusters, so I don't see how we could merge them
>> > efficiently.
>> >
>> > On Fri, 7 Apr 2023 at 22:28, jim ferenczi <jim.feren...@gmail.com> wrote:
>> >>
>> >> The inference time (and cost) to generate these big vectors must be
>> >> quite large too ;).
>> >> Regarding the RAM buffer, we could drastically reduce its size by
>> >> writing the vectors to disk instead of keeping them in the heap. With
>> >> 1k dimensions the RAM buffer fills up with these vectors quite rapidly.
>> >>
>> >> On Fri, 7 Apr 2023 at 21:59, Robert Muir <rcm...@gmail.com> wrote:
>> >>>
>> >>> On Fri, Apr 7, 2023 at 7:47 AM Michael Sokolov <msoko...@gmail.com> wrote:
>> >>> >
>> >>> > 8M 1024d float vectors indexed in 1h48m (16G heap, IW buffer size=1994)
>> >>> > 4M 2048d float vectors indexed in 1h44m (w/ 4G heap, IW buffer size=1994)
>> >>> >
>> >>> > Robert, since you're the only on-the-record veto here, does this
>> >>> > change your thinking at all, or if not could you share some test
>> >>> > results that didn't go the way you expected? Maybe we can find some
>> >>> > mitigation if we focus on a specific issue.
>> >>> >
>> >>>
>> >>> My scale concerns are both space and time. What does the execution
>> >>> time look like if you don't set an insanely large IW RAM buffer? The
>> >>> default is 16MB. I'm just concerned we're shoving some problems under
>> >>> the rug :)
>> >>>
>> >>> Even with the huge RAM buffer, we're still talking about almost 2
>> >>> hours to index 4M documents with these 2k vectors. Whereas you'd
>> >>> measure this in seconds with typical Lucene indexing; it's nothing.
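To ground the numbers in this exchange, here is a minimal sketch of roughly
what such an indexing run configures, assuming Lucene 9.x (the directory
path and field name are invented, and 2048-dimensional vectors would
additionally require the very limit increase under discussion):

    import java.nio.file.Paths;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnFloatVectorField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.store.FSDirectory;

    public class VectorIndexingSketch {
      public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig()
            // The default RAM buffer is 16 MB
            // (IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB); the runs
            // quoted above set 1994 MB, flushing roughly 125x less often.
            .setRAMBufferSizeMB(1994);

        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/vectors"));
            IndexWriter writer = new IndexWriter(dir, config)) {
          // One document with a 1024-dimensional float vector field.
          float[] vector = new float[1024]; // filled from a model in practice
          Document doc = new Document();
          doc.add(new KnnFloatVectorField(
              "vec", vector, VectorSimilarityFunction.EUCLIDEAN));
          writer.addDocument(doc);
        }
      }
    }

The point of contention reads directly off the setRAMBufferSizeMB call: with
the 16 MB default, a buffer holding these large vectors fills and flushes a
small segment very quickly, so the per-segment graphs get rebuilt over and
over during merges, which is where the concern about hiding the merge cost
behind a huge buffer comes from.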