kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1888208867
> How is segment merging implemented by Lucene Vamana?

I didn't do anything special for Vamana in these experiments; index construction and merging are practically identical to HNSW. The implication, of course, is that the vectors need to be in memory when building the merged graph, or performance falls off a cliff. To make this useful operationally you'd want to ensure your max segment size does not exceed available memory, and ideally you would build the index on a dedicated node so as not to displace pages being used to serve queries.

The DiskANN paper suggests clustering the vectors, assigning each vector to its two closest clusters, building a graph for each cluster, then overlaying the graphs and doing another round of pruning. You can then bound the amount of memory required based on the number of vectors and clusters. I didn't explore this, but it'd be a worthwhile improvement in the long term.

> Correct me if I'm wrong, looks like the Java ANN implementations examine the node ids in a more or less random order, filtering requires a `Bits` object. I've been wondering if SPANN solves this by enabling ascending doc id intersection.

That's correct, the graph is explored (roughly) in proximity order rather than doc id order. It's an interesting question about SPANN. Intuitively, it seems like you may be able to do as you describe by keeping each centroid's postings list sorted by doc id, then essentially doing a streaming merge sort over the relevant centroids' postings as you consume the candidates and iterate through the passed-in sorted doc id iterator.

> I have general concerns about large copy overheads

I believe all of today's I/O layer implementations in Lucene ([including mmap on Java 21](https://github.com/apache/lucene/blob/8bee41880e41024109bf5729584ebc5dd1003717/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java#L214)) copy the bytes from the page cache onto the heap one way or another.
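To make the clustered merge idea from the DiskANN paper concrete, here's a minimal sketch. All names are hypothetical (this is not Lucene or DiskANN code), the per-cluster graph building is elided, and the final re-pruning is simplified to "keep the closest `maxDegree` neighbors" rather than Vamana's alpha-pruning:

```java
import java.util.*;

// Sketch of the DiskANN-style clustered merge described above (hypothetical
// helper names, not a Lucene API): assign each vector to its two closest
// centroids, build a graph per cluster so only that cluster's vectors need to
// be resident, then overlay the per-cluster graphs and re-prune.
public class ClusteredMergeSketch {

  static double dist(float[] a, float[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
    return s;
  }

  // Assign each vector to its two nearest centroids; memory during graph
  // building is then bounded by the largest cluster, not the whole segment.
  static List<List<Integer>> assign(float[][] vectors, float[][] centroids) {
    List<List<Integer>> clusters = new ArrayList<>();
    for (int c = 0; c < centroids.length; c++) clusters.add(new ArrayList<>());
    for (int v = 0; v < vectors.length; v++) {
      Integer[] order = new Integer[centroids.length];
      for (int c = 0; c < centroids.length; c++) order[c] = c;
      final int fv = v;
      Arrays.sort(order, Comparator.comparingDouble(c -> dist(vectors[fv], centroids[c])));
      clusters.get(order[0]).add(v);
      if (centroids.length > 1) clusters.get(order[1]).add(v);
    }
    return clusters;
  }

  // Overlay the per-cluster graphs by unioning neighbor lists, then prune each
  // node back down to maxDegree by keeping its closest neighbors (a simplified
  // stand-in for the paper's pruning round).
  static Map<Integer, List<Integer>> overlay(List<Map<Integer, List<Integer>>> graphs,
                                             float[][] vectors, int maxDegree) {
    Map<Integer, Set<Integer>> union = new HashMap<>();
    for (Map<Integer, List<Integer>> g : graphs)
      g.forEach((v, nbrs) -> union.computeIfAbsent(v, k -> new HashSet<>()).addAll(nbrs));
    Map<Integer, List<Integer>> pruned = new HashMap<>();
    union.forEach((v, nbrs) -> {
      List<Integer> sorted = new ArrayList<>(nbrs);
      sorted.sort(Comparator.comparingDouble(n -> dist(vectors[v], vectors[n])));
      pruned.put(v, sorted.subList(0, Math.min(maxDegree, sorted.size())));
    });
    return pruned;
  }
}
```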
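The streaming merge sort speculated about above can be sketched as a k-way heap merge over doc-id-sorted postings, intersected with a sorted filter of accepted doc ids. This is my own illustration, not SPANN's implementation; duplicates are skipped because a vector may be assigned to more than one centroid:

```java
import java.util.*;

// Sketch of doc-id-ordered candidate intersection: each centroid's postings
// list is sorted by doc id, and a k-way streaming merge advances in lockstep
// with a sorted filter, producing accepted candidates in ascending doc id order.
public class PostingsMergeSketch {

  // Heap entries are {docId, postingsListIndex, positionInList}.
  static List<Integer> intersect(List<int[]> centroidPostings, int[] sortedFilter) {
    PriorityQueue<int[]> heap = new PriorityQueue<>(Comparator.comparingInt(e -> e[0]));
    for (int i = 0; i < centroidPostings.size(); i++) {
      int[] postings = centroidPostings.get(i);
      if (postings.length > 0) heap.add(new int[] {postings[0], i, 0});
    }
    List<Integer> out = new ArrayList<>();
    int f = 0, last = -1;
    while (!heap.isEmpty() && f < sortedFilter.length) {
      int[] top = heap.poll();
      int doc = top[0], list = top[1], pos = top[2];
      int[] postings = centroidPostings.get(list);
      if (pos + 1 < postings.length) heap.add(new int[] {postings[pos + 1], list, pos + 1});
      if (doc == last) continue; // duplicate: vector assigned to two centroids
      // Advance the sorted filter iterator; emit only doc ids present in both.
      while (f < sortedFilter.length && sortedFilter[f] < doc) f++;
      if (f < sortedFilter.length && sortedFilter[f] == doc) out.add(doc);
      last = doc;
    }
    return out;
  }
}
```

The appeal is that candidates stream out in ascending doc id order, so the filter can be a sorted doc id iterator rather than a random-access `Bits` object.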
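To illustrate the copy-vs-no-copy distinction, here is a toy example using NIO buffers as a stand-in for Lucene's `IndexInput` over a mapped region (the method names are hypothetical): one path materializes the vector as a heap `float[]` before scoring, the other computes directly against the buffer:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustration only: a direct ByteBuffer stands in for a mmap'd region.
// Today's readers take the copying path; the alternative reads each value
// straight out of the (possibly off-heap) region with no intermediate array.
public class CopyVsDirectSketch {

  // Copying path: materialize the vector on the heap first.
  static float dotCopy(ByteBuffer mapped, int offset, float[] query) {
    float[] vec = new float[query.length];
    ByteBuffer dup = mapped.duplicate().order(mapped.order()); // duplicate() resets byte order
    dup.position(offset);
    dup.asFloatBuffer().get(vec, 0, query.length); // the heap copy
    float sum = 0;
    for (int i = 0; i < query.length; i++) sum += vec[i] * query[i];
    return sum;
  }

  // No-copy path: absolute reads straight out of the buffer.
  static float dotDirect(ByteBuffer mapped, int offset, float[] query) {
    float sum = 0;
    for (int i = 0; i < query.length; i++)
      sum += mapped.getFloat(offset + i * Float.BYTES) * query[i];
    return sum;
  }
}
```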
This could potentially be improved in the future by passing a `MemorySegment` around, but the current approach seems to have gotten things pretty far without problems.

> I know that is a newer Linux kernel module/interface, so that would lead me to believe actually taking this implementation forward would require a bit of comprehensive testing. ... The other concern I have is that I suspect the community would probably need to keep an eye on io_uring given its age.

Yeah, I wouldn't suggest that this be the default implementation, or perhaps even that it be included in Lucene core at all, given how platform specific it is (and this POC required a small amount, ~50 LOC, of C glue code linking in `liburing` to make things smooth). My main goal here was to highlight to the community a few things that may be relevant beyond vector indexes and DiskANN:

- modern parallel I/O can provide significant performance improvements if you're able to remove I/O dependencies in the algorithms
- the kernel is not doing a great job in this case of determining the best pages to keep in memory

There's clearly a ton of perf left on the table by using sequential I/O and only the page cache in this case, but that approach is of course also much simpler, both in terms of implementation and operational complexity.
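The "remove I/O dependencies" point above can be sketched as follows: instead of fetching one neighbor's vector at a time, a search step issues all of the current candidates' reads concurrently, so the step's latency is bounded by roughly one round trip rather than by the fan-out. The POC used io_uring for this; the sketch below uses a thread pool as a portable stand-in, and `VectorReader` is a hypothetical interface, not a Lucene API:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of batching a search step's vector reads. With a dependency-free
// batch of nodes to score, all reads can be in flight at once (io_uring in the
// POC; an ExecutorService here), instead of one blocking read per neighbor.
public class ParallelFetchSketch {

  interface VectorReader {
    float[] read(int node); // hypothetical: load one node's vector from disk
  }

  static List<float[]> fetchAll(List<Integer> nodes, VectorReader reader,
                                ExecutorService pool) {
    // Submit every read before waiting on any of them.
    List<Future<float[]>> futures = new ArrayList<>();
    for (int node : nodes) futures.add(pool.submit(() -> reader.read(node)));
    List<float[]> result = new ArrayList<>();
    try {
      for (Future<float[]> f : futures) result.add(f.get());
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    }
    return result;
  }
}
```

The algorithmic work is in making the batch dependency-free in the first place: beam search naturally yields a frontier of candidates whose vectors can all be fetched before any of them is scored.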