kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1888208867

   > How is segment merging implemented by Lucene Vamana?
   
   I didn't do anything special for Vamana in these experiments, the index 
construction and merging are practically identical to HNSW. The implication of 
course here is that the vectors need to be in memory when building the merged 
graph or performance falls off a cliff. To make this useful operationally you'd 
want to ensure your max segment size does not exceed available memory, and 
ideally you would build the index on a dedicated node to not displace pages 
being used to serve queries.
   
   The DiskANN paper suggests clustering the vectors and assigning each vector 
to the two closest clusters, building a graph for each cluster, then overlaying 
the graphs and doing another round of pruning. You can then bound the amount of 
memory required based on the number of vectors and clusters. I didn't explore 
this, but it'd be a worthwhile improvement in the long term.
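
   As a minimal sketch of just the assignment step (not from the POC; names and shapes here are illustrative), each vector is assigned to its two nearest centroids, so each per-cluster graph only needs its assigned vectors in memory while being built:

```java
import java.util.*;

// Sketch of the DiskANN-style overlapping assignment: each vector goes to
// its two closest centroids. Per-cluster graph construction and the final
// edge-union/prune pass are elided.
public class ClusterAssign {
    static double dist2(float[] a, float[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) { double t = a[i] - b[i]; d += t * t; }
        return d;
    }

    /** Returns, for each vector, the indices of its two nearest centroids. */
    static int[][] assignToTwoClosest(float[][] vectors, float[][] centroids) {
        int[][] assignments = new int[vectors.length][2];
        for (int v = 0; v < vectors.length; v++) {
            int best = -1, second = -1;
            double bestD = Double.MAX_VALUE, secondD = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = dist2(vectors[v], centroids[c]);
                if (d < bestD) { second = best; secondD = bestD; best = c; bestD = d; }
                else if (d < secondD) { second = c; secondD = d; }
            }
            assignments[v][0] = best;
            assignments[v][1] = second;
        }
        return assignments;
    }

    public static void main(String[] args) {
        float[][] centroids = { {0f, 0f}, {10f, 0f}, {0f, 10f} };
        float[][] vectors = { {1f, 1f}, {9f, 1f}, {1f, 9f} };
        int[][] a = assignToTwoClosest(vectors, centroids);
        // Each vector lands in two clusters' shards, bounding per-shard memory.
        System.out.println(Arrays.deepToString(a)); // prints "[[0, 1], [1, 0], [2, 0]]"
    }
}
```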
   
   > Correct me if I'm wrong, looks like the Java ANN implementations examine 
the node ids in a more or less random order, filtering requires a `Bits` 
object. I've been wondering if SPANN solves this by enabling ascending doc id 
intersection.
   
   That's correct, the graph is explored (roughly) in proximity order rather 
than doc id order. The SPANN question is interesting: intuitively, you might be 
able to do what you describe by keeping each centroid's postings list sorted by 
doc id, then essentially doing a streaming merge sort over the relevant 
centroids' postings as you consume candidates and advance through the passed-in 
sorted doc id iterator.
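
   A hypothetical sketch of that idea (none of these names are Lucene or SPANN APIs): each centroid's postings list is kept sorted by doc id, candidates come from a streaming k-way merge over the relevant lists, and the stream is intersected against the caller's sorted doc id iterator, here modeled as a sorted array:

```java
import java.util.*;

// k-way merge of doc-id-sorted postings lists, intersected on the fly with
// a sorted filter. A doc may appear in two clusters' postings, so duplicates
// are skipped as the merge advances.
public class PostingsIntersect {
    static List<Integer> intersect(int[][] postings, int[] filterDocs) {
        // min-heap of (docId, listIndex, offsetInList)
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int i = 0; i < postings.length; i++) {
            if (postings[i].length > 0) heap.add(new int[] { postings[i][0], i, 0 });
        }
        List<Integer> out = new ArrayList<>();
        int f = 0;     // cursor into the sorted filter
        int last = -1; // last emitted doc, for dedupe across overlapping clusters
        while (!heap.isEmpty() && f < filterDocs.length) {
            int[] top = heap.poll();
            int doc = top[0];
            int next = top[2] + 1;
            if (next < postings[top[1]].length) {
                heap.add(new int[] { postings[top[1]][next], top[1], next });
            }
            if (doc == last) continue;
            last = doc;
            while (f < filterDocs.length && filterDocs[f] < doc) f++;
            if (f < filterDocs.length && filterDocs[f] == doc) out.add(doc);
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] postings = { {1, 4, 9}, {2, 4, 7} }; // two centroids' lists
        int[] filter = { 2, 4, 8, 9 };               // sorted allowed doc ids
        System.out.println(intersect(postings, filter)); // prints "[2, 4, 9]"
    }
}
```

The merge only ever looks at the head of each postings list, so the intersection streams in ascending doc id order without materializing the union.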
   
   > I have general concerns about large copy overheads
   
   I believe all of today's IO layer implementations in Lucene ([including mmap 
on java 
21](https://github.com/apache/lucene/blob/8bee41880e41024109bf5729584ebc5dd1003717/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java#L214))
 copy the bytes from the page cache onto the heap one way or another. This 
could potentially be improved in the future by passing a `MemorySegment` 
around, but the current approach has gotten things pretty far without problems.
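
   For illustration, a self-contained sketch of the copy being described, using the finalized FFM API (Java 22+; the Lucene code linked above uses the Java 21 preview form). Even with a memory-mapped file, reading bytes means copying them from the mapped segment onto a heap array:

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedCopy {
    /** Maps the file and copies its contents onto a heap byte[]. */
    static byte[] copyToHeap(Path p) throws IOException {
        try (Arena arena = Arena.ofConfined();
             FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            MemorySegment mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            byte[] heap = new byte[(int) mapped.byteSize()];
            // The copy from page-cache-backed mapped memory onto the heap:
            MemorySegment.copy(mapped, ValueLayout.JAVA_BYTE, 0, heap, 0, heap.length);
            return heap;
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("seg", ".bin");
        try {
            Files.write(p, new byte[] { 10, 20, 30, 40 });
            byte[] heap = copyToHeap(p);
            System.out.println(heap[0] + " " + heap[3]); // prints "10 40"
        } finally {
            Files.deleteIfExists(p);
        }
    }
}
```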
   
   > I know that is a newer Linux kernel module/interface, so that would lead 
me to believe actually taking this implementation forward would require a bit 
of comprehensive testing. ... The other concern I have is that I suspect the 
community would probably need to keep an eye on io_uring given its age.
   
   Yeah, I wouldn't suggest that this be the default implementation, or perhaps 
even included in Lucene core itself at all given how platform specific it is 
(and this POC at least required a small amount of C glue code, ~50 LOC, linking 
in `liburing` to make things smooth). My main goal here was to highlight to the 
community a few things that may be relevant beyond vector indexes and DiskANN:
   - modern parallel I/O can provide significant performance improvements if 
you're able to remove I/O dependencies in the algorithms
   - the kernel is not doing a great job in this case of determining the best 
pages to keep in memory
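
   As a plain-Java illustration of the first point (this is a thread pool issuing positional reads, not io_uring): when the traversal can name several needed offsets up front, the reads have no dependency on each other and can be in flight concurrently instead of one at a time:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelReads {
    /** Issues one positional read per offset on a pool; returns the first byte of each. */
    static int[] readFirstBytes(Path p, long[] offsets) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(offsets.length);
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            List<Future<Integer>> futures = new ArrayList<>();
            for (long off : offsets) {
                futures.add(pool.submit(() -> {
                    ByteBuffer buf = ByteBuffer.allocate(1);
                    ch.read(buf, off); // positional read: safe to issue concurrently
                    return buf.get(0) & 0xff;
                }));
            }
            int[] out = new int[futures.size()];
            for (int i = 0; i < out.length; i++) out[i] = futures.get(i).get();
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("vectors", ".bin");
        try {
            byte[] data = new byte[4096];
            for (int i = 0; i < data.length; i++) data[i] = (byte) i;
            Files.write(p, data);
            // Four reads with no dependency between them, issued in parallel.
            int[] firsts = readFirstBytes(p, new long[] { 0, 1025, 2050, 3075 });
            System.out.println(Arrays.toString(firsts)); // prints "[0, 1, 2, 3]"
        } finally {
            Files.deleteIfExists(p);
        }
    }
}
```

The win io_uring adds on top of this is submitting many reads from one thread with far fewer syscalls; the algorithmic prerequisite, independent reads, is the same.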
   
   There's clearly a ton of perf left on the table by using sequential I/O and 
only the page cache in this case, but it's of course also much simpler both in 
terms of implementation and operational complexity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

