abernardi597 commented on PR #15472: URL: https://github.com/apache/lucene/pull/15472#issuecomment-4012846854
Here are some of the results from my benchmarking on the Cohere V3 dataset.

## Indexing

On a `m8g.16xlarge` EC2 instance running AL2, I indexed 5M documents using 16 indexing threads and 48 merge threads with both Lucene's HNSW codec and the candidate JVector codec over a range of compression levels. For HNSW indices, compression uses Optimized Scalar Quantization (OSQ), while JVector relies on Product Quantization (PQ). Lucene's OSQ supports 8-bit, 4-bit, and 1-bit quantization, achieving 4x, 8x, and 32x compression, respectively. PQ supports the same compression levels by setting the number of subvectors appropriately (i.e. 1-dim subvectors = 4x compression, 2-dim subvectors = 8x, etc.). All indices were created with `maxConn=64` and `beamWidth=250` using the COSINE similarity metric. For JVector, I set `alpha=1.0`, `neighborOverflow=1.2`, and `useHierarchy=true` to generate graphs with a structure similar to HNSW and mitigate differences in recall due to graph quality.

The graph above shows the wall-clock time to index all 5M documents, where the slightly transparent section of each bar denotes the additional time taken to force-merge the segments into a single graph. Overall, indexing time is quite comparable for 1-bit quantization and for full-precision vectors, but Lucene shows much better performance with 4-bit and 8-bit quantization. In every case, however, JVector spends an inordinate amount of time merging segments compared to HNSW. The discrepancy is likely due to a key optimization in Lucene's HNSW implementation that reuses adjacency information from the incoming graphs during merge, reducing the search cost of inserting most nodes into the merged graph. I did not port this functionality, but I am interested in contributing it upstream.

## Searching

The following search benchmarks perform 10k queries across 8 threads using Lucene's `MMapDirectory` to memory-map the index files.
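As an aside, the compression ratios quoted for OSQ and PQ in the indexing section fall out of simple bits-per-dimension arithmetic. A standalone sketch (not tied to either implementation; 1024 dimensions assumed, which I believe matches Cohere V3 embeddings, and 256-centroid PQ codebooks, i.e. one byte per subvector):

```python
BITS_FP32 = 32  # bits per dimension for a full-precision float vector

def osq_compression(bits_per_dim: int) -> float:
    """Ratio of full-precision size to scalar-quantized size,
    ignoring per-vector metadata such as correction terms."""
    return BITS_FP32 / bits_per_dim

def pq_compression(dim: int, subvector_dim: int, codebook_bits: int = 8) -> float:
    """Each subvector is replaced by one codebook index
    (8 bits for a 256-centroid codebook)."""
    num_subvectors = dim // subvector_dim
    return (dim * BITS_FP32) / (num_subvectors * codebook_bits)

# OSQ: 8-bit -> 4x, 4-bit -> 8x, 1-bit -> 32x
print([osq_compression(b) for b in (8, 4, 1)])        # [4.0, 8.0, 32.0]
# PQ on 1024-dim vectors: 1-dim subvectors -> 4x, 2-dim -> 8x, 8-dim -> 32x
print([pq_compression(1024, d) for d in (1, 2, 8)])   # [4.0, 8.0, 32.0]
```

This is why a given compression target can be matched across both codecs by choosing the OSQ bit width and the PQ subvector dimensionality accordingly.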
The `topK=100` nearest neighbors for each target vector are compared against a ground truth computed by brute force to measure recall. For each index, I ran the benchmarks with `overSample=1..5`, both with and without reranking. The benchmarks execute each query once as a warm-up before measuring anything.

To get an idea of best-case performance, I first ran the search benchmarks on the same `m8g.16xlarge` instance I used to build the index, since it can hold the entire index in memory and thus simulates a "hot" index. I used the Lucene preload API to ensure that the index files were loaded into memory. To simulate a "warm" index, I also ran the benchmarks in a memory-constrained environment: a `c8gd.xlarge` instance running AL2023, with the index files copied onto the direct-attached SSD formatted with xfs. The benchmarks do not attempt to preload the index files in this case, since they cannot all fit in the machine's limited memory. Moreover, `MADV_RANDOM` is used for the graph index and vector files to minimize IO saturation from overzealous kernel prefetching, since disk access is mostly random.

This plot of recall against average CPU time per query illustrates how over-sampling and reranking affect latency and recall. Each line connects the points with increasing over-sampling; dashed lines connect the data points with reranking, while solid lines connect those without. From this we can see that Lucene generally outperforms JVector for "hot" indices. JVector does seem to benefit more from reranking when the first phase lacks full-precision vectors, though this may be an artifact of PQ more than anything. It is also notable that JVector at 32x compression has lower latencies than HNSW, but again it's hard to say whether that is a win for the JVector algorithm rather than for its implementation of PQ. For the "warm" index benchmarks, JVector's performance appears less affected by reranking.
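For clarity, the oversample-then-rerank procedure and the recall metric being measured can be sketched as follows. This is hypothetical illustrative Python, not the benchmark code: real implementations score candidates during a graph traversal rather than scoring every vector, but the two-phase structure is the same.

```python
import numpy as np

def recall_at_k(retrieved, ground_truth, k=100):
    """Fraction of the true top-k neighbors recovered in the retrieved top-k."""
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k

def search_with_rerank(query, approx_scores, full_vectors, k=100, over_sample=3):
    """Phase 1: take the k * over_sample best candidates by approximate
    (quantized) score. Phase 2: re-score only that pool with the
    full-precision vectors and keep the top k."""
    pool = np.argsort(-approx_scores)[: k * over_sample]
    exact = full_vectors[pool] @ query
    return pool[np.argsort(-exact)[:k]]
```

Without reranking, phase 2 is skipped and the approximate ordering is returned directly, which is why larger `overSample` values trade extra phase-1 work (and, for reranking, extra full-precision reads) for higher recall.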
This would indicate the advantage of DiskANN-style reranking, where full-precision vectors are gathered inline with the graph search (relying on page locality to get them for "free") rather than in an explicit second phase. At the time of writing, however, it seems that JVector also [performs re-ranking in a second phase](https://github.com/datastax/jvector/blob/3ca3ce10b20d807c95f4843cc989ae0faf712a54/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphSearcher.java#L229), like Lucene's HNSW. This leads me to believe the discrepancy comes from how JVector explicitly loads quantized vectors onto the heap, whereas HNSW memory-maps them. With fewer disk accesses to juggle, the page cache may keep the full-precision vectors (which locality suggests are also touched while traversing the index) resident more effectively, even though the larger heap leaves it less memory to work with.

---

Overall, it seems Lucene HNSW is more competitive with JVector than [some benchmarks](https://github.com/opensearch-project/opensearch-jvector/blob/main/README.md#benchmarks) would suggest. Even so, it could still be worth adding JVector to the sandbox as an alternative KNN engine, or perhaps incorporating some of its ideas/features into Lucene. I will update this PR with a non-draft revision, pending some cleanup of my commits and `knnPerfTest.py` PRs in `luceneutil`.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
