msokolov opened a new pull request #2173: URL: https://github.com/apache/lucene-solr/pull/2173
This PR has a standalone test program that demonstrates some interesting performance characteristics of different data access patterns. It's not intended to be merged; I'm just putting it up here for visibility and discussion.

I have been working on improving the performance of vector KNN search, and in particular on closing the gap with the nmslib/hnswlib reference implementation. hnswlib is coded in C++, but I believe a pure Java implementation should be able to provide pretty close performance. My goal has been to get within 2x, but we're still pretty far off from that, maybe an 8x difference. Looking at the profiler, we seem to spend all our time in dot product computation, which is expected.

So I wrote a simple benchmark to look more closely at this aspect and stumbled on something that really surprised me, demonstrated by this program: the speed of the bulk dot product computation we do (a thousand or so per query, for an index of 1M vectors) is heavily influenced by memory access patterns. I'm guessing it has something to do with making better use of the CPU's memory caches, avoiding the need to go out to main memory.

In this micro-benchmark I compared a few different memory access patterns, computing 1M dot products across pairs of vectors taken from the same set. The fastest is to load all floats into a single contiguous on-heap array and access it via offsets (which is like what `hnswlib` does). I compared that with various other models, including one simulating what we do today in `Lucene`: memory-mapping a file, reading it as bytes, and converting those into a float array on each access. If we access the vector data sequentially, there is a 4x difference in speed between the fastest and slowest patterns, but even for random access there is nearly a 2x difference. The MANY ARRAYS case pre-loads all vectors on heap, but stores them in a separate array per vector, rather than in a single contiguous array.
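To make the two on-heap layouts concrete, here is a minimal sketch (class and method names are mine, not from the benchmark program): ONE BIG ARRAY stores every vector in one contiguous `float[]` and addresses vector `i` at offset `i * dim`, while MANY ARRAYS keeps a separate `float[]` per vector.

```java
// Illustrative sketch of the two on-heap layouts discussed above.
// Names are hypothetical, not taken from the Lucene sources.
class DotProducts {
    // ONE BIG ARRAY: all vectors packed into a single contiguous float[],
    // vector `ord` starts at offset ord * dim
    static float dotContiguous(float[] all, int aOrd, int bOrd, int dim) {
        int a = aOrd * dim, b = bOrd * dim;
        float sum = 0;
        for (int i = 0; i < dim; i++) {
            sum += all[a + i] * all[b + i];
        }
        return sum;
    }

    // MANY ARRAYS: each vector lives in its own float[]
    static float dotArrays(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```

Both compute the same dot product; the difference measured here comes purely from how the data is laid out in memory and traversed.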
The SKIP ONE COPY case is like the BASELINE, but simulates what we might see if we implemented `IndexInput.readFloats`, so we could avoid one array copy that's needed today.

## random access pattern

| case | time/iteration |
|---|---|
| BASELINE | 0.594572 us |
| SKIP ONE COPY | 0.401249 us |
| MANY ARRAYS | 0.393746 us |
| ONE BIG ARRAY | 0.330135 us |

## sequential access pattern

| case | time/iteration |
|---|---|
| BASELINE | 0.443061 us |
| MANY ARRAYS | 0.188859 us |
| SKIP ONE COPY | 0.154549 us |
| ONE BIG ARRAY | 0.109249 us |
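As a rough sketch of what the BASELINE pattern looks like (this is illustrative code, not the actual benchmark or Lucene implementation): the vector data lives in a `ByteBuffer` standing in for a memory-mapped file, and each access decodes `dim` floats into a scratch array before the dot product can run. A bulk `readFloats`-style read would fill the scratch array directly and skip one intermediate copy.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the BASELINE access pattern: decode each vector
// from a (memory-mapped) ByteBuffer into an on-heap scratch array per access.
class ByteBufferVectors {
    final ByteBuffer data;    // stands in for a memory-mapped index file
    final int dim;
    final float[] scratch;    // reused per access; callers must copy to retain

    ByteBufferVectors(ByteBuffer data, int dim) {
        this.data = data;
        this.dim = dim;
        this.scratch = new float[dim];
    }

    // decode vector `ord` into the scratch array -- this per-access
    // byte-to-float conversion is what the on-heap layouts avoid entirely
    float[] vector(int ord) {
        data.position(ord * dim * Float.BYTES);
        data.asFloatBuffer().get(scratch);  // bulk decode of dim floats
        return scratch;
    }
}
```

In the BASELINE every dot product pays this decoding cost for the vectors it touches, which is one reason the contiguous on-heap layout comes out so far ahead.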
