msokolov opened a new pull request #2173:
URL: https://github.com/apache/lucene-solr/pull/2173


   This PR has a standalone test program that demonstrates some interesting 
performance characteristics of different data access patterns. It's not 
intended to be merged; I'm just putting it up here for visibility and 
discussion.
   
   I have been working on improving the performance of vector KNN search, and 
in particular on closing the gap with the nmslib/hnswlib reference 
implementation. hnswlib is coded in C++, but I believe a pure Java 
implementation should be able to come pretty close. My goal has 
been to get within 2x, but we're still pretty far off from that; the difference 
is maybe 8x. Looking at the profiler, we seem to spend essentially all our time 
in dot product computation, which is expected. So I wrote a simple benchmark to 
look more closely at this aspect and stumbled on something that really surprised 
me, demonstrated by this program.
   
   The speed of the bulk dot product computation we do (a thousand or so per 
query, for an index of 1M vectors) is heavily influenced by memory access 
patterns. I'm guessing it has something to do with the ability to use the CPU's 
memory caches, avoiding the need to go out to main memory.
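   For reference, the kernel in question is just a plain scalar dot product over `float[]` slices; the arithmetic is trivial, which is why memory layout ends up dominating. A minimal sketch (not the actual Lucene code):

   ```java
   // Plain scalar dot product over two float slices. Offsets into a shared
   // array let all vectors live in one contiguous block, which is the
   // cache-friendly layout discussed below.
   static float dotProduct(float[] a, int aOff, float[] b, int bOff, int dim) {
       float sum = 0f;
       for (int i = 0; i < dim; i++) {
           sum += a[aOff + i] * b[bOff + i];
       }
       return sum;
   }
   ```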
   
   In this micro-benchmark I compared a few different memory access patterns, 
computing 1M dot products across pairs of vectors taken from the same set. The 
fastest is to load all floats into a single contiguous on-heap array and 
access them via offsets into it (which is like what `hnswlib` does). I compared 
that with various other models, including one simulating what we do today in 
`Lucene`: memory-mapping a file, reading it as bytes, and converting those into 
a float array on each access. If we access the vector data sequentially, there 
is a 4x difference in speed, but even for random access there is nearly a 2x 
difference.
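   The two extremes can be sketched roughly as follows (illustrative only; method names and the scratch-buffer detail are my own, not the actual Lucene or benchmark code):

   ```java
   import java.nio.ByteBuffer;

   // ONE BIG ARRAY: all vectors packed into one contiguous float[],
   // addressed by ordinal * dim. No per-access allocation or decoding.
   static float dotFromBigArray(float[] data, int ordA, int ordB, int dim) {
       float sum = 0f;
       int a = ordA * dim, b = ordB * dim;
       for (int i = 0; i < dim; i++) {
           sum += data[a + i] * data[b + i];
       }
       return sum;
   }

   // BASELINE-style: seek to the vector's bytes in a (memory-mapped) buffer,
   // decode them into a scratch float[] on every access, then take the dot
   // product against the query vector.
   static float dotFromBytes(ByteBuffer bytes, float[] scratch, float[] query,
                             int ord, int dim) {
       bytes.position(ord * dim * Float.BYTES);
       bytes.asFloatBuffer().get(scratch, 0, dim); // decode bytes -> floats
       float sum = 0f;
       for (int i = 0; i < dim; i++) {
           sum += scratch[i] * query[i];
       }
       return sum;
   }
   ```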
   
   The MANY ARRAYS case pre-loads all vectors on heap, but stores them in 
separate arrays per vector, rather than in a single contiguous array. The SKIP 
ONE COPY case is like the BASELINE, but simulates what we might see if we 
implemented `IndexInput.readFloats`, so we could avoid one array copy that's 
needed today.
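   To make the SKIP ONE COPY idea concrete: today the bytes are first copied into a `byte[]` and then decoded into a `float[]`, whereas a `readFloats` method could decode in one step. A rough sketch using `ByteBuffer` as a stand-in for `IndexInput` (the `readFloats` signature is hypothetical; it does not exist yet):

   ```java
   import java.nio.ByteBuffer;

   // BASELINE path: two steps, with an intermediate byte[] copy.
   static void readFloatsViaBytes(ByteBuffer in, byte[] buf, float[] dst, int dim) {
       in.get(buf, 0, dim * Float.BYTES);          // copy 1: bytes into byte[]
       ByteBuffer.wrap(buf).order(in.order())
                 .asFloatBuffer().get(dst, 0, dim); // copy 2: decode to float[]
   }

   // SKIP ONE COPY path: decode straight from the input into the float[],
   // as a hypothetical IndexInput.readFloats(float[], int, int) could do.
   static void readFloatsDirect(ByteBuffer in, float[] dst, int dim) {
       in.asFloatBuffer().get(dst, 0, dim);        // single decode, no byte[]
       in.position(in.position() + dim * Float.BYTES); // advance past the floats
   }
   ```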
   
   ## random access
   | pattern       | time/iteration |
   |---------------|----------------|
   | BASELINE      | 0.594572 us    |
   | SKIP ONE COPY | 0.401249 us    |
   | MANY ARRAYS   | 0.393746 us    |
   | ONE BIG ARRAY | 0.330135 us    |
   
   ## sequential access
   | pattern       | time/iteration |
   |---------------|----------------|
   | BASELINE      | 0.443061 us    |
   | MANY ARRAYS   | 0.188859 us    |
   | SKIP ONE COPY | 0.154549 us    |
   | ONE BIG ARRAY | 0.109249 us    |
   

