[ https://issues.apache.org/jira/browse/LUCENE-9626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Sokolov updated LUCENE-9626: ------------------------------------ Description: I ran some KNN tests constructing an index under the profiler. ||function || percent CPU || |---------------|-----------------------| |dotProduct| 28%| |PriorityQueue<Neighbor>.insertWithOverflow| 13% + 4%| | PriorityQueue.lessThan| 10%| |TreeSet.add| 4% + 4%| |HashSet.add| 7% (visited list?) + 2%| |BoundedVectorValues.vectorValue| 6%| |HnswGraph.getNeighbors| 6%| |HashSet.init| 3%| The main cost, as we'd expect, is computing dot products, but we also spend a lot of time in the various collections. We do not need a {TreeSet} (used to keep a candidate list); a heap is enough for that. We should also be able to improve the {PriorityQueue<Neighbor>} times by switching to a native int heap ({lessThan} will be faster, at least). And I also noticed in the profiler that we do a lot of autoboxing of Integers today, which we can start to reduce to save on garbage. The idea of this issue is that instead of maintaining a priority queue of Neighbor objects (node id, score) for each node in the graph, we maintain two parallel arrays: one for node ids and one for scores. These can be pre-allocated to max-connections, or perhaps to half of that and then grown, since we see that on average fanout is about half of max-connections. Then we can reimplement {Neighbors}, which is currently a {PriorityQueue<Neighbor>}, as an integer heap, encoding both the score (as a half-width float sortable bits), and the index into the parallel arrays of the node (as a short) in the same integer value, using the score as the high bits so that priority queue sorting is correct. Future issues can tackle replacing the visited {HashSet<Integer>} with some more efficient data structure - perhaps a {SparseBitSet} or native int hash set of some sort. > Represent HNSW neighbors with primitive arrays instead of Neighbor Objects > -------------------------------------------------------------------------- > > Key: LUCENE-9626 > URL: https://issues.apache.org/jira/browse/LUCENE-9626 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Michael Sokolov > Priority: Major > > I ran some KNN tests constructing an index under the profiler. > ||function || percent CPU || > |---------------|-----------------------| > |dotProduct| 28%| > |PriorityQueue<Neighbor>.insertWithOverflow| 13% + 4%| > | PriorityQueue.lessThan| 10%| > |TreeSet.add| 4% + 4%| > |HashSet.add| 7% (visited list?) + 2%| > |BoundedVectorValues.vectorValue| 6%| > |HnswGraph.getNeighbors| 6%| > |HashSet.init| 3%| > The main cost, as we'd expect, is computing dot products, but we also spend a > lot of time in the various collections. We do not need a {TreeSet} (used to > keep a candidate list); a heap is enough for that. We should also be able to > improve the {PriorityQueue<Neighbor>} times by switching to a native int heap > ({lessThan} will be faster, at least). And I also noticed in the profiler > that we do a lot of autoboxing of Integers today, which we can start to > reduce to save on garbage. > The idea of this issue is that instead of maintaining a priority queue of > Neighbor objects (node id, score) for each node in the graph, we maintain two > parallel arrays: one for node ids and one for scores. These can be > pre-allocated to max-connections, or perhaps to half of that and then grown, > since we see that on average fanout is about half of max-connections. > Then we can reimplement {Neighbors}, which is currently a > {PriorityQueue<Neighbor>}, as an integer heap, encoding both the score (as a > half-width float sortable bits), and the index into the parallel arrays of > the node (as a short) in the same integer value, using the score as the high > bits so that priority queue sorting is correct. > Future issues can tackle replacing the visited {HashSet<Integer>} with some > more efficient data structure - perhaps a {SparseBitSet} or native int hash > set of some sort. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org