[ https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227534#comment-17227534 ]
Michael Sokolov commented on LUCENE-9583:
-----------------------------------------

I worked up a version of LUCENE-9004 that uses docids in all public APIs, while still exposing random access via docid. This led to a measurable slowdown, since it requires mapping back and forth between docids and ordinals. These are results on an internal dataset; I'm showing the median latency of three runs in each case. The net is about a 10% increase in query latency and about a 28% increase in indexing time.

h3. using ordinal API

||recall||latency||nDoc||fanout||maxConn||beamWidth||visited||index ms||
|0.914|1.73|1000000|0|64|500|2126|4628273|
|0.921|1.86|1000000|10|64|500|2260|0|
|0.924|2.02|1000000|20|64|500|2389|0|
|0.941|2.27|1000000|40|64|500|2644|0|

h3. using docId-only API

||recall||latency||nDoc||fanout||maxConn||beamWidth||visited||index ms||
|0.910|1.92|1000000|0|64|500|2084|5929137|
|0.920|2.05|1000000|10|64|500|2217|0|
|0.949|2.21|1000000|20|64|500|2399|0|
|0.959|2.51|1000000|40|64|500|2671|0|

Please note that there is precedent for exposing "internal" ordinals as part of our API in {{SortedDocValues}}, so we shouldn't shy away from that if it brings value.

I haven't had time to try out forward-only iteration, but I do expect it would introduce some marginal performance regression and considerably complicate the implementation of HNSW at least.

Finally, I'll remind everyone that we have a perfectly good forward-only iteration API for fetching binary data ({{BinaryDocValues}}), and that the genesis of this format was indeed the need for random access over vectors.

I'd appreciate it if folks with concerns could review the attached PR, which I think does a credible job of moving the random-access API into a place where it doesn't intrude on the main {{VectorValues}} API.
That patch has been out for a week or so and I plan to push it soon if there are no further comments there (thanks for approving, @mccandless!). I recognize this topic is somewhat controversial, but I believe we can make rapid progress by iterating on code and measuring results.

> How should we expose VectorValues.RandomAccess?
> -----------------------------------------------
>
>                 Key: LUCENE-9583
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9583
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In the newly-added {{VectorValues}} API, we have a {{RandomAccess}}
> sub-interface. [~jtibshirani] pointed out this is not needed by some
> vector-indexing strategies which can operate solely using a forward-iterator
> (it is needed by HNSW), and so in the interest of simplifying the public API
> we should not expose this internal detail (which by the way surfaces internal
> ordinals that are somewhat uninteresting outside the random access API).
>
> I looked into how to move this inside the HNSW-specific code and remembered
> that we do also currently make use of the RA API when merging vector fields
> over sorted indexes. Without it, we would need to load all vectors into RAM
> while flushing/merging, as we currently do in
> {{BinaryDocValuesWriter.BinaryDVs}}. I wonder if it's worth paying this cost
> for the simpler API.
>
> Another thing I noticed while reviewing this is that I moved the KNN
> {{search(float[] target, int topK, int fanout)}} method from {{VectorValues}}
> to {{VectorValues.RandomAccess}}. This I think we could move back, and
> handle the HNSW requirements for search elsewhere. I wonder if that would
> alleviate the major concern here?
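
For context on the docid/ordinal mapping discussed in the comment, here is a minimal sketch of what such a mapping might look like for a field where only some documents have vectors. The class name DocIdOrdinalMap and its methods are hypothetical and are not taken from the LUCENE-9004 patch; the sketch assumes ordinals are assigned in increasing docid order, so the ordinal-to-docid table is a sorted array.

```java
import java.util.Arrays;

/**
 * Hypothetical illustration only; NOT the LUCENE-9004 implementation.
 * Assumes ordinals are dense (0..n-1) and assigned in increasing docid order.
 */
final class DocIdOrdinalMap {
  // ordToDoc[ord] is the docid of the ord-th document that has a vector.
  // The array is strictly increasing, which is what makes binary search valid.
  private final int[] ordToDoc;

  DocIdOrdinalMap(int[] sortedDocIds) {
    this.ordToDoc = sortedDocIds.clone();
  }

  /** Ordinal to docid: a single array read. */
  int docId(int ord) {
    return ordToDoc[ord];
  }

  /** Docid to ordinal: binary search; returns -1 if the doc has no vector. */
  int ordinal(int docId) {
    int idx = Arrays.binarySearch(ordToDoc, docId);
    return idx >= 0 ? idx : -1;
  }

  /** Number of documents that have a vector. */
  int size() {
    return ordToDoc.length;
  }
}
```

Under these assumptions, a docid-only API would pay for the lookup in ordinal() on every random access, whereas an ordinal-based API lets a caller that already holds ordinals (such as a graph traversal) skip it. Whether this accounts for the measured difference is not something this sketch can show.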