vigyasharma commented on PR #13525:
URL: https://github.com/apache/lucene/pull/13525#issuecomment-2487597269
_...contd. from above – thoughts on supporting independent multi-vectors
specified via `NONE` multi-vector aggregation..._
__
The `Knn{Float|Byte}Vector` fields will accept multiple vector values for
documents. Each vector value will be uniquely identifiable by a nodeId. Vectors
for a doc will be stored adjacent to each other in flat storage.
KnnVectorValues will support APIs for 1) getting docId for a given nodeId
(existing), 2) getting vector value for a specific nodeId (existing), 3)
getting all vector values for the document corresponding to a nodeId (new).
Our codec today has a single, unique, sequentially increasing vector ordinal
per doc, which we can store with the DirectMonotonicWriter and read back with
the DirectMonotonicReader. For multi-vectors, we need to handle multiple
nodeIds mapping to a single document.
I'm thinking of using "ordinals" and "sub-ordinals" to identify each vector
value. The 'ordinal' is incremented when the docId changes. 'Sub-ordinals'
start at 0 for each new doc and are incremented for each subsequent vector
value in the doc. A nodeId in the graph is a "long" with the ordinal packed
into the most-significant 32 bits and the sub-ordinal into the
least-significant 32 bits.
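To make the packing concrete, here is a minimal sketch of how an ordinal and
sub-ordinal could be combined into one long and split apart again. The class
and method names (`NodeIds`, `pack`, `ordinal`, `subOrdinal`) are hypothetical
and not part of the PR; it only assumes the 32/32 bit split described above.

```java
/** Hypothetical helper, not part of Lucene: packs (ordinal, subOrdinal) into a long nodeId. */
final class NodeIds {
  private NodeIds() {}

  /** Ordinal goes into the most-significant 32 bits, sub-ordinal into the least-significant 32. */
  static long pack(int ordinal, int subOrdinal) {
    // Mask the sub-ordinal so sign extension cannot spill into the ordinal bits.
    return ((long) ordinal << 32) | (subOrdinal & 0xFFFFFFFFL);
  }

  /** Recovers the ordinal (one per document that has vectors). */
  static int ordinal(long nodeId) {
    return (int) (nodeId >>> 32);
  }

  /** Recovers the sub-ordinal (position of the vector within its document). */
  static int subOrdinal(long nodeId) {
    return (int) nodeId;
  }
}
```

One property of this layout is that, for non-negative ordinals and
sub-ordinals, nodeIds sort first by ordinal and then by sub-ordinal, so all
vectors of a doc are contiguous in nodeId order.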
For flat storage, we can continue to use the technique in this PR; i.e. have
one DirectMonotonicWriter object for docIds indexed by "ordinals", and another
that stores start offsets for each docId, again indexed by ordinals. The
sub-ordinal bits help us seek to exact vector values from this metadata.
```java
int ordToDoc(long nodeId) {
  // get int ordinal from most-significant 32 bits
  // look up docId for the ordinal in the DirectMonotonic-encoded docId list
}

float[] vectorValue(long nodeId) {
  // get int ordinal from most-significant 32 bits
  // get "startOffset" for ordinal
  // get subOrdinal from least-significant 32 bits
  // read vector value from startOffset + (subOrdinal * dimension * byteSize)
}

float[][] getAllVectorValues(long nodeId) {
  // get int ordinal from most-significant 32 bits
  // get "startOffset" for ordinal
  // get "endOffset" from offset value for ordinal + 1
  // return all vector values in [startOffset, endOffset)
}
```
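The pseudocode above can be sketched as a small in-memory class. This is only
an illustration under stated assumptions: plain `int[]` arrays stand in for the
DirectMonotonic-encoded metadata, a `float[]` stands in for the flat vector
file, and offsets count float elements rather than bytes. The class name
`FlatMultiVectors` is hypothetical.

```java
/** Hypothetical in-memory stand-in for the flat multi-vector storage described above. */
final class FlatMultiVectors {
  private final int dimension;
  private final int[] ordToDoc;     // docId for each ordinal
  private final int[] startOffsets; // first float index per ordinal, plus one trailing end offset
  private final float[] data;       // all vectors, stored back to back

  FlatMultiVectors(int dimension, int[] ordToDoc, int[] startOffsets, float[] data) {
    this.dimension = dimension;
    this.ordToDoc = ordToDoc;
    this.startOffsets = startOffsets;
    this.data = data;
  }

  int ordToDoc(long nodeId) {
    return ordToDoc[(int) (nodeId >>> 32)]; // ordinal lives in the upper 32 bits
  }

  float[] vectorValue(long nodeId) {
    int ord = (int) (nodeId >>> 32);
    int subOrd = (int) nodeId; // sub-ordinal lives in the lower 32 bits
    int from = startOffsets[ord] + subOrd * dimension;
    if (subOrd < 0 || from + dimension > startOffsets[ord + 1]) {
      throw new IllegalArgumentException("sub-ordinal out of range for ordinal " + ord);
    }
    return java.util.Arrays.copyOfRange(data, from, from + dimension);
  }

  float[][] getAllVectorValues(long nodeId) {
    int ord = (int) (nodeId >>> 32);
    int start = startOffsets[ord];
    int end = startOffsets[ord + 1]; // start offset of the next ordinal
    float[][] values = new float[(end - start) / dimension][];
    for (int i = 0; i < values.length; i++) {
      int from = start + i * dimension;
      values[i] = java.util.Arrays.copyOfRange(data, from, from + dimension);
    }
    return values;
  }
}
```

Storing one extra trailing entry in `startOffsets` keeps the `ordinal + 1`
lookup valid for the last document without a special case.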
With this setup, we won't need parent-block join queries for multiple vector
values. And we can use `getAllVectorValues()` for scoring with max or avg of
all vectors in the doc at query time.
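As a rough illustration of the query-time aggregation mentioned above, the
sketch below scores a query against every vector of a doc and takes the max or
the average. It assumes dot product as the similarity and takes the doc's
vectors as a `float[][]`; in Lucene the configured `VectorSimilarityFunction`
would be used instead. `MultiVectorScoring` and its methods are hypothetical
names.

```java
/** Hypothetical scoring helper: aggregates per-vector similarities for one document. */
final class MultiVectorScoring {
  private MultiVectorScoring() {}

  /** Plain dot product; an assumption standing in for the field's similarity function. */
  static float dot(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Best similarity between the query and any vector of the doc. */
  static float maxScore(float[] query, float[][] docVectors) {
    if (docVectors.length == 0) {
      throw new IllegalArgumentException("doc has no vectors");
    }
    float best = Float.NEGATIVE_INFINITY;
    for (float[] vector : docVectors) {
      best = Math.max(best, dot(query, vector));
    }
    return best;
  }

  /** Mean similarity between the query and all vectors of the doc. */
  static float avgScore(float[] query, float[][] docVectors) {
    if (docVectors.length == 0) {
      throw new IllegalArgumentException("doc has no vectors");
    }
    float total = 0f;
    for (float[] vector : docVectors) {
      total += dot(query, vector);
    }
    return total / docVectors.length;
  }
}
```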
I'm skeptical that this will give a visible performance boost. It should at
least match the block-join setup we have today, and hopefully be more
convenient to use. It also sets us up for "dependent" multi-vector values like
ColBERT.
We'll need to code this up to iron out any wrinkles. I can work on a draft
PR if the idea makes sense.
__
Note that this still doesn't allow >2B vector values. While the "long"
nodeId can support it, our ANN impl. returns arrays containing all nodeIds in
various places, and Java arrays are indexed by int, so their length cannot
exceed roughly 2B elements. But we can address this limitation separately,
perhaps with a different ANN algo for such high cardinality graphs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]