kaivalnp commented on issue #14758:
URL: https://github.com/apache/lucene/issues/14758#issuecomment-3246818266

   ### What if we de-duplicate vectors in Lucene?
    
   - Today, we have a 
[`Lucene99FlatVectorsFormat`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsFormat.java)
 responsible for reading / writing raw vectors
        - This format [maintains a 
list](https://github.com/apache/lucene/blob/ac90517c17ef78a469c65868e2026461f6ddcddc/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsWriter.java#L408)
 of vectors _per-field_ during indexing, simply appending to the list for new 
documents
        - All vectors within a field are written in a single "block" during 
flush, one vector after another
        - All "blocks" are present in the same file, one field after another
        - The metadata file maps each field to its "block", i.e. an [offset and 
length](https://github.com/apache/lucene/blob/ac90517c17ef78a469c65868e2026461f6ddcddc/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsWriter.java#L348-L349)
 in the raw vector file
   - What if we combine "homogeneous" fields into a single "block", and 
de-duplicate vectors inside a "block"?
        - For vector data, "homogeneous" fields are ones with the same `byte` / 
`float32` encoding _and_ dimension
        - This proposal has an added side-benefit of de-duplicating vectors 
_within_ a field as well (if the features used for vector generation are 
identical across two documents)
   
   An end-user can simply add new top-level vector fields in the same document 
with the same vector, and Lucene takes care of storing a single unique copy:
   
   ```java
   IndexWriter iw; // init

   float[] vector1; // init
   Document doc1 = new Document();
   doc1.add(new KnnFloatVectorField("vector-category-1", vector1));
   doc1.add(new KnnFloatVectorField("vector-category-2", vector1));
   doc1.add(new KnnFloatVectorField("vector-category-3", vector1));
   iw.addDocument(doc1); // only stores one copy of vector1, but creates / adds to HNSW graphs 1, 2, 3

   float[] vector2; // init
   Document doc2 = new Document();
   doc2.add(new KnnFloatVectorField("vector-category-1", vector2));
   doc2.add(new KnnFloatVectorField("vector-category-3", vector2));
   iw.addDocument(doc2); // only stores one copy of vector2, but creates / adds to HNSW graphs 1, 3
   ```
   
   #### Performance considerations
   **Current State:**
   - Within the format, vectors are represented as "ordinals", and a mapping from ordinal -> docid is stored in the metadata file
   - Edges of a node in the HNSW graph are stored as a list of ordinals in increasing order, and the graph itself is a concatenation of such lists (see the sketch after this list):
        - i.e. to visit the neighbors of the vector with "ordinal" `k` during graph search, we seek to position `k * maxConn * bytes per ordinal (= 4)` and read `maxConn * bytes per ordinal (= 4)` bytes as a list of `maxConn` "ordinals"
        - We retrieve the vector corresponding to an "ordinal" by seeking to `ordinal * dimension * bytes per dimension (= 1 for byte vectors, or 4 for float32 vectors)`
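
   A simplified sketch of that seek arithmetic (`graphData` / `vectorData` are illustrative names, not the actual reader internals, and real graphs also handle levels and variable neighbor counts):

   ```java
   IndexInput graphData, vectorData; // init (illustrative)
   int maxConn = 16, dimension = 768; // example values
   int k = 42; // ordinal whose neighbors / vector we want

   // neighbor list of node k: maxConn ordinals of 4 bytes each
   graphData.seek((long) k * maxConn * Integer.BYTES);
   int[] neighbors = new int[maxConn];
   graphData.readInts(neighbors, 0, maxConn);

   // raw vector of ordinal k: dimension values of 4 bytes each (float32)
   vectorData.seek((long) k * dimension * Float.BYTES);
   float[] vector = new float[dimension];
   vectorData.readFloats(vector, 0, dimension);
   ```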
   
   **Query Time:**
   If we only store unique copies of each vector:
   - Multiple ordinals can point to the same vector, so we need to maintain an additional mapping of ordinal -> position of the vector in the raw data. This means one additional lookup per vector retrieval (TBD: store this mapping on- or off-heap? see the sketch after this list)
        - We could also avoid the lookup completely by expanding each `list[neighbor ord]` in the HNSW graph into a `list[tuple[neighbor ord, position of vector in raw file]]` -- but this would double the graph size (TBD whether this is required)
   - Currently neighbors are stored in increasing order of "ordinals", which has data access pattern benefits when multiple vectors are retrieved from the same page. In our proposal, neighbors will need to be stored in increasing order of position in the raw file instead, to maintain the same benefits (breaking the convention of increasing ordinal IDs, but I don't think this is a hard constraint)
        - Note that our proposal has an added data access pattern benefit for "de-duplicated" vectors -- they'll be present at the same physical location, giving a guaranteed OS cache-hit (where earlier, they were present at different locations in the raw file, and _may or may not_ result in an OS cache-hit)
   - Another consequence of storing vectors in non-"ordinal" order: exact search currently iterates vectors in increasing docid (i.e. increasing ordinal) order, which would no longer match the file layout -- but this is not a constraint, since exact search traverses _all_ vectors anyway (i.e. we'd iterate vectors in the order they appear in the raw file, which is doable in theory but needs some API additions)
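
   A minimal sketch of the extra indirection, assuming a simple on-heap mapping (the actual on- vs off-heap structure is TBD, and `ordToPos` / `vectorValue` are hypothetical names):

   ```java
   long[] ordToPos; // init: ordToPos[ord] = position of ord's vector in the raw file
   IndexInput vectorData; // init
   int dimension; // init

   float[] vectorValue(int ord) throws IOException {
     long pos = ordToPos[ord]; // one extra lookup vs. today
     vectorData.seek(pos * dimension * Float.BYTES);
     float[] vector = new float[dimension];
     vectorData.readFloats(vector, 0, dimension);
     return vector;
   }
   ```

   Two ordinals that were de-duplicated share the same `pos`, which is what gives the guaranteed OS cache-hit mentioned above.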
   
   **Indexing Time:**
   Indexing _will_ be more expensive (more work to determine "unique" vectors + 
create new / smaller HNSW graphs)
   
   TBD: Ideal order for storing raw vectors?
   - If we maintain insertion order -- unique vectors get new monotonically increasing positions in the raw file, and every "hit" (i.e. a non-unique vector) points to an existing position (see the sketch after this list). This could be beneficial with something like [`BpVectorReorderer`](https://github.com/apache/lucene/blob/main/lucene/misc/src/java/org/apache/lucene/misc/index/BpVectorReorderer.java) where documents with vectors "closer" to each other have nearby Lucene doc IDs. However, merges become expensive -- we have to combine vectors in arbitrary order across two files, keeping only the unique ones -- so everything has to be loaded into RAM (something like `O(1)` insertion time during indexing, but `O(n)` time + `O(n)` memory during merging -- this may be a non-starter)
   - If we instead store vectors in a deterministic sorted order (ideally one with some data locality benefits) -- merging becomes inexpensive, as both files are "sorted" the same way -- so two files can be merged in a single pass (something like `O(log n)` insertion at indexing time, and `O(n)` time + `O(1)` memory during merging). We can accept `O(log n)` because it's not asymptotically worse than HNSW insertion
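
   To make the insertion-order option concrete, here's a minimal sketch of de-duplication at flush time (all names are hypothetical -- a production version would likely avoid holding full vectors as map keys):

   ```java
   // Hypothetical sketch: VectorKey wraps float[] for value-based equality
   record VectorKey(float[] v) {
     @Override public boolean equals(Object o) {
       return o instanceof VectorKey k && Arrays.equals(v, k.v);
     }
     @Override public int hashCode() { return Arrays.hashCode(v); }
   }

   Map<VectorKey, Integer> uniquePos = new HashMap<>();
   int[] ordToPos = new int[vectors.size()]; // ordinal -> position in raw file
   int nextPos = 0;
   for (int ord = 0; ord < vectors.size(); ord++) {
     float[] v = vectors.get(ord);
     Integer pos = uniquePos.putIfAbsent(new VectorKey(v), nextPos);
     if (pos == null) { // first occurrence: append the vector to the raw file
       writeVector(v); // hypothetical helper
       pos = nextPos++;
     }
     ordToPos[ord] = pos; // duplicates point at the existing unique copy
   }
   ```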
   
   #### Other considerations
   - We'll need some codec-level plumbing to ensure that the same _instance_ (not just class!) of the raw format is shared by all HNSW formats (see the sketch after this list)
   - TBD how this whole proposal works out for quantized vectors:
        - Two different non-quantized vectors can map to the same quantized vector, but with different scaling and score correction constants
        - Two identical non-quantized vectors can map to different quantized vectors, based on the distribution of values _within_ each field
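
   For illustration, the per-field plumbing might look something like this -- a `PerFieldKnnVectorsFormat` that hands every field the same format instance. Note this alone doesn't pool raw storage across fields; the de-duplicating flat writer behind it is still hypothetical:

   ```java
   // One shared instance (not just class) across all vector fields
   KnnVectorsFormat shared = new Lucene99HnswVectorsFormat(16, 100);

   KnnVectorsFormat perField = new PerFieldKnnVectorsFormat() {
     @Override
     public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
       return shared; // every "homogeneous" field routes through the same instance
     }
   };
   ```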
   
   This idea is still WIP -- but at a high level, it seems doable without changing public APIs.
   Please let me know what you think!
   

