LantaoJin commented on PR #16214:
URL: https://github.com/apache/lucene/pull/16214#issuecomment-4667122608

   
   ### Usage
   
   The API resolves `term` to the matching document(s), exactly as
   `updateDocValues` does, and overlays the new vector on `field`. Address each
   document by a **unique id term** (an un-analyzed `StringField`), the same
   convention as `updateDocument(Term, ...)`.
   
   **Scenario 1 — update a single document's embedding.**
   
   ```java
   try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
     // 1 x dim, same dim/similarity as the field
     float[] newEmbedding = model.embed(textForDoc42);
     writer.updateFloatVectorValue(new Term("id", "doc-42"), "vec", newEmbedding);
     writer.commit();
   }
   ```
   
   **Scenario 2 — partial re-embed of a subset (drift correction, fixing a bad 
batch).**
   Only the listed docs are touched; every other document, including all of its
   fields, is left byte-for-byte unchanged.
   
   ```java
   try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
     for (Map.Entry<String, float[]> e : reembeddedById.entrySet()) {
       writer.updateFloatVectorValue(new Term("id", e.getKey()), "vec", e.getValue());
     }
     // each touched segment writes ONE new-generation flat column (not one per call)
     writer.commit();
   }
   ```
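
   Since each vector must match the field's dimension, it may be worth checking
   a batch before issuing any updates. The following is a minimal sketch in
   plain Java and is not part of this PR: `EmbeddingBatchCheck`, `invalidIds`,
   and the `expectedDim` parameter are hypothetical names, and rejecting
   non-finite values is an assumption about what the caller wants, not
   documented behavior of the API.

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;

   // Hypothetical helper (not a Lucene class): pre-validates a re-embed batch.
   class EmbeddingBatchCheck {

     /** Returns the ids whose vector is null, has the wrong length, or holds NaN/Infinity. */
     static List<String> invalidIds(Map<String, float[]> reembeddedById, int expectedDim) {
       List<String> bad = new ArrayList<>();
       for (Map.Entry<String, float[]> e : reembeddedById.entrySet()) {
         float[] v = e.getValue();
         if (v == null || v.length != expectedDim || !allFinite(v)) {
           bad.add(e.getKey());
         }
       }
       return bad;
     }

     /** True when every component is a finite float (neither NaN nor +/-Infinity). */
     static boolean allFinite(float[] v) {
       for (float x : v) {
         if (!Float.isFinite(x)) {
           return false;
         }
       }
       return true;
     }
   }
   ```

   Calling `invalidIds(reembeddedById, dim)` before the loop above, and skipping
   or re-embedding the returned ids, keeps a malformed batch from reaching the
   writer.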
   
   **Scenario 3 — re-embed the whole vector field after a model fine-tune (the 
motivating case).**
   Consider an index with one vector field plus N immutable non-vector fields.
   `model1 → model2` is a fine-tune, so the dimension and similarity are
   unchanged. We refresh the embedding for every document **without rebuilding
   or possessing the N other fields**, then merge once to rebuild the HNSW graph.
   
   ```java
   try (IndexReader reader = DirectoryReader.open(dir);
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
   
     StoredFields stored = reader.storedFields();
     Bits live = MultiBits.getLiveDocs(reader);
     for (int docID = 0; docID < reader.maxDoc(); docID++) {
       if (live != null && !live.get(docID)) {
         continue; // skip deleted docs
       }
       String id = stored.document(docID).get("id");
       float[] newEmbedding = model2.embed(/* source for this doc */);
       writer.updateFloatVectorValue(new Term("id", id), "vec", newEmbedding);
     }
     // flush the buffered overlays: one new-gen flat column per segment
     writer.commit();
     // rebuild the HNSW graph ONCE; merged segment is back to vectorGen == -1
     writer.forceMerge(1);
   }
   ```
   
   > Between `commit()` and the merge, the updated segments carry the new flat
   > vectors but no graph, so ANN search on them falls back to an exact scan
   > (correct, but slower). This is the same deferral that `updateDocument` /
   > `updateDocValues` apply to graph / secondary-structure work. The
   > `forceMerge` rebuilds the graph once, optimally, restoring approximate
   > search. (For workloads that cannot tolerate that window, eager rebuild is a
   > planned follow-up; see *Scope / limitations*.)
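
   To make the cost of that fallback concrete, here is a minimal sketch of an
   exact top-k scan over a flat array of vectors, in plain Java. It is an
   illustration only, not Lucene's implementation: `ExactScanSketch`, `dot`, and
   `topK` are hypothetical names, and dot product stands in for whatever
   similarity the field is configured with.

   ```java
   import java.util.Comparator;
   import java.util.PriorityQueue;

   // Hypothetical illustration (not Lucene code): brute-force nearest-neighbor scan.
   class ExactScanSketch {

     /** Dot product of two vectors; assumes both have the same length. */
     static float dot(float[] a, float[] b) {
       float sum = 0f;
       for (int i = 0; i < a.length; i++) {
         sum += a[i] * b[i];
       }
       return sum;
     }

     /** Returns the ordinals of the (at most) k highest-scoring vectors, best first. */
     static int[] topK(float[][] vectors, float[] query, int k) {
       float[] scores = new float[vectors.length];
       // Min-heap keyed on score, so the weakest of the current top k sits at the head.
       PriorityQueue<Integer> heap =
           new PriorityQueue<>(Comparator.comparingDouble((Integer i) -> scores[i]));
       for (int ord = 0; ord < vectors.length; ord++) {
         scores[ord] = dot(vectors[ord], query); // every vector is scored: O(n * dim)
         heap.add(ord);
         if (heap.size() > k) {
           heap.poll(); // evict the current weakest candidate
         }
       }
       int[] result = new int[heap.size()];
       for (int i = result.length - 1; i >= 0; i--) {
         result[i] = heap.poll(); // polls weakest first, so fill from the back
       }
       return result;
     }
   }
   ```

   Every query scores all vectors, so the cost grows linearly with the number of
   vectors scanned, which is why the merge that restores the graph matters most
   for large segments.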
   
   **Scenario 4 — byte vectors.** Identical shape with `updateByteVectorValue`:
   
   ```java
   // newByteEmbedding is a byte[] of length dim
   writer.updateByteVectorValue(new Term("id", "doc-42"), "vec", newByteEmbedding);
   ```
   
   **Scenario 5 — read the update back (committed or NRT).**
   
   ```java
   writer.updateFloatVectorValue(new Term("id", "doc-1"), "vec", newEmbedding);
   // near-real-time reader, no commit needed
   try (DirectoryReader reader = DirectoryReader.open(writer)) {
     for (LeafReaderContext ctx : reader.leaves()) {
       FloatVectorValues values = ctx.reader().getFloatVectorValues("vec");
       // ... iterate; doc-1 now reflects newEmbedding, every other doc is unchanged
     }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

