LantaoJin commented on PR #16214:
URL: https://github.com/apache/lucene/pull/16214#issuecomment-4667122608
### Usage
The API resolves `term` to the matching document(s), exactly like `updateDocValues`, and overlays
the new vector on `field`. Address each document by a **unique id term** (an un-analyzed
`StringField`), the same convention as `updateDocument(Term, ...)`.
**Scenario 1 — update a single document's embedding.**
```java
try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
  // 1 x dim vector; same dim/similarity as the field
  float[] newEmbedding = model.embed(textForDoc42);
  writer.updateFloatVectorValue(new Term("id", "doc-42"), "vec", newEmbedding);
  writer.commit();
}
```
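The comment in the snippet above requires the new vector to have the field's dimension. A caller-side guard could catch a mismatch before the update is submitted. The following is only a sketch: `VectorGuard` and `requireDim` are hypothetical names that are not part of Lucene or of this PR, and the finite-value check is an added assumption. Whether the writer performs its own validation is not stated here.

```java
import java.util.Objects;

/** Hypothetical caller-side guard; not part of Lucene or of this PR. */
final class VectorGuard {
  private VectorGuard() {}

  /**
   * Returns {@code vector} unchanged if it has exactly {@code expectedDim}
   * finite components; otherwise throws IllegalArgumentException.
   */
  static float[] requireDim(float[] vector, int expectedDim) {
    Objects.requireNonNull(vector, "vector");
    if (vector.length != expectedDim) {
      throw new IllegalArgumentException(
          "expected dimension " + expectedDim + " but got " + vector.length);
    }
    for (int i = 0; i < vector.length; i++) {
      // Assumption: NaN and infinities are never valid embedding components.
      if (!Float.isFinite(vector[i])) {
        throw new IllegalArgumentException("non-finite value at index " + i);
      }
    }
    return vector;
  }

  public static void main(String[] args) {
    float[] ok = requireDim(new float[] {0.1f, 0.2f, 0.3f}, 3);
    System.out.println("accepted vector of length " + ok.length);
  }
}
```

In Scenario 1 this would wrap the model output, for example `VectorGuard.requireDim(model.embed(textForDoc42), dim)`, where `dim` is whatever dimension the `vec` field was indexed with.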
**Scenario 2 — partial re-embed of a subset (drift correction, fixing a bad
batch).**
Only the listed docs are touched; every other document, including all of its fields, is left
byte-for-byte unchanged.
```java
try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
  for (Map.Entry<String, float[]> e : reembeddedById.entrySet()) {
    writer.updateFloatVectorValue(new Term("id", e.getKey()), "vec", e.getValue());
  }
  // Each touched segment writes ONE new-generation flat column (not one per call).
  writer.commit();
}
```
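The snippet above assumes a `reembeddedById` map already exists. One way it might be assembled is sketched below. `ReembedBatch`, `build`, and the stand-in embedder are hypothetical names for illustration only; a real embedder would call the model with each document's source text.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper for assembling the id -> vector map used in Scenario 2. */
final class ReembedBatch {
  private ReembedBatch() {}

  /**
   * Calls {@code embed} once per distinct id and returns the results keyed by id,
   * in first-seen order. A repeated id is embedded only once.
   */
  static Map<String, float[]> build(List<String> ids, Function<String, float[]> embed) {
    Map<String, float[]> out = new LinkedHashMap<>();
    for (String id : ids) {
      out.computeIfAbsent(id, embed);
    }
    return out;
  }

  public static void main(String[] args) {
    // Stand-in embedder (assumption): derives a dummy 2-d vector from the id.
    Function<String, float[]> fakeEmbed = id -> new float[] {id.length(), 0f};
    Map<String, float[]> reembeddedById =
        build(List.of("doc-7", "doc-42", "doc-7"), fakeEmbed);
    System.out.println(reembeddedById.keySet());
  }
}
```

Deduplicating ids up front keeps the loop in Scenario 2 to one update call per document, which also avoids paying for the same embedding twice.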
**Scenario 3 — re-embed the whole vector field after a model fine-tune (the
motivating case).**
Consider an index with one vector field plus N immutable non-vector fields. `model1 → model2` is a
fine-tune, so the dimension and similarity are unchanged. We refresh the embedding for every
document **without rebuilding or possessing the N other fields**, then merge once to rebuild the
HNSW graph.
```java
try (IndexReader reader = DirectoryReader.open(dir);
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
  StoredFields stored = reader.storedFields();
  Bits live = MultiBits.getLiveDocs(reader);
  for (int docID = 0; docID < reader.maxDoc(); docID++) {
    if (live != null && !live.get(docID)) {
      continue; // skip deleted docs
    }
    String id = stored.document(docID).get("id");
    float[] newEmbedding = model2.embed(/* source for this doc */);
    writer.updateFloatVectorValue(new Term("id", id), "vec", newEmbedding);
  }
  // Flush the buffered overlays: one new-gen flat column per segment.
  writer.commit();
  // Rebuild the HNSW graph ONCE; the merged segment is back to vectorGen == -1.
  writer.forceMerge(1);
}
```
> Between `commit()` and the merge, the updated segments carry the new flat vectors but no graph, so
> ANN search on them falls back to an exact scan (correct, but slower). This is the same deferral
> that `updateDocument` / `updateDocValues` apply to graph / secondary-structure work. The
> `forceMerge` rebuilds the graph once, optimally, restoring approximate search. (For workloads
> that cannot tolerate that window, eager rebuild is a planned follow-up; see *Scope /
> limitations*.)
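
To make "exact scan" concrete: it means scoring the query against every vector in the segment and keeping the best k, instead of walking an HNSW graph. The standalone sketch below illustrates the idea only and is not Lucene's code. `ExactScan` and `topK` are hypothetical names, and using the raw dot product as the score is an assumption; the real score depends on the field's similarity function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative brute-force top-k search; not Lucene's implementation. */
final class ExactScan {
  private ExactScan() {}

  static float dot(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("dimension mismatch");
    }
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Returns the ordinals of the k highest-scoring vectors, best first. */
  static List<Integer> topK(float[][] vectors, float[] query, int k) {
    if (k <= 0) {
      return new ArrayList<>();
    }
    float[] scores = new float[vectors.length];
    // Min-heap on score: the root is the weakest of the current best k.
    PriorityQueue<Integer> heap =
        new PriorityQueue<Integer>((x, y) -> Float.compare(scores[x], scores[y]));
    for (int ord = 0; ord < vectors.length; ord++) {
      scores[ord] = dot(vectors[ord], query); // every vector is visited: O(n * dim)
      heap.add(ord);
      if (heap.size() > k) {
        heap.poll(); // drop the current weakest candidate
      }
    }
    List<Integer> result = new ArrayList<>(heap);
    result.sort((x, y) -> Float.compare(scores[y], scores[x])); // best first
    return result;
  }

  public static void main(String[] args) {
    float[][] vectors = {{1f, 0f}, {0f, 1f}, {0.9f, 0.1f}, {-1f, 0f}};
    System.out.println(topK(vectors, new float[] {1f, 0f}, 2));
  }
}
```

The cost grows linearly with the number of vectors scanned, which is why the window between `commit()` and the merge is described as correct but slower.
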
**Scenario 4 — byte vectors.** Identical shape with `updateByteVectorValue`:
```java
// newByteEmbedding is a byte[] of length dim
writer.updateByteVectorValue(new Term("id", "doc-42"), "vec", newByteEmbedding);
```
**Scenario 5 — read the update back (committed or NRT).**
```java
writer.updateFloatVectorValue(new Term("id", "doc-1"), "vec", newEmbedding);
// Near-real-time reader: sees the update without a commit.
try (DirectoryReader reader = DirectoryReader.open(writer)) {
  for (LeafReaderContext ctx : reader.leaves()) {
    FloatVectorValues values = ctx.reader().getFloatVectorValues("vec");
    // ... iterate; doc-1 now reflects newEmbedding, every other doc is unchanged
  }
}
```