LantaoJin commented on PR #16214: URL: https://github.com/apache/lucene/pull/16214#issuecomment-4666315741
> in that situation you typically need to reindex everything

Thanks @jimczi, @iverase -- this is the right thing to pressure-test. Let me give the concrete shape of the use case, because I think it narrows the disagreement.

**The index**: one vector field + ~49 other fields (analyzed text, stored, doc-values). The 49 fields are **immutable** for the lifetime of the index. Only the embedding changes: we fine-tune **model1 → model2** (same dimension, same similarity), and want to refresh the vector field for the affected docs.

You're right that the **vector field itself is rebuilt** regardless -- every re-embedded doc changes its vector, so the flat column is fully rewritten and the HNSW graph fully rebuilt no matter which path we take. In-place doesn't claim to save any vector-side work, and I agree that "you reindex the vector field anyway" is true.

The cost it removes is the **other 49 fields**. Today the only way to refresh the vectors is to build a new index with all 50 fields and drop the old one -- which re-analyzes and rewrites the 49 unchanged fields for zero benefit, and requires the caller to still **possess every field's source value** (often it lives in a separate source-of-truth system and isn't cheaply reconstructable). `updateDocument` has the same problem: it's whole-document, so it drags the 49 fields along. In our indices the 49 fields are the bulk of the indexing cost, so skipping them is the actual win -- not the vector write.

So I'd reframe the value prop as: update the embedding given only `(id, newVector)`, without rebuilding or even possessing the rest of the document. The model-version bump is one instance; partial/rolling re-embedding of a subset is another.
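
To make the cost argument concrete, here is a toy Java model of the two refresh paths. It is a minimal sketch and not Lucene code: `ReembedCostSketch`, `replaceDocument`, `updateVector`, the 4-dimensional vectors, and the 1,000-document corpus are all illustrative assumptions, and `updateVector` is only a stand-in for the `(id, newVector)` shape described above, not the PR's actual API. It counts field values written per path and deliberately ignores the vector-side work (flat column rewrite, HNSW rebuild), which is the same on both paths.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory model, not Lucene code: it only counts field values written per refresh path. */
final class ReembedCostSketch {
    static final int OTHER_FIELDS = 49; // immutable fields per document, figure taken from the comment
    static final int DIM = 4;           // assumed embedding dimension, kept tiny for the demo

    private final Map<Integer, Map<String, String>> otherFields = new HashMap<>();
    private final Map<Integer, float[]> vectors = new HashMap<>();
    private long fieldWrites = 0;

    /** Whole-document write: the caller must supply every field's source value again. */
    void replaceDocument(int id, Map<String, String> fields, float[] vector) {
        otherFields.put(id, new HashMap<>(fields));
        vectors.put(id, vector.clone());
        fieldWrites += fields.size() + 1; // all non-vector fields plus the vector
    }

    /** Hypothetical vector-only write: needs just (id, newVector). */
    void updateVector(int id, float[] newVector) {
        float[] old = vectors.get(id);
        if (old == null) throw new IllegalArgumentException("unknown doc id " + id);
        if (old.length != newVector.length) throw new IllegalArgumentException("dimension changed");
        vectors.put(id, newVector.clone());
        fieldWrites += 1; // only the vector is written
    }

    /** Stand-in for fetching the 49 source values from a separate source-of-truth system. */
    static Map<String, String> sourceFields(int id) {
        Map<String, String> fields = new HashMap<>();
        for (int f = 0; f < OTHER_FIELDS; f++) fields.put("field" + f, "value-" + id + "-" + f);
        return fields;
    }

    /** Stand-in for an embedding model; modelVersion 1 and 2 play model1 and model2. */
    static float[] embed(int id, int modelVersion) {
        float[] v = new float[DIM];
        for (int i = 0; i < DIM; i++) v[i] = id * 0.001f + modelVersion + i;
        return v;
    }

    private static ReembedCostSketch initialIndex(int docs) {
        ReembedCostSketch index = new ReembedCostSketch();
        for (int id = 0; id < docs; id++) index.replaceDocument(id, sourceFields(id), embed(id, 1));
        index.fieldWrites = 0; // count only the refresh, not the initial build
        return index;
    }

    /** Refresh every vector through the whole-document path. */
    static long wholeDocumentRefreshWrites(int docs) {
        ReembedCostSketch index = initialIndex(docs);
        for (int id = 0; id < docs; id++) index.replaceDocument(id, sourceFields(id), embed(id, 2));
        return index.fieldWrites;
    }

    /** Refresh every vector through the hypothetical vector-only path. */
    static long vectorOnlyRefreshWrites(int docs) {
        ReembedCostSketch index = initialIndex(docs);
        for (int id = 0; id < docs; id++) index.updateVector(id, embed(id, 2));
        return index.fieldWrites;
    }

    public static void main(String[] args) {
        int docs = 1_000; // assumed corpus size
        System.out.println("whole-document refresh: " + wholeDocumentRefreshWrites(docs) + " field writes");
        System.out.println("vector-only refresh:    " + vectorOnlyRefreshWrites(docs) + " field writes");
    }
}
```

With 1,000 documents the whole-document path writes 50,000 field values and the vector-only path writes 1,000. The whole-document path also has to call `sourceFields` for every document, which models the requirement that the caller still possess every field's source value.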
