yx-keith opened a new pull request, #63913:
URL: https://github.com/apache/doris/pull/63913
AnnIndexColumnWriter buffers vectors and flushes a full chunk
(ann_index_build_chunk_size, default 1,000,000 rows) at a time, calling train()
then add() on each flush. Re-invoking train() per chunk is wrong:
- For IVF with a SQ/PQ quantizer, train() re-fits the scalar/product
codebook on the latest chunk (FAISS IndexIVF::train runs train_encoder with no
is_trained guard). Vectors from earlier chunks were already add()ed and encoded
under the previous codebook, so after the final train() their stored codes no
longer match the codebook used at search time, corrupting recall on any segment
larger than one chunk.
- For the default IVFFlat the coarse-quantizer k-means is guarded
(Level1Quantizer::train_q1 returns early once quantizer->ntotal == nlist), so
recall is not corrupted, but every chunk still pays a redundant train() call
(and its OpenMP build-budget mutex).
The writer's own finish() comment already documents the intended contract
("the quantizer was already trained on previous batches, just add"), but the
per-chunk loop never honored it.
Train at most once per index lifetime:
- Add VectorIndex::needs_training(); FaissVectorIndex returns true for IVF /
IVF_ON_DISK and for any non-FLAT quantizer, false for HNSW+FLAT.
- Route both train() call sites through _train_once_if_needed(), guarded by
a _trained flag, so train() runs at most once and HNSW+FLAT skips it entirely.
Tests:
- MockVectorIndex gains a needs_training() mock; the IVF/PQ/SQ writer tests
now assert train() is invoked exactly once rather than per chunk.
- New TestHnswSkipsTrainCall pushes multiple full chunks through an index
that reports needs_training()==false and asserts zero train() calls.
- New NeedsTrainingMatrix unit-tests FaissVectorIndex::needs_training()
across index types and quantizers.
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
### Release note
None
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
- [ ] No.
- [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
- [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
https://github.com/apache/doris-website/pull/1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should
merge into -->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]