zhiqiang-hhhh opened a new pull request, #60358:
URL: https://github.com/apache/doris/pull/60358
### 1. Background of the Problem
In Doris's ANN (Approximate Nearest Neighbor) index implementation,
quantizers (such as PQ - Product Quantization and SQ - Scalar Quantization)
require sufficient training data to build effective indexes. If the training
data is insufficient, index construction may fail or be skipped, leading to
degraded query performance or errors.
Initially, the code only considered the minimum training rows for the IVF
(Inverted File) index type (based on the number of clusters `nlist`), but did
not adequately account for the quantizer's own requirements. Additionally, the
test suite quantizer_min_train_rows.groovy only covered the IVF index type and
did not include the HNSW (Hierarchical Navigable Small World) index type, which
also supports quantizers.
The user discovered:
- The HNSW index type was set as the default in the code, but its minimum
training rows calculation was incomplete.
- The regression tests lacked test cases for HNSW, resulting in incomplete
coverage.
- The formulas for minimum training rows for PQ and SQ needed adjustment
based on FAISS implementation to ensure accuracy.
This could cause tests to fail or inaccurately validate index construction
logic.
### 2. How It Was Resolved
#### a. Modified the `get_min_train_rows()` Function in faiss_ann_index.cpp
- **Original Logic**: Only returned IVF's `nlist` or 0 (for HNSW).
- **New Logic**:
- Calculate IVF minimum rows: If `index_type == IVF`, then `ivf_min =
ivf_nlist`; otherwise, `ivf_min = 0`.
- Calculate quantizer minimum rows:
- For PQ: `pq_m * (1 << pq_nbits) * 256` (based on FAISS's codebook
count and empirical multiplier).
- For SQ: Fixed at 20 (empirical value, applicable to both IVF and HNSW).
- Return `std::max(ivf_min, quantizer_min)` to ensure both IVF and
quantizer requirements are met.
- **Purpose**: Enable HNSW to correctly check quantizer training data,
preventing the construction of invalid indexes when data is insufficient.
#### b. Updated the Regression Test quantizer_min_train_rows.groovy
- **Added HNSW Test Cases**:
- HNSW PQ insufficient: Insert 3 rows (< 131072), verify index
construction is skipped.
- HNSW SQ insufficient: Insert 3 rows (< 20), verify index construction is
skipped.
- HNSW PQ sufficient: Insert 2048 rows (>= 2048), verify index
construction succeeds.
- HNSW SQ sufficient: Insert 12 rows (>= 20), verify index construction
succeeds.
- **Adjusted Existing IVF Tests**:
- Updated PQ sufficient test: Insert 2048 rows (previously incorrectly
inserted 200 rows, causing test failure).
- Updated comments: Changed from `nlist * 2` to `20` for SQ, and clarified
PQ formula as `pq_m * (1 << pq_nbits) * 256`.
- **Cleanup**: Added drop statements for HNSW tables.
- **Purpose**: Ensure tests cover all index types and quantizer
combinations, validating that data insertion succeeds but index construction is
skipped or performed based on min_train_rows.
#### c. Verification
- Ran regression tests: Confirmed all test cases pass (exit code 0), with
normal data insertion and correct index construction logic.
### 3. Notes
- **Compatibility**: Changes only affect ANN index training data checks and
do not impact existing queries or index usage. Ensure sufficient data volume in
production to avoid index construction failures.
- **Performance Impact**: Minimum training rows checks occur during index
construction and do not affect query performance. If data is insufficient,
indexes are not built, and users must insert more data and rebuild manually.
- **Test Coverage**: New HNSW tests ensure comprehensive coverage. Note
parameter differences between IVF and HNSW (HNSW lacks `nlist` but has
`max_degree` and `ef_construction`).
- **Formula Accuracy**: PQ formula is based on FAISS codebook count (`pq_m *
2^pq_nbits`) multiplied by empirical factor 256. SQ's 20 is an empirical value
and may need adjustment based on actual FAISS behavior.
- **Error Handling**: If training data is insufficient, the code skips index
construction and logs it, without throwing exceptions, ensuring data insertion
succeeds.
- **Version Dependencies**: These changes rely on FAISS library
implementation; ensure third-party library versions are consistent.
### 4. How min_train_rows Is Calculated
`min_train_rows` is calculated via `FaissVectorIndex::get_min_train_rows()`,
returning the larger value of IVF and quantizer requirements:
- **IVF Minimum Rows (`ivf_min`)**:
- If `index_type == IVF`, then `ivf_min = ivf_nlist` (number of clusters).
- Otherwise, `ivf_min = 0` (no such requirement for HNSW).
- **Quantizer Minimum Rows (`quantizer_min`)**:
- If `quantizer == PQ`: `quantizer_min = pq_m * (1 << pq_nbits) * 256`
- `pq_m`: Number of sub-quantizers.
- `pq_nbits`: Bits per sub-quantizer, `1 << pq_nbits` is the codebook
size (2^pq_nbits).
- 256: Empirical multiplier to ensure sufficient samples per codebook.
- If `quantizer == SQ4` or `SQ8`: `quantizer_min = 20` (fixed empirical
value).
- For other quantizers (e.g., FLAT): `quantizer_min = 0`.
- **Final Result**: `std::max(ivf_min, quantizer_min)`.
**Examples**:
- IVF + PQ (pq_m=2, pq_nbits=8, nlist=10): max(10, 2*256*256) = max(10,
131072) = 131072.
- HNSW + SQ: max(0, 20) = 20.
This calculation ensures sufficient training data during index construction
to avoid invalid indexes.
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
- [ ] No.
- [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
- [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
https://github.com/apache/doris-website/pull/1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should
merge into -->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]