[PR] [WIP](Ann) Cancel index building if input rows is less than the min_train_rows [doris]

via GitHub Thu, 29 Jan 2026 01:42:53 -0800


zhiqiang-hhhh opened a new pull request, #60358:
URL: https://github.com/apache/doris/pull/60358


   ### 1. Background of the Problem
   
   In Doris's ANN (Approximate Nearest Neighbor) index implementation, 
quantizers (such as PQ - Product Quantization and SQ - Scalar Quantization) 
require sufficient training data to build effective indexes. If the training 
data is insufficient, index construction may fail or be skipped, leading to 
degraded query performance or errors.
   
   Initially, the code only considered the minimum training rows for the IVF 
(Inverted File) index type (based on the number of clusters `nlist`), but did 
not adequately account for the quantizer's own requirements. Additionally, the 
test suite quantizer_min_train_rows.groovy only covered the IVF index type and 
did not include the HNSW (Hierarchical Navigable Small World) index type, which 
also supports quantizers.
   
   The user discovered:
   - The HNSW index type was set as the default in the code, but its minimum 
training rows calculation was incomplete.
   - The regression tests lacked test cases for HNSW, resulting in incomplete 
coverage.
   - The formulas for minimum training rows for PQ and SQ needed adjustment 
based on FAISS implementation to ensure accuracy.
   
   This could cause tests to fail or inaccurately validate index construction 
logic.
   
   ### 2. How It Was Resolved
   
   #### a. Modified the `get_min_train_rows()` Function in faiss_ann_index.cpp
   - **Original Logic**: Only returned IVF's `nlist` or 0 (for HNSW).
   - **New Logic**:
     - Calculate IVF minimum rows: If `index_type == IVF`, then `ivf_min = 
ivf_nlist`; otherwise, `ivf_min = 0`.
     - Calculate quantizer minimum rows:
       - For PQ: `pq_m * (1 << pq_nbits) * 256` (based on FAISS's codebook 
count and empirical multiplier).
       - For SQ: Fixed at 20 (empirical value, applicable to both IVF and HNSW).
     - Return `std::max(ivf_min, quantizer_min)` to ensure both IVF and 
quantizer requirements are met.
   - **Purpose**: Enable HNSW to correctly check quantizer training data, 
preventing the construction of invalid indexes when data is insufficient.
   
   #### b. Updated the Regression Test quantizer_min_train_rows.groovy
   - **Added HNSW Test Cases**:
     - HNSW PQ insufficient: Insert 3 rows (< 131072), verify index 
construction is skipped.
     - HNSW SQ insufficient: Insert 3 rows (< 20), verify index construction is 
skipped.
     - HNSW PQ sufficient: Insert 2048 rows (>= 2048), verify index 
construction succeeds.
     - HNSW SQ sufficient: Insert 12 rows (>= 20), verify index construction 
succeeds.
   - **Adjusted Existing IVF Tests**:
     - Updated PQ sufficient test: Insert 2048 rows (previously incorrectly 
inserted 200 rows, causing test failure).
     - Updated comments: Changed from `nlist * 2` to `20` for SQ, and clarified 
PQ formula as `pq_m * (1 << pq_nbits) * 256`.
   - **Cleanup**: Added drop statements for HNSW tables.
   - **Purpose**: Ensure tests cover all index types and quantizer 
combinations, validating that data insertion succeeds but index construction is 
skipped or performed based on min_train_rows.
   
   #### c. Verification
   - Ran regression tests: Confirmed all test cases pass (exit code 0), with 
normal data insertion and correct index construction logic.
   
   ### 3. Notes
   
   - **Compatibility**: Changes only affect ANN index training data checks and 
do not impact existing queries or index usage. Ensure sufficient data volume in 
production to avoid index construction failures.
   - **Performance Impact**: Minimum training rows checks occur during index 
construction and do not affect query performance. If data is insufficient, 
indexes are not built, and users must insert more data and rebuild manually.
   - **Test Coverage**: New HNSW tests ensure comprehensive coverage. Note 
parameter differences between IVF and HNSW (HNSW lacks `nlist` but has 
`max_degree` and `ef_construction`).
   - **Formula Accuracy**: PQ formula is based on FAISS codebook count (`pq_m * 
2^pq_nbits`) multiplied by empirical factor 256. SQ's 20 is an empirical value 
and may need adjustment based on actual FAISS behavior.
   - **Error Handling**: If training data is insufficient, the code skips index 
construction and logs it, without throwing exceptions, ensuring data insertion 
succeeds.
   - **Version Dependencies**: These changes rely on FAISS library 
implementation; ensure third-party library versions are consistent.
   
   ### 4. How min_train_rows Is Calculated
   
   `min_train_rows` is calculated via `FaissVectorIndex::get_min_train_rows()`, 
returning the larger value of IVF and quantizer requirements:
   
   - **IVF Minimum Rows (`ivf_min`)**:
     - If `index_type == IVF`, then `ivf_min = ivf_nlist` (number of clusters).
     - Otherwise, `ivf_min = 0` (no such requirement for HNSW).
   
   - **Quantizer Minimum Rows (`quantizer_min`)**:
     - If `quantizer == PQ`: `quantizer_min = pq_m * (1 << pq_nbits) * 256`
       - `pq_m`: Number of sub-quantizers.
       - `pq_nbits`: Bits per sub-quantizer, `1 << pq_nbits` is the codebook 
size (2^pq_nbits).
       - 256: Empirical multiplier to ensure sufficient samples per codebook.
     - If `quantizer == SQ4` or `SQ8`: `quantizer_min = 20` (fixed empirical 
value).
     - For other quantizers (e.g., FLAT): `quantizer_min = 0`.
   
   - **Final Result**: `std::max(ivf_min, quantizer_min)`.
   
   **Examples**:
   - IVF + PQ (pq_m=2, pq_nbits=8, nlist=10): max(10, 2*256*256) = max(10, 
131072) = 131072.
   - HNSW + SQ: max(0, 20) = 20.
   
   This calculation ensures sufficient training data during index construction 
to avoid invalid indexes.
   
   ### Check List (For Author)
   
   - Test <!-- At least one of them must be included. -->
       - [ ] Regression test
       - [ ] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason <!-- Add your reason?  -->
   
   - Behavior changed:
       - [ ] No.
       - [ ] Yes. <!-- Explain the behavior change -->
   
   - Does this need documentation?
       - [ ] No.
       - [ ] Yes. <!-- Add document PR link here. eg: 
https://github.com/apache/doris-website/pull/1214 -->
   
   ### Check List (For Reviewer who merge this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label <!-- Add branch pick label that this PR should 
merge into -->
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [WIP](Ann) Cancel index building if input rows is less than the min_train_rows [doris]

Reply via email to