Re: [PR] [improvement](be) Train ANN index once instead of on every chunk [doris]

via GitHub Fri, 05 Jun 2026 02:06:00 -0700


github-actions[bot] commented on code in PR #64145:
URL: https://github.com/apache/doris/pull/64145#discussion_r3361597742



##########
be/src/storage/index/ann/ann_index_writer.cpp:
##########
@@ -108,28 +116,7 @@ Status AnnIndexColumnWriter::add_array_values(size_t 
field_size, const void* val
     }
 
     const float* p = reinterpret_cast<const float*>(value_ptr);
-
-    const size_t full_elements = AnnIndexColumnWriter::chunk_size() * dim;
-    size_t remaining_elements = num_rows * dim;
-    size_t src_offset = 0;
-    while (remaining_elements > 0) {
-        size_t available_space = full_elements - _float_array.size();
-        size_t elements_to_add = std::min(remaining_elements, available_space);
-
-        _float_array.insert(_float_array.end(), p + src_offset, p + src_offset 
+ elements_to_add);
-        src_offset += elements_to_add;
-        remaining_elements -= elements_to_add;
-
-        if (_float_array.size() == full_elements) {
-            RETURN_IF_ERROR(
-                    _vector_index->train(AnnIndexColumnWriter::chunk_size(), 
_float_array.data()));
-            RETURN_IF_ERROR(
-                    _vector_index->add(AnnIndexColumnWriter::chunk_size(), 
_float_array.data()));
-            _float_array.clear();
-            _need_save_index = true;
-        }
-    }
-
+    _float_array.insert(_float_array.end(), p, p + num_rows * dim);
     return Status::OK();

Review Comment:
   This removes the only bound on `_float_array`: every `add_array_values()` 
call now appends into one buffer and nothing is flushed until `finish()`. For 
ANN dimensions this can become very large per segment/build path (load, 
compaction, schema change, build-index-on-existing-data); e.g. `rows * dim * 
sizeof(float)` is no longer capped by `ann_index_build_chunk_size`. The old 
code reserved/flushed one chunk, but after this line the config is effectively 
ignored and memory is not admitted/tracked before the growth. Please keep 
bounded buffering (or add explicit sampling/training-buffer admission plus a 
separate bounded add path) rather than retaining the full segment in memory.



##########
regression-test/suites/ann_index_p0/ann_ivf_pq_train_once_recall.groovy:
##########
@@ -0,0 +1,191 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Regression for #63913: the ANN index writer must train the FAISS quantizer
+// EXACTLY ONCE per index build, not once per buffered chunk.
+//
+// Background: AnnIndexColumnWriter buffers vectors and flushes them one chunk 
at
+// a time (ann_index_build_chunk_size). The buggy code called train() on every
+// chunk. For a PQ (product-quantization) index, train() re-fits the codebook 
on
+// the latest chunk, but vectors from earlier chunks were already add()ed and
+// encoded under the previous codebook. After the final chunk re-trains, those
+// earlier codes no longer match the stored codebook, so they decode to garbage
+// distances at query time -> recall collapses on any segment that spans more
+// than one chunk.
+//
+// This test shrinks ann_index_build_chunk_size so a single 20k-row segment 
spans
+// 10 chunks, builds an IVF+PQ index, and asserts recall@10 (vs exact 
brute-force
+// l2_distance) stays high. On a buggy BE this recall drops to ~0.1 and the 
test
+// fails; on the fixed BE it stays high. An IVF+FLAT table loaded from the same
+// data is used as a positive control (FLAT has no codebook, so it is 
unaffected
+// and must reach near-exact recall) -- this proves the harness can achieve 
high
+// recall, so a low PQ recall is specifically the train-reentry bug.
+//
+// nonConcurrent: it temporarily changes a global BE config.
+suite("ann_ivf_pq_train_once_recall", "nonConcurrent") {
+    def dim       = 32
+    def nRows     = 20000
+    def chunkSize = 2000      // 20000 / 2000 = 10 chunks per segment -> bug 
triggers hard
+    def nlist     = 64
+    def topk      = 10
+    def nQueries  = 30
+    def rnd       = new Random(42)   // fixed seed -> reproducible
+
+    // -- generate i.i.d. gaussian base vectors as a stream-load CSV 
(id|[v0,...]) --
+    // The vector itself contains commas, so use '|' as the column separator.
+    def sb = new StringBuilder()
+    for (int i = 0; i < nRows; i++) {
+        sb.append(i).append('|').append('[')
+        for (int d = 0; d < dim; d++) {
+            float v = (float) rnd.nextGaussian()
+            if (d > 0) sb.append(',')
+            sb.append(String.format(Locale.US, '%.6f', v))   // Locale.US: 
never a comma decimal
+        }
+        sb.append(']').append('\n')
+    }
+    def csv = sb.toString()
+
+    // -- query vectors (independent random) --
+    def queries = new float[nQueries][dim]
+    for (int q = 0; q < nQueries; q++) {
+        for (int d = 0; d < dim; d++) {
+            queries[q][d] = (float) rnd.nextGaussian()
+        }
+    }
+
+    def vecLiteral = { float[] v ->
+        def s = new StringBuilder('[')
+        for (int d = 0; d < v.length; d++) {
+            if (d > 0) s.append(',')
+            s.append(String.format(Locale.US, '%.6f', v[d]))
+        }
+        s.append(']')
+        return s.toString()
+    }
+
+    def idsOf = { String q ->
+        def rows = sql q
+        return rows.collect { (it[0] as long) } as Set
+    }
+
+    // recall@topk averaged over all queries: approx (uses index) vs exact 
(brute force)
+    def measureRecall = { String table ->
+        double total = 0.0d
+        for (int q = 0; q < nQueries; q++) {
+            def lit    = vecLiteral(queries[q])
+            def approx = idsOf("select id from ${table} order by 
l2_distance_approximate(vec, ${lit}) limit ${topk}".toString())
+            def exact  = idsOf("select id from ${table} order by 
l2_distance(vec, ${lit}) limit ${topk}".toString())
+            total += (approx.intersect(exact).size() / (double) topk)
+        }
+        return total / nQueries
+    }
+
+    def loadCsv = { String table ->
+        streamLoad {
+            table "${table}"
+            set 'column_separator', '|'
+            set 'columns', 'id, vec'
+            inputStream new ByteArrayInputStream(csv.getBytes("UTF-8"))
+            time 120000
+            check { result, exception, startTime, endTime ->
+                if (exception != null) {
+                    throw exception
+                }
+                def json = parseJson(result)
+                assertEquals("success", json.Status.toLowerCase())
+                assertEquals(nRows, json.NumberLoadedRows)
+                assertEquals(0, json.NumberFilteredRows)
+            }
+        }
+    }
+
+    sql "set enable_common_expr_pushdown = true"
+    sql "set enable_ann_index_result_cache = false"   // avoid cache masking 
real index behavior
+    // Scan all lists so IVF coarse-quantization adds no approximation: this 
isolates
+    // PQ-codebook correctness, which is what the bug breaks.
+    sql "set ivf_nprobe = ${nlist}"
+
+    setBeConfigTemporary([ann_index_build_chunk_size: chunkSize]) {

Review Comment:
   This regression no longer tests what its comments claim. The final 
`AnnIndexColumnWriter::add_array_values()` ignores `ann_index_build_chunk_size` 
and buffers all rows until `finish()`, so lowering this config does not create 
10 writer flush chunks or exercise the former retrain-on-each-chunk path. As 
written, the test can pass while the chunk-size behavior is dead/stale. Please 
either restore chunked flushing and keep this as the multi-chunk regression, or 
rewrite the test/comment/config usage around the final all-at-finish behavior.



##########
be/test/storage/index/ann/ann_index_writer_test.cpp:
##########
@@ -427,11 +428,13 @@ TEST_F(AnnIndexWriterTest, TestAddMoreThanChunkSize) {
     ASSERT_TRUE(writer->init().ok());
     writer->set_vector_index(mock_index);
 
+    // Index requires training (e.g. IVF/PQ). After the fix train() is invoked 
only
+    // once on the first full chunk; the remainder reuses the already-trained 
state.
+    EXPECT_CALL(*mock_index, 
needs_training()).WillRepeatedly(testing::Return(true));

Review Comment:
   These expectations still describe the pre-final chunked behavior. With the 
final writer implementation, the 12 rows accumulated by `add_array_values()` 
are not flushed at chunk size 10; `finish()` calls `_train_once_if_needed(12, 
...)` and `add(12, ...)`, not `train(10)`, `add(10)`, and `add(2)`. The same 
stale pattern appears in the later chunk/remainder tests 
(`TestAddMoreThanChunkSizeIVF`, `TestLargeDataVolume*`, `TestPQMinTrainRows`, 
`TestPQWithSufficientData`, `TestHnswSkipsTrainCall`), so the focused BE UTs 
will fail once the test binary can be built. Please update the tests to match 
the final behavior or restore chunked add semantics.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [improvement](be) Train ANN index once instead of on every chunk [doris]

Reply via email to