github-actions[bot] commented on code in PR #64082:
URL: https://github.com/apache/doris/pull/64082#discussion_r3370497331


##########
be/src/storage/index/ann/ann_index_writer.cpp:
##########
@@ -109,26 +111,10 @@ Status AnnIndexColumnWriter::add_array_values(size_t 
field_size, const void* val
 
     const float* p = reinterpret_cast<const float*>(value_ptr);
 
-    const size_t full_elements = AnnIndexColumnWriter::chunk_size() * dim;
-    size_t remaining_elements = num_rows * dim;
-    size_t src_offset = 0;
-    while (remaining_elements > 0) {
-        size_t available_space = full_elements - _float_array.size();
-        size_t elements_to_add = std::min(remaining_elements, available_space);
-
-        _float_array.insert(_float_array.end(), p + src_offset, p + src_offset 
+ elements_to_add);
-        src_offset += elements_to_add;
-        remaining_elements -= elements_to_add;
-
-        if (_float_array.size() == full_elements) {
-            RETURN_IF_ERROR(
-                    _vector_index->train(AnnIndexColumnWriter::chunk_size(), 
_float_array.data()));
-            RETURN_IF_ERROR(
-                    _vector_index->add(AnnIndexColumnWriter::chunk_size(), 
_float_array.data()));
-            _float_array.clear();
-            _need_save_index = true;
-        }
-    }
+    // The offsets check above guarantees every array row matches the ANN 
index dimension.
+    DCHECK(p != nullptr);
+    _buffered_vectors.insert(_buffered_vectors.end(), p, p + num_rows * dim);
+    _total_rows += cast_set<int64_t>(num_rows);

Review Comment:
   This changes the ANN writer from the old bounded chunk buffer 
(`ann_index_build_add_chunk_size * dim`) to retaining every vector in the 
segment until `finish()`. That makes the allocation `segment_rows * dim * 
sizeof(float)` in `_buffered_vectors`, and ANN `dim` is only validated as 
positive while segment splitting is not based on this buffer. For example, a 
high-dimensional ANN column can accumulate hundreds of MB or GB in this 
PODArray before FAISS train/add runs, and this allocation is not reserved 
against a Doris `MemTracker`. This reintroduces the OOM risk the PR is trying 
to avoid, just during load/append instead of `init()`. Please keep the training 
input bounded/tracked (or enforce a byte cap/reservation and fail cleanly) 
instead of unconditionally buffering the full segment.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to