github-actions[bot] commented on code in PR #64082:
URL: https://github.com/apache/doris/pull/64082#discussion_r3370497331
##########
be/src/storage/index/ann/ann_index_writer.cpp:
##########
@@ -109,26 +111,10 @@ Status AnnIndexColumnWriter::add_array_values(size_t
field_size, const void* val
const float* p = reinterpret_cast<const float*>(value_ptr);
- const size_t full_elements = AnnIndexColumnWriter::chunk_size() * dim;
- size_t remaining_elements = num_rows * dim;
- size_t src_offset = 0;
- while (remaining_elements > 0) {
- size_t available_space = full_elements - _float_array.size();
- size_t elements_to_add = std::min(remaining_elements, available_space);
-
- _float_array.insert(_float_array.end(), p + src_offset, p + src_offset
+ elements_to_add);
- src_offset += elements_to_add;
- remaining_elements -= elements_to_add;
-
- if (_float_array.size() == full_elements) {
- RETURN_IF_ERROR(
- _vector_index->train(AnnIndexColumnWriter::chunk_size(),
_float_array.data()));
- RETURN_IF_ERROR(
- _vector_index->add(AnnIndexColumnWriter::chunk_size(),
_float_array.data()));
- _float_array.clear();
- _need_save_index = true;
- }
- }
+ // The offsets check above guarantees every array row matches the ANN
index dimension.
+ DCHECK(p != nullptr);
+ _buffered_vectors.insert(_buffered_vectors.end(), p, p + num_rows * dim);
+ _total_rows += cast_set<int64_t>(num_rows);
Review Comment:
This changes the ANN writer from the old bounded chunk buffer
(`ann_index_build_add_chunk_size * dim`) to retaining every vector in the
segment until `finish()`. That makes the allocation `segment_rows * dim *
sizeof(float)` in `_buffered_vectors`, and ANN `dim` is only validated as
positive while segment splitting is not based on this buffer. For example, a
high-dimensional ANN column can accumulate hundreds of MB or GB in this
PODArray before FAISS train/add runs, and this allocation is not reserved
against a Doris `MemTracker`. This reintroduces the OOM risk the PR is trying
to avoid, just during load/append instead of `init()`. Please keep the training
input bounded/tracked (or enforce a byte cap/reservation and fail cleanly)
instead of unconditionally buffering the full segment.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]