zhiqiang-hhhh commented on code in PR #57623:
URL: https://github.com/apache/doris/pull/57623#discussion_r2502584246


##########
be/src/olap/rowset/segment_v2/ann_index/ann_index_writer.cpp:
##########
@@ -104,10 +104,28 @@ Status AnnIndexColumnWriter::add_array_values(size_t field_size, const void* val
     }
 
     const float* p = reinterpret_cast<const float*>(value_ptr);
+
+    // Add to buffer
+    _ann_vec.insert(_ann_vec.end(), p, p + num_rows * dim);
+    DCHECK(_ann_vec.size() % dim == 0);
+
     // Train the index if needed
     // Faiss index will do nothing if index does not need train.
-    RETURN_IF_ERROR(_vector_index->train(num_rows, p));
-    RETURN_IF_ERROR(_vector_index->add(num_rows, p));
+    vectorized::Int64 chunk_size = 1'000'000;
+    vectorized::Int64 i = 0;
+
+    // train/add chunk_size rows once. if the remaining rows are less than chunk_size, nothing will happen.
+    // In finish(), all remaining data will be trained and added.
+    for (; i + chunk_size < _ann_vec.size() / dim; i += chunk_size) {
+        RETURN_IF_ERROR(_vector_index->train(chunk_size, _ann_vec.data() + i * dim));
+        RETURN_IF_ERROR(_vector_index->add(chunk_size, _ann_vec.data() + i * dim));
+    }
+
+    if (i > 0) {
+        vectorized::Int64 offset = i * dim;
+        std::copy(_ann_vec.begin() + offset, _ann_vec.end(), _ann_vec.begin());

Review Comment:
   The cost of this memory copy could be reduced by using `std::list<std::shared_ptr<DorisVector>>`.
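
One possible reading of this suggestion, as a minimal sketch rather than the PR's actual implementation: keep the buffered floats in fixed-size chunks held by a `std::list`, so a consumed chunk is released by popping a list node instead of shifting the remaining tail with `std::copy`. The class name `ChunkedVectorBuffer`, its methods, and the use of `std::vector<float>` as a stand-in for `DorisVector` are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for DorisVector<float>; assumed here only for illustration.
using FloatVec = std::vector<float>;

// Hypothetical helper: buffers floats in fixed-size chunks so that a consumed
// chunk is dropped by popping a list node instead of shifting the tail.
class ChunkedVectorBuffer {
public:
    ChunkedVectorBuffer(size_t dim, size_t chunk_rows)
            : _chunk_floats(dim * chunk_rows) {}

    // Copy num_floats floats (num_rows * dim) into the buffer,
    // starting a new chunk whenever the current tail chunk is full.
    void append(const float* p, size_t num_floats) {
        while (num_floats > 0) {
            if (_chunks.empty() || _chunks.back()->size() == _chunk_floats) {
                auto chunk = std::make_shared<FloatVec>();
                chunk->reserve(_chunk_floats);
                _chunks.push_back(std::move(chunk));
            }
            FloatVec& tail = *_chunks.back();
            size_t n = std::min(num_floats, _chunk_floats - tail.size());
            tail.insert(tail.end(), p, p + n);
            p += n;
            num_floats -= n;
        }
    }

    // Pop the oldest chunk if it is full; returns nullptr otherwise.
    std::shared_ptr<FloatVec> pop_full_chunk() {
        if (_chunks.empty() || _chunks.front()->size() < _chunk_floats) {
            return nullptr;
        }
        auto chunk = std::move(_chunks.front());
        _chunks.pop_front();
        return chunk;
    }

    // Total number of floats still held across all chunks.
    size_t buffered_floats() const {
        size_t total = 0;
        for (const auto& c : _chunks) total += c->size();
        return total;
    }

private:
    size_t _chunk_floats;
    std::list<std::shared_ptr<FloatVec>> _chunks;
};
```

With a buffer like this, `add_array_values` could call `append(p, num_rows * dim)` and then pass each chunk returned by `pop_full_chunk()` to `train` and `add`, leaving any partially filled chunk for `finish()`. Note that this sketch hands over a chunk as soon as it is full, which differs slightly from the strict `<` comparison in the loop above.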



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

