github-actions[bot] commented on code in PR #60358:
URL: https://github.com/apache/doris/pull/60358#discussion_r2876590603


##########
be/src/olap/rowset/segment_v2/ann_index/ann_index_iterator.cpp:
##########
@@ -38,6 +49,8 @@ Status AnnIndexIterator::read_from_index(const IndexParam& param) {
 
     // _context may be unset in some test scenarios; pass nullptr IOContext in that case.
     io::IOContext* io_ctx = (_context != nullptr) ? _context->io_ctx : nullptr;
+    LOG_INFO("_context of ann index iterator is {}", (_context != nullptr) ? "not null" : "null");

Review Comment:
   **Remove this LOG_INFO.** This fires on every ANN query call and provides no 
diagnostic value in production. It was likely added for debugging.



##########
be/src/olap/rowset/segment_v2/index_file_reader.cpp:
##########
@@ -42,6 +43,10 @@ Status IndexFileReader::init(int32_t read_buffer_size, const io::IOContext* io_c
Status IndexFileReader::_init_from(int32_t read_buffer_size, const io::IOContext* io_ctx) {
     auto index_file_full_path = InvertedIndexDescriptor::get_index_file_path_v2(_index_path_prefix);
 
+    LOG(INFO) << "[DEBUG] IndexFileReader::_init_from start, index_path_prefix: "
+              << _index_path_prefix << ", index_file_full_path: " << index_file_full_path
+              << ", read_buffer_size: " << read_buffer_size;

Review Comment:
   **Critical: Remove debug logging before merge.** This file has ~15 `LOG(INFO)` statements tagged `[DEBUG]` that will fire on every index file read (not just ANN index reads). This is the hot path for all inverted index operations and will flood production logs. They should be removed entirely or downgraded to `VLOG_DEBUG`.
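   For context on why `VLOG_DEBUG` is the safer choice on a hot path, here is a minimal self-contained sketch of verbosity-gated logging. The types and names below are stand-ins for illustration only, not Doris's actual logging internals:

```cpp
#include <string>
#include <vector>

// Toy illustration of verbosity gating: messages above the active verbosity
// level are dropped before any storage or I/O happens, so a hot path pays
// almost nothing when verbose logging is disabled (the production default).
struct VerbosityLog {
    int level = 0; // production default: verbose logging off
    std::vector<std::string> lines;

    void vlog(int msg_level, const std::string& msg) {
        if (msg_level > level) return; // gated out on the hot path
        lines.push_back(msg);          // only emitted when explicitly enabled
    }
};
```

An operator debugging an index-read issue would raise the level at runtime to see these messages, instead of every cluster paying the logging cost unconditionally.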



##########
be/src/olap/rowset/segment_v2/ann_index/ann_index_reader.cpp:
##########
@@ -87,23 +89,30 @@ Status AnnIndexReader::load_index(io::IOContext* io_ctx) {
             _vector_index->set_type(_index_type);
             RETURN_IF_ERROR(_vector_index->load(compound_dir->get()));
         } catch (CLuceneError& err) {
+            LOG_ERROR("Failed to load ann index: {}", err.what());
             return Status::Error<ErrorCode::INVERTED_INDEX_CLUCENE_ERROR>(
                     "CLuceneError occur when open ann idx file, error msg: {}", err.what());
         }
         return Status::OK();
     });
 }
 
-Status AnnIndexReader::query(io::IOContext* io_ctx, AnnTopNParam* param, AnnIndexStats* stats) {
+bool AnnIndexReader::try_load_index(io::IOContext* io_ctx) {
 #ifndef BE_TEST
-    {
-        SCOPED_TIMER(&(stats->load_index_costs_ns));
-        RETURN_IF_ERROR(load_index(io_ctx));
-        double load_costs_ms = static_cast<double>(stats->load_index_costs_ns.value()) / 1000.0;
-        DorisMetrics::instance()->ann_index_load_costs_ms->increment(
-                static_cast<int64_t>(load_costs_ms));
+    Status st = load_index(io_ctx);

Review Comment:
   **Lost metrics tracking.** The original code had:
   ```cpp
   SCOPED_TIMER(&(stats->load_index_costs_ns));
   RETURN_IF_ERROR(load_index(io_ctx));
   double load_costs_ms = ...
   DorisMetrics::instance()->ann_index_load_costs_ms->increment(...);
   ```
   By moving load to `try_load_index()` (which has no `stats` parameter), both 
`load_index_costs_ns` timing and `ann_index_load_costs_ms` metrics are 
completely lost. You should either pass the stats object to `try_load_index()` 
or record the timing at the call site in `segment_iterator.cpp`.
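   One way to keep the metric alive is to hand the counter into `try_load_index()` and time the load where it happens. A self-contained sketch of the idea — `NsCounter` and `ScopedNsTimer` here are hypothetical stand-ins for Doris's `RuntimeProfile` counter and `SCOPED_TIMER`, not the real types:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Stand-in for a RuntimeProfile nanosecond counter.
struct NsCounter {
    std::atomic<int64_t> ns{0};
};

// RAII guard that adds elapsed wall time to the counter on scope exit,
// mimicking what SCOPED_TIMER does in the original code.
class ScopedNsTimer {
public:
    explicit ScopedNsTimer(NsCounter* c)
            : _counter(c), _start(std::chrono::steady_clock::now()) {}
    ~ScopedNsTimer() {
        auto elapsed = std::chrono::steady_clock::now() - _start;
        _counter->ns.fetch_add(
                std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count());
    }

private:
    NsCounter* _counter;
    std::chrono::steady_clock::time_point _start;
};

// Sketch of try_load_index taking the stats counter, so the timing is
// recorded at the same place the load runs rather than being dropped.
bool try_load_index_with_stats(NsCounter* load_costs_ns) {
    ScopedNsTimer timer(load_costs_ns);
    // ... load_index(io_ctx) would run here ...
    return true;
}
```

The same counter value can then feed the `ann_index_load_costs_ms` metric increment, as the original code did.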



##########
be/src/olap/rowset/segment_v2/ann_index/faiss_ann_index.cpp:
##########
@@ -291,6 +291,31 @@ doris::Status FaissVectorIndex::add(vectorized::Int64 n, const float* vec) {
     return doris::Status::OK();
 }
 
+vectorized::Int64 FaissVectorIndex::get_min_train_rows() const {
+    // For IVF indexes, the minimum number of training points should be at least
+    // equal to the number of clusters (nlist). FAISS requires this for k-means clustering.
+    vectorized::Int64 ivf_min = 0;
+    if (_params.index_type == FaissBuildParameter::IndexType::IVF) {
+        ivf_min = _params.ivf_nlist;
+    }
+
+    // Calculate minimum training rows required by the quantizer
+    vectorized::Int64 quantizer_min = 0;
+    if (_params.quantizer == FaissBuildParameter::Quantizer::PQ) {
+        // For PQ, we need to make sure each sub-quantizer has enough training vectors.
+        // See code from contrib/faiss/faiss/impl/ProductQuantizer.cpp::65
+        quantizer_min = (1LL << _params.pq_nbits) * 100;
+    } else if (_params.quantizer == FaissBuildParameter::Quantizer::SQ4 ||
+               _params.quantizer == FaissBuildParameter::Quantizer::SQ8) {
+        // For SQ, use a minimum of 20 training vectors, similar to IVF's nlist * 2 with nlist=10
+        quantizer_min = 1;

Review Comment:
   **Comment/code mismatch.** Comment says "use a minimum of 20 training 
vectors, similar to IVF's nlist * 2 with nlist=10" but the code sets 
`quantizer_min = 1`. Should this be `quantizer_min = 20`? Or if 1 is correct, 
the comment should be updated.
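   A standalone sketch of how the reconciled function might read, assuming the comment's value of 20 is the intended one and that the two minimums combine via max (the combination isn't visible in this hunk, so that is an assumption):

```cpp
#include <cstdint>

// Hypothetical standalone version of get_min_train_rows() with the comment
// and code agreeing. Whether SQ's minimum should be 20 (per the comment) or
// 1 (per the current code) is for the author to decide; this just shows the
// reconciled form.
enum class IndexType { HNSW, IVF };
enum class Quantizer { FLAT, PQ, SQ4, SQ8 };

int64_t min_train_rows(IndexType index_type, Quantizer quantizer,
                       int64_t ivf_nlist, int64_t pq_nbits) {
    // IVF needs at least nlist points for k-means to form nlist clusters.
    int64_t ivf_min = (index_type == IndexType::IVF) ? ivf_nlist : 0;

    int64_t quantizer_min = 0;
    if (quantizer == Quantizer::PQ) {
        // Each PQ sub-quantizer trains 2^nbits centroids; FAISS wants roughly
        // 100 points per centroid (see faiss/impl/ProductQuantizer.cpp).
        quantizer_min = (int64_t(1) << pq_nbits) * 100;
    } else if (quantizer == Quantizer::SQ4 || quantizer == Quantizer::SQ8) {
        quantizer_min = 20; // matches the comment's stated minimum
    }

    // Assumed combination: the stricter of the two constraints wins.
    return ivf_min > quantizer_min ? ivf_min : quantizer_min;
}
```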



##########
be/src/olap/rowset/segment_v2/ann_index/ann_index_writer.cpp:
##########
@@ -151,16 +152,50 @@ int64_t AnnIndexColumnWriter::size() const {
 }
 
 Status AnnIndexColumnWriter::finish() {
+    vectorized::Int64 min_train_rows = _vector_index->get_min_train_rows();
+
+    // Check if we have enough rows to train the index
     // train/add the remaining data
-    if (!_float_array.empty()) {
+    if (_float_array.empty()) {
+        if (_need_save_index) {
+            return _vector_index->save(_dir.get());
+        } else {
+            // No data was added at all. This can happen if the segment has 0 rows
+            // or all rows were filtered out. We need to delete the directory entry
+            // to avoid writing an empty/invalid index file.
+            LOG_INFO("No data to train/add for ANN index. Skipping index building.");
+            return _index_file_writer->delete_index(_index_meta);
+        }
+    } else {
         DCHECK(_float_array.size() % _vector_index->get_dimension() == 0);
         vectorized::Int64 num_rows = _float_array.size() / _vector_index->get_dimension();
-        RETURN_IF_ERROR(_vector_index->train(num_rows, _float_array.data()));
-        RETURN_IF_ERROR(_vector_index->add(num_rows, _float_array.data()));
-        _float_array.clear();
+
+        if (num_rows >= min_train_rows) {
+            RETURN_IF_ERROR(_vector_index->train(num_rows, _float_array.data()));
+            RETURN_IF_ERROR(_vector_index->add(num_rows, _float_array.data()));
+            return _vector_index->save(_dir.get());
+        } else {
+            // It happens to have not enough data to train.
+            // If we have data to add before, we still need to save the index.
+            if (_need_save_index) {
+                RETURN_IF_ERROR(_vector_index->add(num_rows, _float_array.data()));
+                return _vector_index->save(_dir.get());
+            } else {
+                // Not enough data to train and no data added before.
+                // Means this is a very small segment, we can skip the index building.
+                // We need to delete the directory entry from index_file_writer to avoid
+                // writing an empty/invalid index file which causes "IndexInput read past EOF" error.
+                LOG_INFO(
+                        "Remaining data size {} is less than minimum {} rows required for ANN "
+                        "index "
+                        "training. Skipping index building for this segment.",
+                        num_rows, min_train_rows);
+                return _index_file_writer->delete_index(_index_meta);
+            }
+        }
     }
 
-    return _vector_index->save(_dir.get());
+    return Status::OK();

Review Comment:
   **Dead code.** This `return Status::OK()` is unreachable because every 
branch in the `if (_float_array.empty()) ... else ...` block above already 
returns. Consider removing it to avoid confusion.



##########
be/src/olap/rowset/segment_v2/ann_index/ann_index_writer.cpp:
##########
@@ -151,16 +152,50 @@ int64_t AnnIndexColumnWriter::size() const {
 }
 
 Status AnnIndexColumnWriter::finish() {
+    vectorized::Int64 min_train_rows = _vector_index->get_min_train_rows();
+
+    // Check if we have enough rows to train the index
     // train/add the remaining data
-    if (!_float_array.empty()) {
+    if (_float_array.empty()) {
+        if (_need_save_index) {
+            return _vector_index->save(_dir.get());
+        } else {
+            // No data was added at all. This can happen if the segment has 0 rows
+            // or all rows were filtered out. We need to delete the directory entry
+            // to avoid writing an empty/invalid index file.
+            LOG_INFO("No data to train/add for ANN index. Skipping index building.");
+            return _index_file_writer->delete_index(_index_meta);
+        }
+    } else {
         DCHECK(_float_array.size() % _vector_index->get_dimension() == 0);
         vectorized::Int64 num_rows = _float_array.size() / _vector_index->get_dimension();
-        RETURN_IF_ERROR(_vector_index->train(num_rows, _float_array.data()));
-        RETURN_IF_ERROR(_vector_index->add(num_rows, _float_array.data()));
-        _float_array.clear();
+
+        if (num_rows >= min_train_rows) {
+            RETURN_IF_ERROR(_vector_index->train(num_rows, _float_array.data()));
+            RETURN_IF_ERROR(_vector_index->add(num_rows, _float_array.data()));
+            return _vector_index->save(_dir.get());
+        } else {
+            // It happens to have not enough data to train.
+            // If we have data to add before, we still need to save the index.
+            if (_need_save_index) {
+                RETURN_IF_ERROR(_vector_index->add(num_rows, _float_array.data()));

Review Comment:
   **Potential correctness concern:** When `_need_save_index == true` (previous 
chunks were trained+added) but the remaining rows are fewer than 
`min_train_rows`, you call `add()` without `train()`. For IVF indexes, this 
means the remaining vectors are added to an index whose quantizer was trained 
on previous batches. This is generally fine for FAISS since the quantizer is 
already trained, but it means these vectors won't benefit from the last batch's 
clustering. Worth a comment explaining this is intentional.
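   A toy model of the flush decision, with a comment marking the add-only path as intentional. `ToyIvfIndex` is a stand-in for the FAISS-backed index, not the real `FaissVectorIndex`; the one behavior it borrows from FAISS IVF is that `add` is rejected before `train` has run:

```cpp
#include <cstdint>

// Minimal stand-in for an IVF-style index: add() requires prior training.
struct ToyIvfIndex {
    bool trained = false;
    int64_t added = 0;

    void train(int64_t /*n*/) { trained = true; }

    bool add(int64_t n) {
        if (!trained) return false; // IVF-style: add before train is invalid
        added += n;
        return true;
    }
};

// Sketch of the per-batch flush logic. Returns false only if add is
// attempted on an untrained index.
bool flush_batch(ToyIvfIndex& idx, int64_t num_rows, int64_t min_train_rows) {
    if (num_rows >= min_train_rows) {
        idx.train(num_rows);
        return idx.add(num_rows);
    }
    // Intentional: the quantizer was already trained on earlier batches, so
    // these remaining rows are assigned to existing clusters without
    // influencing the centroids. An undertrained tail batch alone, with no
    // prior training, is skipped instead.
    return idx.trained ? idx.add(num_rows) : false;
}
```

A comment along the lines of the one above on the `_need_save_index` branch would make the reviewer's point explicit in the real code.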



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

