This is an automated email from the ASF dual-hosted git repository.
airborne pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
new 62d4ab47c6c [feature](ann-index) Support IVF on-disk index type for ANN vector search (#61160)
62d4ab47c6c is described below
commit 62d4ab47c6c664e241615576eb08bc92e4439150
Author: zhiqiang <[email protected]>
AuthorDate: Wed Apr 1 13:40:05 2026 +0800
[feature](ann-index) Support IVF on-disk index type for ANN vector search (#61160)
## Summary
This PR introduces a new ANN index type `ivf_on_disk` that stores IVF
inverted list data on disk instead of fully loading it into memory. This
enables vector search on datasets that exceed available memory, with a
dedicated LRU cache for frequently accessed IVF list pages.
## Motivation
The existing `ivf` index type loads the entire IVF index (including all
inverted list codes and IDs) into memory during search. For large-scale
vector datasets (billions of vectors), this makes the memory footprint
prohibitively expensive. The `ivf_on_disk` approach stores the inverted
list data in a separate file (`ann.ivfdata`) and reads only the lists
needed for each query, backed by an LRU cache for hot data.
## Changes
### BE - Core IVF On-Disk Implementation
**New index type `IVF_ON_DISK`:**
- Extended `AnnIndexType` enum with `IVF_ON_DISK` and added string
conversion support (`be/src/storage/index/ann/ann_index.h`,
`ann_index.cpp`)
- Extended `FaissBuildParameter::IndexType` with `IVF_ON_DISK`
(`be/src/storage/index/ann/faiss_ann_index.h`)
- Added `faiss_ivfdata_file_name` constant (`ann.ivfdata`) for the
separate data file (`be/src/storage/index/ann/ann_index_files.h`)
**On-disk save/load in `FaissVectorIndex` (`faiss_ann_index.cpp`):**
- **Save path**: Converts in-memory `ArrayInvertedLists` to
`OnDiskInvertedLists` format, writes list data to `ann.ivfdata` and
index metadata to `ann.faiss`
- **Load path**: Reads `ann.ivfdata` via a `CachedRandomAccessReader`
backed by an LRU cache; replaces the deserialized `PreadInvertedLists`
with a cached reader that provides zero-copy `borrow()` for repeated
list accesses
- Introduced `CachedRandomAccessReader` implementing
`faiss::RandomAccessReader` with per-range LRU caching keyed by
`(file-prefix, file-size, byte-offset)`
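The keying scheme can be illustrated with a small sketch (assumed shape only; the actual `CachedRandomAccessReader` key layout may differ): including the file size in the key means a rewritten file with the same prefix can never alias a stale entry, and keying by exact byte offset matches the fixed ranges IVF always reads for a given list.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical per-range cache key: (file prefix, file size, byte offset).
struct RangeKey {
    std::string file_prefix;
    uint64_t file_size;
    uint64_t offset;
    bool operator==(const RangeKey& o) const {
        return file_prefix == o.file_prefix && file_size == o.file_size && offset == o.offset;
    }
};

struct RangeKeyHash {
    size_t operator()(const RangeKey& k) const {
        size_t h = std::hash<std::string>{}(k.file_prefix);
        h = h * 31 + std::hash<uint64_t>{}(k.file_size);
        h = h * 31 + std::hash<uint64_t>{}(k.offset);
        return h;
    }
};

// A map from range key to the cached bytes of that range.
using RangeCache = std::unordered_map<RangeKey, std::string, RangeKeyHash>;
```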
**Dedicated IVF list cache
(`be/src/storage/cache/ann_index_ivf_list_cache.h/cpp`):**
- New `AnnIndexIVFListCache` class — a dedicated LRU cache separated
from `StoragePageCache` to avoid contention with column data pages
- Configurable capacity via `ann_index_ivf_list_cache_limit` (default: 70% of the process's available memory)
- Registered in `CachePolicy` as `ANN_INDEX_IVF_LIST_CACHE`
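A caller-side sketch of the lookup/insert contract (simplified; the real class hands out `PageCacheHandle` objects that unpin on destruction, which this toy models with explicit pin counts): on a miss the caller reads the list from disk, inserts it, and keeps using the pinned entry; the entry stays resident until unpinned and evicted.

```cpp
#include <string>
#include <unordered_map>

// Toy pinned-entry cache illustrating the lookup/insert shape.
struct Entry {
    std::string data;
    int pins = 0;  // the real cache tracks this via handle lifetime
};

class ListCache {
public:
    // Returns nullptr on miss; on hit the entry is pinned until unpin().
    Entry* lookup(const std::string& key) {
        auto it = _map.find(key);
        if (it == _map.end()) return nullptr;
        ++it->second.pins;
        return &it->second;
    }
    // Inserts (or overwrites) and returns the entry, pinned for the caller.
    Entry* insert(const std::string& key, std::string data) {
        Entry& e = _map[key];
        e.data = std::move(data);
        ++e.pins;
        return &e;
    }
    void unpin(Entry* e) { --e->pins; }

private:
    std::unordered_map<std::string, Entry> _map;
};
```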
**Runtime environment integration:**
- `ExecEnv` now creates/destroys the `AnnIndexIVFListCache` singleton
- Config entries: `ann_index_ivf_list_cache_limit`,
`ann_index_ivf_list_cache_stale_sweep_time_sec`
**Metrics & profiling:**
- Added 6 new metrics: `ann_ivf_on_disk_fetch_page_costs_ms`,
`ann_ivf_on_disk_fetch_page_cnt`, `ann_ivf_on_disk_search_costs_ms`,
`ann_ivf_on_disk_search_cnt`, `ann_ivf_on_disk_cache_hit_cnt`,
`ann_ivf_on_disk_cache_miss_cnt`
- Extended `AnnIndexStats` and `OlapReaderStatistics` with IVF on-disk
counters
- Propagated stats through `AnnIndexReader::query()` and
`range_search()` to `SegmentIterator`
**Search execution:**
- `AnnIndexReader` now handles `IVF_ON_DISK` alongside `IVF` for both
top-N and range search paths
- `ScopedIoCtxBinding` propagates `IOContext` via `thread_local` so
`CachedRandomAccessReader` can attribute file-cache stats to the correct
query
- `ScopedOmpThreadBudget` now uses `condition_variable` to properly
block waiting index builders instead of silently degrading
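The `condition_variable` change can be sketched like this (a toy with an assumed budget of 8 threads, not the actual `ScopedOmpThreadBudget` code): the constructor blocks until the requested budget is free instead of proceeding with fewer threads, and the destructor returns the budget and wakes a waiter.

```cpp
#include <condition_variable>
#include <mutex>

// Toy scoped thread budget: acquire in the constructor (blocking until
// enough budget is available), release and notify in the destructor.
class ScopedThreadBudget {
public:
    explicit ScopedThreadBudget(int want) : _want(want) {
        std::unique_lock<std::mutex> lk(mu());
        cv().wait(lk, [&] { return available() >= _want; });
        available() -= _want;
    }
    ~ScopedThreadBudget() {
        {
            std::lock_guard<std::mutex> lk(mu());
            available() += _want;
        }
        cv().notify_one();  // wake one blocked builder, if any
    }
    // Shared budget, initialized to an assumed limit of 8 threads.
    static int& available() { static int v = 8; return v; }

private:
    static std::mutex& mu() { static std::mutex m; return m; }
    static std::condition_variable& cv() { static std::condition_variable c; return c; }
    int _want;
};
```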
**Index file writer fix:**
- `IndexFileWriter::add_into_searcher_cache()` now correctly skips ANN
indexes (both single-file HNSW/IVF and two-file IVF_ON_DISK) by checking
for `ann.faiss`/`ann.ivfdata` file names
**Compound directory lifetime:**
- `AnnIndexReader` now holds `_compound_dir` alive to prevent
use-after-free when `CachedRandomAccessReader` holds a cloned
`CSIndexInput` whose base pointer references the compound reader's
stream
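The lifetime hazard is easy to reproduce in miniature (all names here are illustrative, not the Doris types): the cloned input holds only a raw pointer into the compound directory's buffer, so the reader must co-own the directory for as long as any clone may be used; otherwise the clone dereferences freed memory.

```cpp
#include <memory>
#include <string>

// Stand-in for the compound reader that owns the underlying bytes.
struct CompoundDir {
    std::string buffer = "ivfdata-bytes";
};

// Stand-in for a cloned CSIndexInput: non-owning view into the directory.
struct ClonedInput {
    const std::string* base = nullptr;  // raw pointer into CompoundDir::buffer
    char read(size_t i) const { return (*base)[i]; }
};

// The fix in words: the reader keeps a shared_ptr to the directory alongside
// the clone, so the buffer outlives every read through the clone.
struct AnnReader {
    std::shared_ptr<CompoundDir> dir;  // keeps the compound reader alive
    ClonedInput input;
    static AnnReader open() {
        AnnReader r;
        r.dir = std::make_shared<CompoundDir>();
        r.input.base = &r.dir->buffer;
        return r;
    }
};
```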
### FE - DDL Validation
- `AnnIndexPropertiesChecker.java`: Accept `ivf_on_disk` as a valid
index type; require `nlist` for both `ivf` and `ivf_on_disk`
### FAISS Submodule
- Updated `contrib/faiss` submodule to a version supporting
`PreadInvertedLists` with `RandomAccessReader`/`borrow()` interface
### Regression Tests
- `ivf_on_disk_index_test.groovy`: Tests for L2 distance, inner product,
missing nlist error, insufficient training points, larger datasets,
range search
- `create_ann_index_test.groovy`: Added `ivf_on_disk` CREATE INDEX test
case
- `create_tbl_with_ann_index_test.groovy`: Added CREATE TABLE with
`ivf_on_disk` (L2, IP, missing nlist error)
- Test data files: `ivf_on_disk_stream_load.csv`,
`ivf_on_disk_stream_load.json`, `ivf_on_disk_index_test.out`
## Usage
```sql
CREATE TABLE tbl (
    id INT NOT NULL,
    embedding ARRAY<FLOAT> NOT NULL,
    INDEX idx_emb (`embedding`) USING ANN PROPERTIES(
        "index_type" = "ivf_on_disk",
        "metric_type" = "l2_distance",
        "dim" = "128",
        "nlist" = "128"
    )
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 2
PROPERTIES ("replication_num" = "1");

-- Approximate nearest neighbor search
SELECT id, l2_distance_approximate(embedding, [1.0, 2.0, 3.0]) AS dist
FROM tbl ORDER BY dist LIMIT 10;
```
## Known Limitations
- Stream load to `ivf_on_disk` tables currently fails during index building
(the `FulltextIndexSearcherBuilder` path does not support the two-file format
yet). This is covered by a regression test that asserts the current failure
behavior.
---
be/src/common/config.cpp | 8 +
be/src/common/config.h | 5 +
be/src/common/metrics/doris_metrics.cpp | 12 +
be/src/common/metrics/doris_metrics.h | 6 +
be/src/exec/operator/olap_scan_operator.cpp | 5 +
be/src/exec/operator/olap_scan_operator.h | 3 +
be/src/exec/scan/olap_scanner.cpp | 5 +
be/src/exec/scan/vector_search_user_params.h | 2 +-
be/src/runtime/exec_env.h | 6 +
be/src/runtime/exec_env_init.cpp | 16 +
be/src/runtime/memory/cache_manager.cpp | 9 +
be/src/runtime/memory/cache_policy.h | 6 +-
be/src/runtime/runtime_state.h | 9 +-
be/src/storage/cache/ann_index_ivf_list_cache.cpp | 61 +++
be/src/storage/cache/ann_index_ivf_list_cache.h | 95 +++++
be/src/storage/cache/page_cache.cpp | 15 +
be/src/storage/cache/page_cache.h | 5 +
be/src/storage/index/ann/ann_index.cpp | 4 +
be/src/storage/index/ann/ann_index.h | 2 +-
be/src/storage/index/ann/ann_index_files.h | 1 +
be/src/storage/index/ann/ann_index_reader.cpp | 61 ++-
be/src/storage/index/ann/ann_index_reader.h | 9 +
be/src/storage/index/ann/ann_search_params.h | 33 +-
be/src/storage/index/ann/ann_topn_runtime.cpp | 2 +-
be/src/storage/index/ann/faiss_ann_index.cpp | 441 +++++++++++++++++++--
be/src/storage/index/ann/faiss_ann_index.h | 20 +-
be/src/storage/index/index_file_reader.h | 3 +
be/src/storage/index/index_file_writer.cpp | 9 +-
be/src/storage/olap_common.h | 3 +
be/src/storage/segment/segment_iterator.cpp | 9 +
conf/ubsan_ignorelist.txt | 6 +
contrib/faiss | 2 +-
.../doris/analysis/AnnIndexPropertiesChecker.java | 9 +-
.../java/org/apache/doris/qe/SessionVariable.java | 2 +-
gensrc/thrift/PaloInternalService.thrift | 2 +-
.../data/ann_index_p0/ivf_on_disk_index_test.out | 29 ++
.../data/ann_index_p0/ivf_on_disk_stream_load.csv | 6 +
.../data/ann_index_p0/ivf_on_disk_stream_load.json | 8 +
.../ann_index_p0/create_ann_index_test.groovy | 24 +-
.../create_tbl_with_ann_index_test.groovy | 62 +++
.../suites/ann_index_p0/ivf_index_test.groovy | 2 +-
.../ann_index_p0/ivf_on_disk_index_test.groovy | 231 +++++++++++
.../ann_index_p0/ivf_on_disk_stream_load.csv | 6 +
43 files changed, 1198 insertions(+), 56 deletions(-)
diff --git a/be/src/common/config.cpp b/be/src/common/config.cpp
index 54a65f7802a..269e2d2e0cc 100644
--- a/be/src/common/config.cpp
+++ b/be/src/common/config.cpp
@@ -1727,6 +1727,14 @@ DEFINE_mInt32(max_segment_partial_column_cache_size, "100");
DEFINE_mBool(enable_prefill_output_dbm_agg_cache_after_compaction, "true");
DEFINE_mBool(enable_prefill_all_dbm_agg_cache_after_compaction, "true");
+// Cache for ANN index IVF on-disk list data.
+// "70%" means 70% of the process available memory, not 70% of total machine memory.
+// With default mem_limit="90%", this is effectively about 63% (90% * 70%) of physical memory
+// visible to the process (considering cgroup limits).
+DEFINE_String(ann_index_ivf_list_cache_limit, "70%");
+// Stale sweep time for ANN index IVF list cache in seconds. 3600s is 1 hour.
+DEFINE_mInt32(ann_index_ivf_list_cache_stale_sweep_time_sec, "3600");
+
// Chunk size for ANN/vector index building per training/adding batch
// 1M By default.
DEFINE_mInt64(ann_index_build_chunk_size, "1000000");
diff --git a/be/src/common/config.h b/be/src/common/config.h
index 68a5f6dce1f..069483c4fb8 100644
--- a/be/src/common/config.h
+++ b/be/src/common/config.h
@@ -1780,6 +1780,11 @@ DECLARE_mInt64(max_csv_line_reader_output_buffer_size);
DECLARE_Int32(omp_threads_limit);
// The capacity of segment partial column cache, used to cache column readers
for each segment.
DECLARE_mInt32(max_segment_partial_column_cache_size);
+// Cache for ANN index IVF on-disk list data.
+// Default "70%" means 70% of the process available memory.
+DECLARE_String(ann_index_ivf_list_cache_limit);
+// Stale sweep time for ANN index IVF list cache in seconds.
+DECLARE_mInt32(ann_index_ivf_list_cache_stale_sweep_time_sec);
// Chunk size for ANN/vector index building per training/adding batch
DECLARE_mInt64(ann_index_build_chunk_size);
diff --git a/be/src/common/metrics/doris_metrics.cpp b/be/src/common/metrics/doris_metrics.cpp
index f9e0e8ace47..81f3469dc2c 100644
--- a/be/src/common/metrics/doris_metrics.cpp
+++ b/be/src/common/metrics/doris_metrics.cpp
@@ -255,6 +255,12 @@ DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_index_load_costs_ms, MetricUnit::MILLIS
DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_index_load_cnt, MetricUnit::NOUNIT);
DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_index_search_costs_ms,
MetricUnit::MILLISECONDS);
DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_index_search_cnt, MetricUnit::NOUNIT);
+DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_ivf_on_disk_fetch_page_costs_ms, MetricUnit::MILLISECONDS);
+DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_ivf_on_disk_fetch_page_cnt, MetricUnit::NOUNIT);
+DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_ivf_on_disk_search_costs_ms, MetricUnit::MILLISECONDS);
+DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_ivf_on_disk_search_cnt, MetricUnit::NOUNIT);
+DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_ivf_on_disk_cache_hit_cnt, MetricUnit::NOUNIT);
+DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_ivf_on_disk_cache_miss_cnt, MetricUnit::NOUNIT);
DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_index_in_memory_cnt,
MetricUnit::NOUNIT);
DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_index_in_memory_rows_cnt,
MetricUnit::ROWS);
DEFINE_COUNTER_METRIC_PROTOTYPE_2ARG(ann_index_construction,
MetricUnit::NOUNIT);
@@ -427,6 +433,12 @@ DorisMetrics::DorisMetrics() : _metric_registry(_s_registry_name) {
INT_COUNTER_METRIC_REGISTER(_server_metric_entity, ann_index_load_cnt);
INT_COUNTER_METRIC_REGISTER(_server_metric_entity,
ann_index_search_costs_ms);
INT_COUNTER_METRIC_REGISTER(_server_metric_entity, ann_index_search_cnt);
+ INT_COUNTER_METRIC_REGISTER(_server_metric_entity, ann_ivf_on_disk_fetch_page_costs_ms);
+ INT_COUNTER_METRIC_REGISTER(_server_metric_entity, ann_ivf_on_disk_fetch_page_cnt);
+ INT_COUNTER_METRIC_REGISTER(_server_metric_entity, ann_ivf_on_disk_search_costs_ms);
+ INT_COUNTER_METRIC_REGISTER(_server_metric_entity, ann_ivf_on_disk_search_cnt);
+ INT_COUNTER_METRIC_REGISTER(_server_metric_entity, ann_ivf_on_disk_cache_hit_cnt);
+ INT_COUNTER_METRIC_REGISTER(_server_metric_entity, ann_ivf_on_disk_cache_miss_cnt);
INT_COUNTER_METRIC_REGISTER(_server_metric_entity,
ann_index_in_memory_cnt);
INT_COUNTER_METRIC_REGISTER(_server_metric_entity,
ann_index_in_memory_rows_cnt);
INT_COUNTER_METRIC_REGISTER(_server_metric_entity, ann_index_construction);
diff --git a/be/src/common/metrics/doris_metrics.h b/be/src/common/metrics/doris_metrics.h
index 75d6f32d4d1..faa4830c2b1 100644
--- a/be/src/common/metrics/doris_metrics.h
+++ b/be/src/common/metrics/doris_metrics.h
@@ -260,6 +260,12 @@ public:
IntCounter* ann_index_load_cnt = nullptr;
IntCounter* ann_index_search_costs_ms = nullptr;
IntCounter* ann_index_search_cnt = nullptr;
+ IntCounter* ann_ivf_on_disk_fetch_page_costs_ms = nullptr;
+ IntCounter* ann_ivf_on_disk_fetch_page_cnt = nullptr;
+ IntCounter* ann_ivf_on_disk_search_costs_ms = nullptr;
+ IntCounter* ann_ivf_on_disk_search_cnt = nullptr;
+ IntCounter* ann_ivf_on_disk_cache_hit_cnt = nullptr;
+ IntCounter* ann_ivf_on_disk_cache_miss_cnt = nullptr;
IntCounter* ann_index_in_memory_cnt = nullptr;
IntCounter* ann_index_in_memory_rows_cnt = nullptr;
IntCounter* ann_index_construction = nullptr;
diff --git a/be/src/exec/operator/olap_scan_operator.cpp b/be/src/exec/operator/olap_scan_operator.cpp
index f0cc2157633..45c4728ba57 100644
--- a/be/src/exec/operator/olap_scan_operator.cpp
+++ b/be/src/exec/operator/olap_scan_operator.cpp
@@ -337,6 +337,11 @@ Status OlapScanLocalState::_init_profile() {
_ann_topn_engine_search_costs = ADD_CHILD_TIMER(
_segment_profile, "AnnIndexTopNEngineSearchCosts",
"AnnIndexTopNSearchCosts");
_ann_index_load_costs = ADD_TIMER(_segment_profile, "AnnIndexLoadCosts");
+ _ann_ivf_on_disk_load_costs = ADD_TIMER(_segment_profile, "AnnIvfOnDiskLoadCosts");
+ _ann_ivf_on_disk_cache_hit_cnt =
+ ADD_COUNTER(_segment_profile, "AnnIvfOnDiskCacheHitCnt", TUnit::UNIT);
+ _ann_ivf_on_disk_cache_miss_cnt =
+ ADD_COUNTER(_segment_profile, "AnnIvfOnDiskCacheMissCnt", TUnit::UNIT);
_ann_topn_post_process_costs = ADD_CHILD_TIMER(
_segment_profile, "AnnIndexTopNResultPostProcessCosts",
"AnnIndexTopNSearchCosts");
_ann_topn_pre_process_costs = ADD_CHILD_TIMER(
diff --git a/be/src/exec/operator/olap_scan_operator.h b/be/src/exec/operator/olap_scan_operator.h
index c03c317c73f..8f1da610a38 100644
--- a/be/src/exec/operator/olap_scan_operator.h
+++ b/be/src/exec/operator/olap_scan_operator.h
@@ -238,6 +238,9 @@ private:
RuntimeProfile::Counter* _ann_topn_search_cnt = nullptr;
RuntimeProfile::Counter* _ann_index_load_costs = nullptr;
+ RuntimeProfile::Counter* _ann_ivf_on_disk_load_costs = nullptr;
+ RuntimeProfile::Counter* _ann_ivf_on_disk_cache_hit_cnt = nullptr;
+ RuntimeProfile::Counter* _ann_ivf_on_disk_cache_miss_cnt = nullptr;
RuntimeProfile::Counter* _ann_topn_pre_process_costs = nullptr;
RuntimeProfile::Counter* _ann_topn_engine_search_costs = nullptr;
RuntimeProfile::Counter* _ann_topn_post_process_costs = nullptr;
diff --git a/be/src/exec/scan/olap_scanner.cpp b/be/src/exec/scan/olap_scanner.cpp
index e11bd1d7717..342b5febd59 100644
--- a/be/src/exec/scan/olap_scanner.cpp
+++ b/be/src/exec/scan/olap_scanner.cpp
@@ -918,6 +918,11 @@ void OlapScanner::_collect_profile_before_close() {
stats.rows_ann_index_range_filtered);
COUNTER_UPDATE(local_state->_ann_topn_filter_counter,
stats.rows_ann_index_topn_filtered);
COUNTER_UPDATE(local_state->_ann_index_load_costs,
stats.ann_index_load_ns);
+ COUNTER_UPDATE(local_state->_ann_ivf_on_disk_load_costs, stats.ann_ivf_on_disk_load_ns);
+ COUNTER_UPDATE(local_state->_ann_ivf_on_disk_cache_hit_cnt,
+ stats.ann_ivf_on_disk_cache_hit_cnt);
+ COUNTER_UPDATE(local_state->_ann_ivf_on_disk_cache_miss_cnt,
+ stats.ann_ivf_on_disk_cache_miss_cnt);
COUNTER_UPDATE(local_state->_ann_range_search_costs,
stats.ann_index_range_search_ns);
COUNTER_UPDATE(local_state->_ann_range_search_cnt,
stats.ann_index_range_search_cnt);
COUNTER_UPDATE(local_state->_ann_range_engine_search_costs,
stats.ann_range_engine_search_ns);
diff --git a/be/src/exec/scan/vector_search_user_params.h b/be/src/exec/scan/vector_search_user_params.h
index eb2a4f439cd..dfe1bf89880 100644
--- a/be/src/exec/scan/vector_search_user_params.h
+++ b/be/src/exec/scan/vector_search_user_params.h
@@ -26,7 +26,7 @@ struct VectorSearchUserParams {
int hnsw_ef_search = 32;
bool hnsw_check_relative_distance = true;
bool hnsw_bounded_queue = true;
- int ivf_nprobe = 1;
+ int ivf_nprobe = 32;
bool operator==(const VectorSearchUserParams& other) const;
diff --git a/be/src/runtime/exec_env.h b/be/src/runtime/exec_env.h
index 4164d5cadac..4d7872430cf 100644
--- a/be/src/runtime/exec_env.h
+++ b/be/src/runtime/exec_env.h
@@ -121,6 +121,7 @@ class TabletColumnObjectPool;
class UserFunctionCache;
class SchemaCache;
class StoragePageCache;
+class AnnIndexIVFListCache;
class SegmentLoader;
class LookupConnectionCache;
class RowCache;
@@ -340,6 +341,9 @@ public:
this->_tablet_column_object_pool = c;
}
void set_storage_page_cache(StoragePageCache* c) {
this->_storage_page_cache = c; }
+ void set_ann_index_ivf_list_cache(AnnIndexIVFListCache* c) {
+ this->_ann_index_ivf_list_cache = c;
+ }
void set_segment_loader(SegmentLoader* sl) { this->_segment_loader = sl; }
void set_routine_load_task_executor(RoutineLoadTaskExecutor* r) {
this->_routine_load_task_executor = r;
@@ -377,6 +381,7 @@ public:
TabletColumnObjectPool* get_tablet_column_object_pool() { return
_tablet_column_object_pool; }
SchemaCache* schema_cache() { return _schema_cache; }
StoragePageCache* get_storage_page_cache() { return _storage_page_cache; }
+ AnnIndexIVFListCache* get_ann_index_ivf_list_cache() { return _ann_index_ivf_list_cache; }
SegmentLoader* segment_loader() { return _segment_loader; }
LookupConnectionCache* get_lookup_connection_cache() { return
_lookup_connection_cache; }
RowCache* get_row_cache() { return _row_cache; }
@@ -536,6 +541,7 @@ private:
std::unique_ptr<BaseStorageEngine> _storage_engine;
SchemaCache* _schema_cache = nullptr;
StoragePageCache* _storage_page_cache = nullptr;
+ AnnIndexIVFListCache* _ann_index_ivf_list_cache = nullptr;
SegmentLoader* _segment_loader = nullptr;
LookupConnectionCache* _lookup_connection_cache = nullptr;
RowCache* _row_cache = nullptr;
diff --git a/be/src/runtime/exec_env_init.cpp b/be/src/runtime/exec_env_init.cpp
index 3b46ca53cb5..38fec145eef 100644
--- a/be/src/runtime/exec_env_init.cpp
+++ b/be/src/runtime/exec_env_init.cpp
@@ -97,6 +97,7 @@
#include "service/backend_options.h"
#include "service/backend_service.h"
#include "service/point_query_executor.h"
+#include "storage/cache/ann_index_ivf_list_cache.h"
#include "storage/cache/page_cache.h"
#include "storage/cache/schema_cache.h"
#include "storage/id_manager.h"
@@ -580,6 +581,20 @@ Status ExecEnv::init_mem_env() {
<< PrettyPrinter::print(storage_cache_limit, TUnit::BYTES)
<< ", origin config value: " << config::storage_page_cache_limit;
+ // Init ANN index IVF list cache (dedicated cache for IVF on disk)
+ {
+ int64_t ann_cache_limit = ParseUtil::parse_mem_spec(config::ann_index_ivf_list_cache_limit,
+ MemInfo::mem_limit(),
+ MemInfo::physical_mem(), &is_percent);
+ while (!is_percent && ann_cache_limit > MemInfo::mem_limit() / 2) {
+ ann_cache_limit = ann_cache_limit / 2;
+ }
+ _ann_index_ivf_list_cache = AnnIndexIVFListCache::create_global_cache(ann_cache_limit);
+ LOG(INFO) << "ANN index IVF list cache memory limit: "
+ << PrettyPrinter::print(ann_cache_limit, TUnit::BYTES)
+ << ", origin config value: " <<
config::ann_index_ivf_list_cache_limit;
+ }
+
// Init row cache
int64_t row_cache_mem_limit =
ParseUtil::parse_mem_spec(config::row_cache_mem_limit,
MemInfo::mem_limit(),
@@ -883,6 +898,7 @@ void ExecEnv::destroy() {
SAFE_DELETE(_tablet_column_object_pool);
// _storage_page_cache must be destoried before _cache_manager
+ SAFE_DELETE(_ann_index_ivf_list_cache);
SAFE_DELETE(_storage_page_cache);
SAFE_DELETE(_small_file_mgr);
diff --git a/be/src/runtime/memory/cache_manager.cpp b/be/src/runtime/memory/cache_manager.cpp
index 352025f7bc9..e3529fc33d6 100644
--- a/be/src/runtime/memory/cache_manager.cpp
+++ b/be/src/runtime/memory/cache_manager.cpp
@@ -73,6 +73,15 @@ int64_t CacheManager::for_each_cache_refresh_capacity(double adjust_weighted,
if (!cache_policy->enable_prune()) {
continue;
}
+ // TODO(hezhiqiang): Refactor this to a generic `enable_adjust_capacity` attribute
+ // in CachePolicy so each cache can declare whether it participates in dynamic
+ // capacity adjustment, instead of hardcoding the skip here.
+ // Skip ANN IVF list cache: its entries are loaded via expensive disk/network
+ // pread and should not be evicted by transient memory pressure fluctuations.
+ // It already has its own capacity limit and stale-sweep mechanism.
+ if (cache_policy->type() == CachePolicy::CacheType::ANN_INDEX_IVF_LIST_CACHE) {
+ continue;
+ }
cache_policy->adjust_capacity_weighted(adjust_weighted);
freed_size +=
cache_policy->profile()->get_counter("FreedMemory")->value();
if (cache_policy->profile()->get_counter("FreedMemory")->value() != 0
&& profile) {
diff --git a/be/src/runtime/memory/cache_policy.h b/be/src/runtime/memory/cache_policy.h
index 3d8b12d9966..671c137c1a5 100644
--- a/be/src/runtime/memory/cache_policy.h
+++ b/be/src/runtime/memory/cache_policy.h
@@ -57,6 +57,7 @@ public:
TABLET_COLUMN_OBJECT_POOL = 21,
SCHEMA_CLOUD_DICTIONARY_CACHE = 22,
CONDITION_CACHE = 23,
+ ANN_INDEX_IVF_LIST_CACHE = 24,
};
static std::string type_string(CacheType type) {
@@ -107,6 +108,8 @@ public:
return "SchemaCloudDictionaryCache";
case CacheType::CONDITION_CACHE:
return "ConditionCache";
+ case CacheType::ANN_INDEX_IVF_LIST_CACHE:
+ return "AnnIndexIVFListCache";
default:
throw Exception(Status::FatalError("not match type of cache policy
:{}",
static_cast<int>(type)));
@@ -136,7 +139,8 @@ public:
{"ForUTCacheNumber", CacheType::FOR_UT_CACHE_NUMBER},
{"QueryCache", CacheType::QUERY_CACHE},
{"TabletColumnObjectPool", CacheType::TABLET_COLUMN_OBJECT_POOL},
- {"ConditionCache", CacheType::CONDITION_CACHE}};
+ {"ConditionCache", CacheType::CONDITION_CACHE},
+ {"AnnIndexIVFListCache", CacheType::ANN_INDEX_IVF_LIST_CACHE}};
static CacheType string_to_type(std::string type) {
if (StringToType.contains(type)) {
diff --git a/be/src/runtime/runtime_state.h b/be/src/runtime/runtime_state.h
index 5445cfb1d03..e3fc0132e20 100644
--- a/be/src/runtime/runtime_state.h
+++ b/be/src/runtime/runtime_state.h
@@ -811,9 +811,12 @@ public:
void set_id_file_map();
VectorSearchUserParams get_vector_search_params() const {
- return VectorSearchUserParams(_query_options.hnsw_ef_search,
- _query_options.hnsw_check_relative_distance,
- _query_options.hnsw_bounded_queue, _query_options.ivf_nprobe);
+ VectorSearchUserParams params;
+ params.hnsw_ef_search = _query_options.hnsw_ef_search;
+ params.hnsw_check_relative_distance = _query_options.hnsw_check_relative_distance;
+ params.hnsw_bounded_queue = _query_options.hnsw_bounded_queue;
+ params.ivf_nprobe = _query_options.ivf_nprobe;
+ return params;
}
bool runtime_filter_wait_infinitely() const {
diff --git a/be/src/storage/cache/ann_index_ivf_list_cache.cpp b/be/src/storage/cache/ann_index_ivf_list_cache.cpp
new file mode 100644
index 00000000000..d07ea646eaa
--- /dev/null
+++ b/be/src/storage/cache/ann_index_ivf_list_cache.cpp
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "storage/cache/ann_index_ivf_list_cache.h"
+
+#include <glog/logging.h>
+
+#include "common/config.h"
+#include "runtime/memory/lru_cache_policy.h"
+
+namespace doris {
+
+AnnIndexIVFListCache* AnnIndexIVFListCache::_s_instance = nullptr;
+
+AnnIndexIVFListCache* AnnIndexIVFListCache::create_global_cache(size_t capacity,
+ uint32_t num_shards) {
+ DCHECK(_s_instance == nullptr);
+ _s_instance = new AnnIndexIVFListCache(capacity, num_shards);
+ return _s_instance;
+}
+
+void AnnIndexIVFListCache::destroy_global_cache() {
+ delete _s_instance;
+ _s_instance = nullptr;
+}
+
+AnnIndexIVFListCache::AnnIndexIVFListCache(size_t capacity, uint32_t num_shards) {
+ _cache = std::make_unique<CacheImpl>(capacity, num_shards);
+}
+
+bool AnnIndexIVFListCache::lookup(const CacheKey& key, PageCacheHandle* handle) {
+ auto* lru_handle = _cache->lookup(key.encode());
+ if (lru_handle == nullptr) {
+ return false;
+ }
+ *handle = PageCacheHandle(_cache.get(), lru_handle);
+ return true;
+}
+
+void AnnIndexIVFListCache::insert(const CacheKey& key, DataPage* page, PageCacheHandle* handle) {
+ CachePriority priority = CachePriority::NORMAL;
+ auto* lru_handle = _cache->insert(key.encode(), page, page->capacity(), 0, priority);
+ DCHECK(lru_handle != nullptr);
+ *handle = PageCacheHandle(_cache.get(), lru_handle);
+}
+
+} // namespace doris
diff --git a/be/src/storage/cache/ann_index_ivf_list_cache.h b/be/src/storage/cache/ann_index_ivf_list_cache.h
new file mode 100644
index 00000000000..9fc9a00e114
--- /dev/null
+++ b/be/src/storage/cache/ann_index_ivf_list_cache.h
@@ -0,0 +1,95 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <stddef.h>
+#include <stdint.h>
+
+#include <memory>
+#include <string>
+#include <utility>
+
+#include "runtime/memory/lru_cache_policy.h"
+#include "storage/cache/page_cache.h"
+
+namespace doris {
+
+// Dedicated LRU cache for IVF on-disk inverted list data.
+//
+// Each cache entry corresponds to one IVF list's codes or ids region,
+// keyed by (file-prefix, file-size, region-offset). This gives perfect
+// cache alignment with the IVF access pattern: every search probes a
+// fixed set of lists and each list's codes/ids are always the same
+// (offset, size) pair, so repeated queries hit the cache without any
+// partial-block copies.
+//
+// Separated from StoragePageCache so that:
+// 1. IVF list data does not compete with column-data / index pages.
+// 2. Capacity can be tuned independently (default: 70% of physical memory
+// when ann_index_ivf_list_cache_limit == "0").
+class AnnIndexIVFListCache {
+public:
+ // Reuse the same CacheKey format as StoragePageCache for consistency.
+ // offset field stores the exact byte offset of the list's codes or ids.
+ using CacheKey = StoragePageCache::CacheKey;
+
+ class CacheImpl : public LRUCachePolicy {
+ public:
+ CacheImpl(size_t capacity, uint32_t num_shards)
+ : LRUCachePolicy(CachePolicy::CacheType::ANN_INDEX_IVF_LIST_CACHE, capacity,
+ LRUCacheType::SIZE,
+ config::ann_index_ivf_list_cache_stale_sweep_time_sec, num_shards,
+ /*element_count_capacity*/ 0, /*enable_prune*/ true,
+ /*is lru-k*/ false) {}
+ };
+
+ static constexpr uint32_t kDefaultNumShards = 16;
+
+ // --------------- singleton ---------------
+ static AnnIndexIVFListCache* instance() { return _s_instance; }
+
+ static AnnIndexIVFListCache* create_global_cache(size_t capacity,
+ uint32_t num_shards = kDefaultNumShards);
+ static void destroy_global_cache();
+
+ // --------------- ctor / dtor ---------------
+ AnnIndexIVFListCache(size_t capacity, uint32_t num_shards);
+ ~AnnIndexIVFListCache() = default;
+
+ AnnIndexIVFListCache(const AnnIndexIVFListCache&) = delete;
+ AnnIndexIVFListCache& operator=(const AnnIndexIVFListCache&) = delete;
+
+ // --------------- operations ---------------
+
+ // Lookup a cached entry. Returns true on hit; `handle` is populated and
+ // the caller must let it go out of scope (or call reset()) to unpin.
+ bool lookup(const CacheKey& key, PageCacheHandle* handle);
+
+ // Insert an entry into the cache. On success the cache takes ownership of
+ // `page` (via the handle). Caller must release() the unique_ptr.
+ void insert(const CacheKey& key, DataPage* page, PageCacheHandle* handle);
+
+ std::shared_ptr<MemTrackerLimiter> mem_tracker() { return _cache->mem_tracker(); }
+
+private:
+ static AnnIndexIVFListCache* _s_instance;
+
+ std::unique_ptr<CacheImpl> _cache;
+};
+
+} // namespace doris
diff --git a/be/src/storage/cache/page_cache.cpp b/be/src/storage/cache/page_cache.cpp
index 52c3dad8e04..c71837f9de4 100644
--- a/be/src/storage/cache/page_cache.cpp
+++ b/be/src/storage/cache/page_cache.cpp
@@ -38,6 +38,11 @@ MemoryTrackedPageBase<T>::MemoryTrackedPageBase(size_t size, bool use_cache,
}
}
+template <typename T>
+MemoryTrackedPageBase<T>::MemoryTrackedPageBase(size_t size,
+ std::shared_ptr<MemTrackerLimiter> mem_tracker)
+ : _size(size), _mem_tracker_by_allocator(std::move(mem_tracker)) {}
+
MemoryTrackedPageWithPageEntity::MemoryTrackedPageWithPageEntity(size_t size,
bool use_cache,
segment_v2::PageTypePB page_type)
: MemoryTrackedPageBase<char*>(size, use_cache, page_type),
_capacity(size) {
@@ -48,6 +53,16 @@ MemoryTrackedPageWithPageEntity::MemoryTrackedPageWithPageEntity(size_t size, bo
}
}
+MemoryTrackedPageWithPageEntity::MemoryTrackedPageWithPageEntity(
+ size_t size, std::shared_ptr<MemTrackerLimiter> mem_tracker)
+ : MemoryTrackedPageBase<char*>(size, std::move(mem_tracker)), _capacity(size) {
+ {
+ SCOPED_SWITCH_THREAD_MEM_TRACKER_LIMITER(this->_mem_tracker_by_allocator);
+ this->_data = reinterpret_cast<char*>(
+ Allocator<false>::alloc(this->_capacity, ALLOCATOR_ALIGNMENT_16));
+ }
+}
+
MemoryTrackedPageWithPageEntity::~MemoryTrackedPageWithPageEntity() {
if (this->_data != nullptr) {
DCHECK(this->_capacity != 0 && this->_size != 0);
diff --git a/be/src/storage/cache/page_cache.h b/be/src/storage/cache/page_cache.h
index 8f37548a9c0..1cc98c5cb49 100644
--- a/be/src/storage/cache/page_cache.h
+++ b/be/src/storage/cache/page_cache.h
@@ -42,6 +42,9 @@ class MemoryTrackedPageBase : public LRUCacheValueBase {
public:
MemoryTrackedPageBase() = default;
MemoryTrackedPageBase(size_t b, bool use_cache, segment_v2::PageTypePB
page_type);
+ // Construct with an explicit mem-tracker (for caches outside StoragePageCache,
+ // e.g. AnnIndexIVFListCache).
+ MemoryTrackedPageBase(size_t b, std::shared_ptr<MemTrackerLimiter> mem_tracker);
MemoryTrackedPageBase(const MemoryTrackedPageBase&) = delete;
MemoryTrackedPageBase& operator=(const MemoryTrackedPageBase&) = delete;
@@ -59,6 +62,8 @@ protected:
class MemoryTrackedPageWithPageEntity : Allocator<false>, public
MemoryTrackedPageBase<char*> {
public:
MemoryTrackedPageWithPageEntity(size_t b, bool use_cache,
segment_v2::PageTypePB page_type);
+ // Construct with an explicit mem-tracker.
+ MemoryTrackedPageWithPageEntity(size_t b, std::shared_ptr<MemTrackerLimiter> mem_tracker);
size_t capacity() { return this->_capacity; }
diff --git a/be/src/storage/index/ann/ann_index.cpp b/be/src/storage/index/ann/ann_index.cpp
index 7fca0d1ad16..6b2f2849656 100644
--- a/be/src/storage/index/ann/ann_index.cpp
+++ b/be/src/storage/index/ann/ann_index.cpp
@@ -51,6 +51,8 @@ std::string ann_index_type_to_string(AnnIndexType type) {
return "hnsw";
case AnnIndexType::IVF:
return "ivf";
+ case AnnIndexType::IVF_ON_DISK:
+ return "ivf_on_disk";
default:
return "unknown";
}
@@ -61,6 +63,8 @@ AnnIndexType string_to_ann_index_type(const std::string& type) {
return AnnIndexType::HNSW;
} else if (type == "ivf") {
return AnnIndexType::IVF;
+ } else if (type == "ivf_on_disk") {
+ return AnnIndexType::IVF_ON_DISK;
} else {
return AnnIndexType::UNKNOWN;
}
diff --git a/be/src/storage/index/ann/ann_index.h b/be/src/storage/index/ann/ann_index.h
index da83fb4c584..497e46c8e55 100644
--- a/be/src/storage/index/ann/ann_index.h
+++ b/be/src/storage/index/ann/ann_index.h
@@ -55,7 +55,7 @@ std::string metric_to_string(AnnIndexMetric metric);
AnnIndexMetric string_to_metric(const std::string& metric);
-enum class AnnIndexType { UNKNOWN, HNSW, IVF };
+enum class AnnIndexType { UNKNOWN, HNSW, IVF, IVF_ON_DISK };
std::string ann_index_type_to_string(AnnIndexType type);
diff --git a/be/src/storage/index/ann/ann_index_files.h b/be/src/storage/index/ann/ann_index_files.h
index c4bdd96dbbc..846ac7b6a06 100644
--- a/be/src/storage/index/ann/ann_index_files.h
+++ b/be/src/storage/index/ann/ann_index_files.h
@@ -23,5 +23,6 @@
namespace doris::segment_v2 {
inline constexpr char faiss_index_fila_name[] = "ann.faiss";
+inline constexpr char faiss_ivfdata_file_name[] = "ann.ivfdata";
} // namespace doris::segment_v2
diff --git a/be/src/storage/index/ann/ann_index_reader.cpp b/be/src/storage/index/ann/ann_index_reader.cpp
index 5fb94677f40..8844e8aaec9 100644
--- a/be/src/storage/index/ann/ann_index_reader.cpp
+++ b/be/src/storage/index/ann/ann_index_reader.cpp
@@ -77,8 +77,7 @@ Status AnnIndexReader::load_index(io::IOContext* io_ctx) {
// An exception will be thrown if loading fails
     RETURN_IF_ERROR(_index_file_reader->init(config::inverted_index_read_buffer_size, io_ctx));
-    Result<std::unique_ptr<DorisCompoundReader, DirectoryDeleter>> compound_dir;
-    compound_dir = _index_file_reader->open(&_index_meta, io_ctx);
+    auto compound_dir = _index_file_reader->open(&_index_meta, io_ctx);
     if (!compound_dir.has_value()) {
         return Status::IOError("Failed to open index file: {}",
                                compound_dir.error().to_string());
@@ -86,7 +85,20 @@ Status AnnIndexReader::load_index(io::IOContext* io_ctx) {
     _vector_index = std::make_unique<FaissVectorIndex>();
     _vector_index->set_metric(_metric_type);
     _vector_index->set_type(_index_type);
+    // Provide a cache key prefix so IVF_ON_DISK can cache ivfdata blocks in
+    // StoragePageCache. Use cache_key (which includes index_id) rather than
+    // file_path, because the idx file is a compound file shared by multiple indexes.
+    static_cast<FaissVectorIndex*>(_vector_index.get())
+            ->set_ivfdata_cache_key_prefix(
+                    _index_file_reader->get_index_file_cache_key(&_index_meta));
     RETURN_IF_ERROR(_vector_index->load(compound_dir->get()));
+    // Keep the compound directory alive. For IVF_ON_DISK the
+    // CachedRandomAccessReader holds a cloned CSIndexInput whose `base` raw
+    // pointer references the compound reader's underlying stream. Destroying
+    // the directory would make `base` dangling.
+    // For other types this is harmless (just holds a file handle).
+    _compound_dir = std::move(*compound_dir);
} catch (CLuceneError& err) {
LOG_ERROR("Failed to load ann index: {}", err.what());
return Status::Error<ErrorCode::INVERTED_INDEX_CLUCENE_ERROR>(
@@ -123,6 +135,7 @@ Status AnnIndexReader::query(io::IOContext* io_ctx, AnnTopNParam* param, AnnInde
         HNSWSearchParameters hnsw_search_params;
         hnsw_search_params.roaring = param->roaring;
         hnsw_search_params.rows_of_segment = param->rows_of_segment;
+        hnsw_search_params.io_ctx = io_ctx;
         hnsw_search_params.ef_search = param->_user_params.hnsw_ef_search;
         hnsw_search_params.check_relative_distance =
                 param->_user_params.hnsw_check_relative_distance;
@@ -133,10 +146,11 @@ Status AnnIndexReader::query(io::IOContext* io_ctx, AnnTopNParam* param, AnnInde
         stats->engine_search_ns.update(index_search_result.engine_search_ns);
         stats->engine_convert_ns.update(index_search_result.engine_convert_ns);
         stats->engine_prepare_ns.update(index_search_result.engine_prepare_ns);
-    } else if (_index_type == AnnIndexType::IVF) {
+    } else if (_index_type == AnnIndexType::IVF || _index_type == AnnIndexType::IVF_ON_DISK) {
         IVFSearchParameters ivf_search_params;
         ivf_search_params.roaring = param->roaring;
         ivf_search_params.rows_of_segment = param->rows_of_segment;
+        ivf_search_params.io_ctx = io_ctx;
         ivf_search_params.nprobe = param->_user_params.ivf_nprobe;
         RETURN_IF_ERROR(_vector_index->ann_topn_search(query_vec, limit, ivf_search_params,
                                                        index_search_result));
@@ -144,14 +158,24 @@ Status AnnIndexReader::query(io::IOContext* io_ctx, AnnTopNParam* param, AnnInde
         stats->engine_search_ns.update(index_search_result.engine_search_ns);
         stats->engine_convert_ns.update(index_search_result.engine_convert_ns);
         stats->engine_prepare_ns.update(index_search_result.engine_prepare_ns);
+        if (_index_type == AnnIndexType::IVF_ON_DISK) {
+            stats->ivf_on_disk_cache_hit_cnt.update(
+                    index_search_result.ivf_on_disk_cache_hit_cnt);
+            stats->ivf_on_disk_cache_miss_cnt.update(
+                    index_search_result.ivf_on_disk_cache_miss_cnt);
+            DorisMetrics::instance()->ann_ivf_on_disk_cache_hit_cnt->increment(
+                    index_search_result.ivf_on_disk_cache_hit_cnt);
+            DorisMetrics::instance()->ann_ivf_on_disk_cache_miss_cnt->increment(
+                    index_search_result.ivf_on_disk_cache_miss_cnt);
+        }
     } else {
         throw Exception(Status::NotSupported("Unsupported index type: {}",
                                              ann_index_type_to_string(_index_type)));
     }
-    DCHECK(index_search_result.roaring != nullptr);
-    DCHECK(index_search_result.distances != nullptr);
-    DCHECK(index_search_result.row_ids != nullptr);
+    DORIS_CHECK(index_search_result.roaring != nullptr);
+    DORIS_CHECK(index_search_result.distances != nullptr);
+    DORIS_CHECK(index_search_result.row_ids != nullptr);
param->distance = std::make_unique<std::vector<float>>();
{
SCOPED_TIMER(&(stats->result_process_costs_ns));
@@ -163,6 +187,13 @@ Status AnnIndexReader::query(io::IOContext* io_ctx, AnnTopNParam* param, AnnInde
     double search_costs_ms = static_cast<double>(stats->search_costs_ns.value()) / 1000.0;
     DorisMetrics::instance()->ann_index_search_costs_ms->increment(
             static_cast<int64_t>(search_costs_ms));
+    if (_index_type == AnnIndexType::IVF_ON_DISK) {
+        stats->ivf_on_disk_search_costs_ns.update(stats->search_costs_ns.value());
+        stats->ivf_on_disk_search_cnt.update(1);
+        DorisMetrics::instance()->ann_ivf_on_disk_search_costs_ms->increment(
+                static_cast<int64_t>(search_costs_ms));
+        DorisMetrics::instance()->ann_ivf_on_disk_search_cnt->increment(1);
+    }
return Status::OK();
}
@@ -187,7 +218,7 @@ Status AnnIndexReader::range_search(const AnnRangeSearchParams& params,
         hnsw_param->check_relative_distance = custom_params.hnsw_check_relative_distance;
         hnsw_param->bounded_queue = custom_params.hnsw_bounded_queue;
         search_param = std::move(hnsw_param);
-    } else if (_index_type == AnnIndexType::IVF) {
+    } else if (_index_type == AnnIndexType::IVF || _index_type == AnnIndexType::IVF_ON_DISK) {
         auto ivf_param = std::make_unique<segment_v2::IVFSearchParameters>();
         ivf_param->nprobe = custom_params.ivf_nprobe;
         search_param = std::move(ivf_param);
@@ -198,6 +229,7 @@ Status AnnIndexReader::range_search(const AnnRangeSearchParams& params,
     search_param->is_le_or_lt = params.is_le_or_lt;
     search_param->roaring = params.roaring;
+    search_param->io_ctx = io_ctx;
     DCHECK(search_param->roaring != nullptr);
     RETURN_IF_ERROR(_vector_index->range_search(params.query_value, params.radius,
@@ -206,6 +238,14 @@ Status AnnIndexReader::range_search(const AnnRangeSearchParams& params,
     stats->engine_prepare_ns.update(search_result.engine_prepare_ns);
     stats->engine_search_ns.update(search_result.engine_search_ns);
     stats->engine_convert_ns.update(search_result.engine_convert_ns);
+    if (_index_type == AnnIndexType::IVF_ON_DISK) {
+        stats->ivf_on_disk_cache_hit_cnt.update(search_result.ivf_on_disk_cache_hit_cnt);
+        stats->ivf_on_disk_cache_miss_cnt.update(search_result.ivf_on_disk_cache_miss_cnt);
+        DorisMetrics::instance()->ann_ivf_on_disk_cache_hit_cnt->increment(
+                search_result.ivf_on_disk_cache_hit_cnt);
+        DorisMetrics::instance()->ann_ivf_on_disk_cache_miss_cnt->increment(
+                search_result.ivf_on_disk_cache_miss_cnt);
+    }
DCHECK(search_result.roaring != nullptr);
result->roaring = search_result.roaring;
@@ -243,6 +283,13 @@ Status AnnIndexReader::range_search(const AnnRangeSearchParams& params,
     double search_costs_ms = static_cast<double>(stats->search_costs_ns.value()) / 1000.0;
     DorisMetrics::instance()->ann_index_search_costs_ms->increment(
             static_cast<int64_t>(search_costs_ms));
+    if (_index_type == AnnIndexType::IVF_ON_DISK) {
+        stats->ivf_on_disk_search_costs_ns.update(stats->search_costs_ns.value());
+        stats->ivf_on_disk_search_cnt.update(1);
+        DorisMetrics::instance()->ann_ivf_on_disk_search_costs_ms->increment(
+                static_cast<int64_t>(search_costs_ms));
+        DorisMetrics::instance()->ann_ivf_on_disk_search_cnt->increment(1);
+    }
return Status::OK();
}
diff --git a/be/src/storage/index/ann/ann_index_reader.h b/be/src/storage/index/ann/ann_index_reader.h
index 7ab677fc279..0f2d5839e72 100644
--- a/be/src/storage/index/ann/ann_index_reader.h
+++ b/be/src/storage/index/ann/ann_index_reader.h
@@ -21,6 +21,8 @@
#include "storage/index/ann/ann_index.h"
#include "storage/index/ann/ann_search_params.h"
#include "storage/index/index_reader.h"
+#include "storage/index/inverted/inverted_index_common.h"
+#include "storage/index/inverted/inverted_index_compound_reader.h"
#include "storage/tablet/tablet_schema.h"
#include "util/once.h"
namespace doris::segment_v2 {
@@ -68,6 +70,13 @@ public:
 private:
     TabletIndex _index_meta;
     std::shared_ptr<IndexFileReader> _index_file_reader;
+    // IMPORTANT: _compound_dir MUST be declared before _vector_index so that it is
+    // destroyed AFTER _vector_index (C++ destroys members in reverse declaration order).
+    // For IVF_ON_DISK, the CachedRandomAccessReader inside _vector_index holds a cloned
+    // CSIndexInput whose `base` raw pointer references the compound directory's
+    // underlying stream. If _compound_dir were destroyed first, that `base` would
+    // dangle, causing a use-after-free when ~CachedRandomAccessReader() calls
+    // _input->close().
+    std::unique_ptr<DorisCompoundReader, DirectoryDeleter> _compound_dir;
std::unique_ptr<VectorIndex> _vector_index;
AnnIndexType _index_type;
AnnIndexMetric _metric_type;
diff --git a/be/src/storage/index/ann/ann_search_params.h b/be/src/storage/index/ann/ann_search_params.h
index 37ba6f875e4..4cccac296aa 100644
--- a/be/src/storage/index/ann/ann_search_params.h
+++ b/be/src/storage/index/ann/ann_search_params.h
@@ -39,6 +39,10 @@
#include "exec/scan/vector_search_user_params.h"
#include "runtime/runtime_profile.h"
+namespace doris::io {
+struct IOContext;
+} // namespace doris::io
+
namespace doris::segment_v2 {
#include "common/compile_check_begin.h"
@@ -50,6 +54,11 @@ struct AnnIndexStats {
result_process_costs_ns(TUnit::TIME_NS, 0),
engine_convert_ns(TUnit::TIME_NS, 0),
engine_prepare_ns(TUnit::TIME_NS, 0),
+ ivf_on_disk_load_costs_ns(TUnit::TIME_NS, 0),
+ ivf_on_disk_search_costs_ns(TUnit::TIME_NS, 0),
+ ivf_on_disk_search_cnt(TUnit::UNIT, 0),
+ ivf_on_disk_cache_hit_cnt(TUnit::UNIT, 0),
+ ivf_on_disk_cache_miss_cnt(TUnit::UNIT, 0),
fall_back_brute_force_cnt(0) {}
AnnIndexStats(const AnnIndexStats& other)
@@ -59,6 +68,12 @@ struct AnnIndexStats {
               result_process_costs_ns(TUnit::TIME_NS, other.result_process_costs_ns.value()),
               engine_convert_ns(TUnit::TIME_NS, other.engine_convert_ns.value()),
               engine_prepare_ns(TUnit::TIME_NS, other.engine_prepare_ns.value()),
+              ivf_on_disk_load_costs_ns(TUnit::TIME_NS, other.ivf_on_disk_load_costs_ns.value()),
+              ivf_on_disk_search_costs_ns(TUnit::TIME_NS,
+                                          other.ivf_on_disk_search_costs_ns.value()),
+              ivf_on_disk_search_cnt(TUnit::UNIT, other.ivf_on_disk_search_cnt.value()),
+              ivf_on_disk_cache_hit_cnt(TUnit::UNIT, other.ivf_on_disk_cache_hit_cnt.value()),
+              ivf_on_disk_cache_miss_cnt(TUnit::UNIT, other.ivf_on_disk_cache_miss_cnt.value()),
               fall_back_brute_force_cnt(other.fall_back_brute_force_cnt) {}
AnnIndexStats& operator=(const AnnIndexStats& other) {
@@ -69,6 +84,11 @@ struct AnnIndexStats {
             result_process_costs_ns.set(other.result_process_costs_ns.value());
             engine_convert_ns.set(other.engine_convert_ns.value());
             engine_prepare_ns.set(other.engine_prepare_ns.value());
+            ivf_on_disk_load_costs_ns.set(other.ivf_on_disk_load_costs_ns.value());
+            ivf_on_disk_search_costs_ns.set(other.ivf_on_disk_search_costs_ns.value());
+            ivf_on_disk_search_cnt.set(other.ivf_on_disk_search_cnt.value());
+            ivf_on_disk_cache_hit_cnt.set(other.ivf_on_disk_cache_hit_cnt.value());
+            ivf_on_disk_cache_miss_cnt.set(other.ivf_on_disk_cache_miss_cnt.value());
             fall_back_brute_force_cnt = other.fall_back_brute_force_cnt;
}
return *this;
@@ -80,7 +100,12 @@ struct AnnIndexStats {
     RuntimeProfile::Counter result_process_costs_ns; // time cost of processing search results
     RuntimeProfile::Counter engine_convert_ns;       // time cost of engine-side conversions
     RuntimeProfile::Counter
-            engine_prepare_ns; // time cost before engine search (allocations, setup)
+            engine_prepare_ns; // time cost before engine search (allocations, setup)
+    RuntimeProfile::Counter ivf_on_disk_load_costs_ns;   // IVF_ON_DISK index load costs
+    RuntimeProfile::Counter ivf_on_disk_search_costs_ns; // IVF_ON_DISK search costs
+    RuntimeProfile::Counter ivf_on_disk_search_cnt;      // IVF_ON_DISK search count
+    RuntimeProfile::Counter ivf_on_disk_cache_hit_cnt;   // IVF_ON_DISK cache hit count
+    RuntimeProfile::Counter ivf_on_disk_cache_miss_cnt;  // IVF_ON_DISK cache miss count
     int64_t fall_back_brute_force_cnt; // fallback count when ANN range search is bypassed
 };
@@ -132,12 +157,18 @@ struct IndexSearchResult {
     int64_t engine_search_ns = 0;  // time spent in the underlying index search call
     int64_t engine_convert_ns = 0; // time spent building selectors/results inside the engine
     int64_t engine_prepare_ns = 0; // time spent preparing buffers before engine search
+    int64_t ivf_on_disk_cache_hit_cnt = 0;
+    int64_t ivf_on_disk_cache_miss_cnt = 0;
 };
struct IndexSearchParameters {
roaring::Roaring* roaring = nullptr;
bool is_le_or_lt = true;
size_t rows_of_segment = 0;
+ // Caller-owned IOContext, valid for the duration of the search.
+ // Propagated via thread_local to CachedRandomAccessReader so that
+ // file-cache statistics are attributed to the correct query.
+ const io::IOContext* io_ctx = nullptr;
virtual ~IndexSearchParameters() = default;
};
diff --git a/be/src/storage/index/ann/ann_topn_runtime.cpp b/be/src/storage/index/ann/ann_topn_runtime.cpp
index 488dff5f6a3..6ad413d799a 100644
--- a/be/src/storage/index/ann/ann_topn_runtime.cpp
+++ b/be/src/storage/index/ann/ann_topn_runtime.cpp
@@ -187,7 +187,7 @@ Status AnnTopNRuntime::prepare(RuntimeState* state, const RowDescriptor& row_des
_metric_type = segment_v2::string_to_metric(metric_name);
- VLOG_DEBUG << "AnnTopNRuntime: {}" << this->debug_string();
+ VLOG_DEBUG << fmt::format("AnnTopNRuntime: {}", this->debug_string());
return Status::OK();
}
diff --git a/be/src/storage/index/ann/faiss_ann_index.cpp b/be/src/storage/index/ann/faiss_ann_index.cpp
index 448dcee5749..097a5ce26d6 100644
--- a/be/src/storage/index/ann/faiss_ann_index.cpp
+++ b/be/src/storage/index/ann/faiss_ann_index.cpp
@@ -18,6 +18,9 @@
#include "storage/index/ann/faiss_ann_index.h"
#include <faiss/index_io.h>
+#include <faiss/invlists/OnDiskInvertedLists.h>
+#include <faiss/invlists/PreadInvertedLists.h>
+#include <gen_cpp/segment_v2.pb.h>
#include <omp.h>
#include <pthread.h>
@@ -50,6 +53,9 @@
#include "faiss/impl/FaissException.h"
#include "faiss/impl/IDSelector.h"
#include "faiss/impl/io.h"
+#include "io/io_common.h"
+#include "storage/cache/ann_index_ivf_list_cache.h"
+#include "storage/cache/page_cache.h"
#include "storage/index/ann/ann_index.h"
#include "storage/index/ann/ann_index_files.h"
#include "storage/index/ann/ann_search_params.h"
@@ -65,6 +71,30 @@ std::mutex g_omp_thread_mutex;
std::condition_variable g_omp_thread_cv;
int g_index_threads_in_use = 0;
+struct IvfOnDiskCacheStats {
+ int64_t hit_cnt = 0;
+ int64_t miss_cnt = 0;
+};
+
+thread_local IvfOnDiskCacheStats g_ivf_on_disk_cache_stats;
+
+// Per-thread IOContext pointer, set by ScopedIoCtxBinding around FAISS
+// search calls so that CachedRandomAccessReader::_read_clucene() can
+// forward the caller's IOContext to the CLucene IndexInput. This is
+// the same thread_local trick used for g_ivf_on_disk_cache_stats:
+// borrow() is a FAISS callback with a fixed signature, so there is no
+// way to pass per-query context through the call chain.
+thread_local const io::IOContext* g_current_io_ctx = nullptr;
+
+// RAII guard that binds g_current_io_ctx for the lifetime of a search.
+class ScopedIoCtxBinding {
+public:
+    explicit ScopedIoCtxBinding(const io::IOContext* ctx) { g_current_io_ctx = ctx; }
+    ~ScopedIoCtxBinding() { g_current_io_ctx = nullptr; }
+ ScopedIoCtxBinding(const ScopedIoCtxBinding&) = delete;
+ ScopedIoCtxBinding& operator=(const ScopedIoCtxBinding&) = delete;
+};
+
// Guard that ensures the total OpenMP threads used by concurrent index builds
// never exceed the configured omp_threads_limit.
class ScopedOmpThreadBudget {
@@ -155,6 +185,15 @@ private:
const roaring::Roaring* _roaring;
};
+// TODO(dynamic nprobe): SPANN-style query-aware dynamic nprobe pruning.
+// After coarse quantization, if the distance from the query vector to subsequent
+// centroids is much larger than the distance to the closest centroid, those
+// far-away clusters are unlikely to contain true nearest neighbors and can be
+// skipped. This adaptively determines the actual nprobe per query based on a
+// distance ratio threshold, avoiding unnecessary inverted list scans.
+// Reference: Chen et al., "SPANN: Highly-efficient Billion-scale Approximate
+// Nearest Neighbor Search", NeurIPS 2021.
+
} // namespace
 std::unique_ptr<faiss::IDSelector> FaissVectorIndex::roaring_to_faiss_selector(
         const roaring::Roaring& roaring) {
@@ -251,6 +290,148 @@ public:
lucene::store::IndexInput* _input = nullptr;
};
+/**
+ * RandomAccessReader backed by a CLucene IndexInput with per-range caching.
+ *
+ * Cache granularity matches the IVF access pattern exactly: one entry per
+ * list's codes region and one per list's ids region, keyed by
+ * (file_prefix, file_size, byte_offset). Every repeated borrow() on the
+ * same list is a guaranteed zero-copy cache hit.
+ *
+ * Thread-safety: cache lookups / inserts are lock-free (the LRU cache is
+ * internally sharded). Disk reads are serialised by _io_mutex because
+ * the CLucene IndexInput is stateful (seek + readBytes).
+ */
+struct CachedRandomAccessReader : faiss::RandomAccessReader {
+    CachedRandomAccessReader(lucene::store::IndexInput* input, std::string cache_key_prefix,
+                             size_t file_size)
+            : _input(input->clone()),
+              _cache_key_prefix(std::move(cache_key_prefix)),
+              _file_size(file_size) {
+        // Clear the inherited IOContext to prevent use-after-free on the
+        // caller's stale pointer. Per-read io_ctx is bound from
+        // g_current_io_ctx inside _read_clucene().
+        _input->setIoContext(nullptr);
+    }
+
+ ~CachedRandomAccessReader() override {
+ if (_input != nullptr) {
+ _input->close();
+ _CLDELETE(_input);
+ }
+ }
+
+ // ---- faiss::RandomAccessReader interface ----
+
+ void read_at(size_t offset, void* ptr, size_t nbytes) const override {
+ if (nbytes == 0) {
+ return;
+ }
+ auto ref = borrow(offset, nbytes);
+ DCHECK(ref != nullptr);
+ ::memcpy(ptr, ref->data(), nbytes);
+ }
+
+    std::unique_ptr<faiss::ReadRef> borrow(size_t offset, size_t nbytes) const override {
+ if (nbytes == 0) {
+ return nullptr;
+ }
+
+ auto* cache = AnnIndexIVFListCache::instance();
+ if (!cache) {
+ return RandomAccessReader::borrow(offset, nbytes);
+ }
+
+ AnnIndexIVFListCache::CacheKey key(_cache_key_prefix, _file_size,
+ static_cast<int64_t>(offset));
+
+ // Fast path: cache hit — zero-copy, lock-free.
+ PageCacheHandle handle;
+ if (cache->lookup(key, &handle)) {
+ ++g_ivf_on_disk_cache_stats.hit_cnt;
+ return _make_pinned_ref(std::move(handle), nbytes);
+ }
+
+ // Slow path: cache miss — read from disk, then insert.
+ auto page = _fetch_from_disk(offset, nbytes, cache->mem_tracker());
+ cache->insert(key, page.get(), &handle);
+ page.release(); // ownership transferred to cache
+ return _make_pinned_ref(std::move(handle), nbytes);
+ }
+
+private:
+ // ---- ReadRef that pins a cache entry ----
+
+    struct PinnedReadRef : faiss::ReadRef {
+        explicit PinnedReadRef(PageCacheHandle handle, const uint8_t* ptr, size_t len)
+                : _handle(std::move(handle)) {
+            data_ = ptr;
+            size_ = len;
+        }
+
+ private:
+ PageCacheHandle _handle;
+ };
+
+    static std::unique_ptr<faiss::ReadRef> _make_pinned_ref(PageCacheHandle handle,
+                                                            size_t nbytes) {
+        Slice data = handle.data();
+        return std::make_unique<PinnedReadRef>(
+                std::move(handle), reinterpret_cast<const uint8_t*>(data.data), nbytes);
+    }
+
+ // ---- Disk I/O + metrics ----
+
+    /// Read a region from CLucene, wrapped in a DataPage, and record fetch metrics.
+    std::unique_ptr<DataPage> _fetch_from_disk(
+            size_t offset, size_t nbytes, std::shared_ptr<MemTrackerLimiter> mem_tracker) const {
+        auto page = std::make_unique<DataPage>(nbytes, std::move(mem_tracker));
+
+ const int64_t start_ns = MonotonicNanos();
+ {
+ std::lock_guard<std::mutex> lock(_io_mutex);
+ _read_clucene(offset, page->data(), nbytes);
+ }
+ const int64_t cost_ns = MonotonicNanos() - start_ns;
+
+ ++g_ivf_on_disk_cache_stats.miss_cnt;
+ DorisMetrics::instance()->ann_ivf_on_disk_fetch_page_cnt->increment(1);
+        DorisMetrics::instance()->ann_ivf_on_disk_fetch_page_costs_ms->increment(
+                static_cast<int64_t>(static_cast<double>(cost_ns) / 1000.0));
+
+ return page;
+ }
+
+ /// Low-level CLucene sequential read. Must be called under _io_mutex.
+ void _read_clucene(size_t offset, char* buf, size_t nbytes) const {
+ // Temporarily bind the caller's IOContext so that file-cache
+ // statistics are correctly attributed to the current query.
+ _input->setIoContext(g_current_io_ctx);
+ _input->seek(static_cast<int64_t>(offset));
+        const size_t kMaxChunk = static_cast<size_t>(std::numeric_limits<Int32>::max());
+        size_t done = 0;
+        while (done < nbytes) {
+            const size_t to_read = std::min(nbytes - done, kMaxChunk);
+            try {
+                _input->readBytes(reinterpret_cast<uint8_t*>(buf + done),
+                                  cast_set<Int32>(to_read));
+            } catch (const std::exception& e) {
+                _input->setIoContext(nullptr);
+                throw doris::Exception(doris::ErrorCode::IO_ERROR,
+                                       "CachedRandomAccessReader: read failed at offset {}: {}",
+                                       offset + done, e.what());
+            }
+            done += to_read;
+        }
+        _input->setIoContext(nullptr);
+    }
+
+ // ---- Members ----
+
+ mutable std::mutex _io_mutex;
+ mutable lucene::store::IndexInput* _input = nullptr;
+ std::string _cache_key_prefix;
+ size_t _file_size;
+};
+
doris::Status FaissVectorIndex::train(Int64 n, const float* vec) {
DCHECK(vec != nullptr);
DCHECK(_index != nullptr);
@@ -386,8 +567,13 @@ void FaissVectorIndex::build(const FaissBuildParameter& params) {
         hnsw_index->hnsw.efConstruction = params.ef_construction;
         _index = std::move(hnsw_index);
-    } else if (params.index_type == FaissBuildParameter::IndexType::IVF) {
-        set_type(AnnIndexType::IVF);
+    } else if (params.index_type == FaissBuildParameter::IndexType::IVF ||
+               params.index_type == FaissBuildParameter::IndexType::IVF_ON_DISK) {
+        if (params.index_type == FaissBuildParameter::IndexType::IVF) {
+            set_type(AnnIndexType::IVF);
+        } else {
+            set_type(AnnIndexType::IVF_ON_DISK);
+        }
         std::unique_ptr<faiss::Index> ivf_index;
         if (params.metric_type == FaissBuildParameter::MetricType::L2) {
             _quantizer = std::make_unique<faiss::IndexFlat>(params.dim, faiss::METRIC_L2);
@@ -452,6 +638,8 @@ void FaissVectorIndex::build(const FaissBuildParameter& params) {
 doris::Status FaissVectorIndex::ann_topn_search(const float* query_vec, int k,
                                                 const segment_v2::IndexSearchParameters& params,
                                                 segment_v2::IndexSearchResult& result) {
+ const int64_t cache_hit_cnt_before = g_ivf_on_disk_cache_stats.hit_cnt;
+ const int64_t cache_miss_cnt_before = g_ivf_on_disk_cache_stats.miss_cnt;
std::unique_ptr<float[]> distances_ptr;
std::unique_ptr<std::vector<faiss::idx_t>> labels_ptr;
{
@@ -488,7 +676,7 @@ doris::Status FaissVectorIndex::ann_topn_search(const float* query_vec, int k,
         param->bounded_queue = hnsw_params->bounded_queue;
         param->sel = id_sel.get();
         search_param.reset(param);
-    } else if (_index_type == AnnIndexType::IVF) {
+    } else if (_index_type == AnnIndexType::IVF || _index_type == AnnIndexType::IVF_ON_DISK) {
         const IVFSearchParameters* ivf_params = dynamic_cast<const IVFSearchParameters*>(&params);
         if (ivf_params == nullptr) {
             return doris::Status::InvalidArgument(
@@ -504,7 +692,13 @@ doris::Status FaissVectorIndex::ann_topn_search(const float* query_vec, int k,
     {
         SCOPED_RAW_TIMER(&result.engine_search_ns);
-        _index->search(1, query_vec, k, distances, labels, search_param.get());
+        ScopedIoCtxBinding io_ctx_binding(params.io_ctx);
+        try {
+            _index->search(1, query_vec, k, distances, labels, search_param.get());
+        } catch (faiss::FaissException& e) {
+            return doris::Status::RuntimeError("exception occurred during ann topn search: {}",
+                                               e.what());
+        }
}
{
SCOPED_RAW_TIMER(&result.engine_convert_ns);
@@ -538,6 +732,11 @@ doris::Status FaissVectorIndex::ann_topn_search(const float* query_vec, int k,
                 << "Row ids size: " << result.row_ids->size()
                 << ", roaring size: " << result.roaring->cardinality();
     }
+    if (_index_type == AnnIndexType::IVF_ON_DISK) {
+        result.ivf_on_disk_cache_hit_cnt = g_ivf_on_disk_cache_stats.hit_cnt - cache_hit_cnt_before;
+        result.ivf_on_disk_cache_miss_cnt =
+                g_ivf_on_disk_cache_stats.miss_cnt - cache_miss_cnt_before;
+    }
// distance/row_ids conversion above already timed via SCOPED_RAW_TIMER
return doris::Status::OK();
}
@@ -549,6 +748,8 @@ doris::Status FaissVectorIndex::ann_topn_search(const float* query_vec, int k,
 doris::Status FaissVectorIndex::range_search(const float* query_vec, const float& radius,
                                              const segment_v2::IndexSearchParameters& params,
                                              segment_v2::IndexSearchResult& result) {
+ const int64_t cache_hit_cnt_before = g_ivf_on_disk_cache_stats.hit_cnt;
+ const int64_t cache_miss_cnt_before = g_ivf_on_disk_cache_stats.miss_cnt;
DCHECK(_index != nullptr);
DCHECK(query_vec != nullptr);
DCHECK(params.roaring != nullptr)
@@ -593,15 +794,23 @@ doris::Status FaissVectorIndex::range_search(const float* query_vec, const float
{
// Engine search: FAISS range_search
SCOPED_RAW_TIMER(&result.engine_search_ns);
-        if (_metric == AnnIndexMetric::L2) {
-            if (radius <= 0) {
-                _index->range_search(1, query_vec, 0.0f, &native_search_result, search_param.get());
-            } else {
-                _index->range_search(1, query_vec, radius * radius, &native_search_result,
-                                     search_param.get());
-            }
-        } else if (_metric == AnnIndexMetric::IP) {
-            _index->range_search(1, query_vec, radius, &native_search_result, search_param.get());
-        }
+        ScopedIoCtxBinding io_ctx_binding(params.io_ctx);
+        try {
+            if (_metric == AnnIndexMetric::L2) {
+                if (radius <= 0) {
+                    _index->range_search(1, query_vec, 0.0f, &native_search_result,
+                                         search_param.get());
+                } else {
+                    _index->range_search(1, query_vec, radius * radius, &native_search_result,
+                                         search_param.get());
+                }
+            } else if (_metric == AnnIndexMetric::IP) {
+                _index->range_search(1, query_vec, radius, &native_search_result,
+                                     search_param.get());
+            }
+        } catch (faiss::FaissException& e) {
+            return doris::Status::RuntimeError("exception occurred during ann range search: {}",
+                                               e.what());
+        }
}
@@ -702,41 +911,217 @@ doris::Status FaissVectorIndex::range_search(const float* query_vec, const float
}
}
+    if (_index_type == AnnIndexType::IVF_ON_DISK) {
+        result.ivf_on_disk_cache_hit_cnt = g_ivf_on_disk_cache_stats.hit_cnt - cache_hit_cnt_before;
+        result.ivf_on_disk_cache_miss_cnt =
+                g_ivf_on_disk_cache_stats.miss_cnt - cache_miss_cnt_before;
+    }
+
return Status::OK();
}
doris::Status FaissVectorIndex::save(lucene::store::Directory* dir) {
auto start_time = std::chrono::high_resolution_clock::now();
-    lucene::store::IndexOutput* idx_output = dir->createOutput(faiss_index_fila_name);
-    auto writer = std::make_unique<FaissIndexWriter>(idx_output);
-    faiss::write_index(_index.get(), writer.get());
+    if (_index_type == AnnIndexType::IVF_ON_DISK) {
+        // IVF_ON_DISK: write ivf data to a separate file, then write index metadata.
+        //
+        // Why do we replace invlists here in save() instead of at build() time?
+        // During build/train/add, IndexIVF needs a writable ArrayInvertedLists to
+        // receive vectors via add_entries(). PreadInvertedLists inherits from
+        // ReadOnlyInvertedLists and does not support writes. The original
+        // OnDiskInvertedLists does support writes but requires mmap on a real file,
+        // which is unavailable at build time (Directory is only passed to save()).
+        // So the standard faiss pattern is: build in-memory with ArrayInvertedLists,
+        // then convert to on-disk format at serialization time. The replace_invlists
+        // call in Step 2 is purely a serialization format switch (to emit "ilod"
+        // fourcc instead of "ilar"), not a runtime data structure change.
+        //
+        // Step 1: The in-memory index has ArrayInvertedLists. Write them to ann.ivfdata
+        // by converting to OnDiskInvertedLists format:
+        //   For each list: [codes: capacity*code_size][ids: capacity*sizeof(idx_t)]
+        auto* ivf = dynamic_cast<faiss::IndexIVF*>(_index.get());
+        DCHECK(ivf != nullptr);
+        auto* ails = dynamic_cast<faiss::ArrayInvertedLists*>(ivf->invlists);
+        DCHECK(ails != nullptr);
+
+        const size_t nlist = ails->nlist;
+        const size_t code_size = ails->code_size;
+
+        // Build OnDiskOneList metadata and write data to ann.ivfdata
+        std::vector<faiss::OnDiskOneList> lists(nlist);
+        lucene::store::IndexOutput* ivfdata_output = dir->createOutput(faiss_ivfdata_file_name);
+        size_t offset = 0;
+        for (size_t i = 0; i < nlist; i++) {
+            size_t list_size = ails->list_size(i);
+            lists[i].size = list_size;
+            lists[i].capacity = list_size;
+            lists[i].offset = offset;
+
+            if (list_size > 0) {
+                // Write codes
+                const uint8_t* codes = ails->get_codes(i);
+                size_t codes_bytes = list_size * code_size;
+                ivfdata_output->writeBytes(reinterpret_cast<const uint8_t*>(codes),
+                                           cast_set<Int32>(codes_bytes));
+
+                // Write ids
+                const faiss::idx_t* ids = ails->get_ids(i);
+                size_t ids_bytes = list_size * sizeof(faiss::idx_t);
+                ivfdata_output->writeBytes(reinterpret_cast<const uint8_t*>(ids),
+                                           cast_set<Int32>(ids_bytes));
+            }
+
+            offset += list_size * (code_size + sizeof(faiss::idx_t));
+        }
+        size_t totsize = offset;
+        ivfdata_output->close();
+        delete ivfdata_output;
+
+        // Step 2: Replace ArrayInvertedLists with OnDiskInvertedLists so that
+        // write_index serializes in "ilod" format (metadata only).
+        auto* od = new faiss::OnDiskInvertedLists();
+        od->nlist = nlist;
+        od->code_size = code_size;
+        od->lists = std::move(lists);
+        od->totsize = totsize;
+        od->ptr = nullptr;
+        od->read_only = true;
+        // filename is not used during load (we use separate ivfdata file),
+        // but write it for format completeness.
+        od->filename = faiss_ivfdata_file_name;
+        ivf->replace_invlists(od, true);
+
+        // Step 3: Write index metadata to ann.faiss (includes "ilod" fourcc)
+        lucene::store::IndexOutput* idx_output = dir->createOutput(faiss_index_fila_name);
+        auto writer = std::make_unique<FaissIndexWriter>(idx_output);
+        faiss::write_index(_index.get(), writer.get());
+
+    } else {
+        // HNSW / IVF: write the full index to ann.faiss
+        lucene::store::IndexOutput* idx_output = dir->createOutput(faiss_index_fila_name);
+        auto writer = std::make_unique<FaissIndexWriter>(idx_output);
+        faiss::write_index(_index.get(), writer.get());
+    }
     auto end_time = std::chrono::high_resolution_clock::now();
     auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
     LOG_INFO(fmt::format("Faiss index saved to {}, {}, rows {}, cost {} ms", dir->toString(),
                          faiss_index_fila_name, _index->ntotal, duration.count()));
+
return doris::Status::OK();
}
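The save path above packs the inverted lists back to back in `ann.ivfdata`: list *i*'s codes, then its ids, with each list advancing the running offset by `list_size * (code_size + sizeof(faiss::idx_t))`. A minimal standalone sketch of that layout arithmetic (hypothetical helper names, not the Doris implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors the offset bookkeeping in the save path: each list stores
// list_size codes (code_size bytes each) immediately followed by
// list_size 8-byte ids (faiss::idx_t is int64_t); lists are packed
// back to back in ann.ivfdata.
struct ListSlot {
    size_t size;   // number of vectors in the list
    size_t offset; // byte offset of the list inside ann.ivfdata
};

std::vector<ListSlot> layout_ivfdata(const std::vector<size_t>& list_sizes,
                                     size_t code_size, size_t* totsize) {
    std::vector<ListSlot> slots(list_sizes.size());
    size_t offset = 0;
    for (size_t i = 0; i < list_sizes.size(); ++i) {
        slots[i] = {list_sizes[i], offset};
        // codes_bytes + ids_bytes for this list
        offset += list_sizes[i] * (code_size + sizeof(int64_t));
    }
    *totsize = offset; // becomes OnDiskInvertedLists::totsize
    return slots;
}
```

An empty list (size 0) consumes no bytes, so its offset equals the next list's offset, matching the `if (list_size > 0)` guard above.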
doris::Status FaissVectorIndex::load(lucene::store::Directory* dir) {
auto start_time = std::chrono::high_resolution_clock::now();
- lucene::store::IndexInput* idx_input = nullptr;
- try {
- idx_input = dir->openInput(faiss_index_fila_name);
- } catch (const CLuceneError& e) {
- return doris::Status::Error<doris::ErrorCode::INVERTED_INDEX_CLUCENE_ERROR>(
- "Failed to open index file: {}, error: {}", faiss_index_fila_name, e.what());
+
+ if (_index_type == AnnIndexType::IVF_ON_DISK) {
+ // IVF_ON_DISK load:
+ // 1. Read index metadata from ann.faiss with IO_FLAG_SKIP_IVF_DATA.
+ // This reads "ilod" metadata (lists[], slots[], etc.) without mmap.
+ // 2. Replace the OnDiskInvertedLists with PreadInvertedLists.
+ // 3. Open ann.ivfdata via CLucene IndexInput and bind as RandomAccessReader.
+ lucene::store::IndexInput* idx_input = nullptr;
+ try {
+ idx_input = dir->openInput(faiss_index_fila_name);
+ } catch (const CLuceneError& e) {
+ return doris::Status::Error<doris::ErrorCode::INVERTED_INDEX_CLUCENE_ERROR>(
+ "Failed to open index file: {}, error: {}", faiss_index_fila_name, e.what());
+ }
+
+ auto reader = std::make_unique<FaissIndexReader>(idx_input);
+ // IO_FLAG_SKIP_IVF_DATA: read only index metadata (quantizer, PQ codebook,
+ // inverted-list slot table) without loading the bulk inverted-list data.
+ //
+ // IO_FLAG_SKIP_PRECOMPUTE_TABLE: skip rebuilding the IVFPQ precomputed
+ // distance table. The table is NOT serialized on disk; FAISS recomputes
+ // it from quantizer centroids and PQ codebook on every read_index() call
+ // (see index_read.cpp read_ivfpq(), IndexIVFPQ.cpp precompute_table()).
+ //
+ // Per-segment table memory:
+ // logical_elems = nlist * pq.M * pq.ksub (floats)
+ // logical_bytes = logical_elems * sizeof(float) (bytes)
+ // actual_bytes = next_power_of_2(logical_elems) * sizeof(float)
+ // (AlignedTable rounds up to next power of two)
+ //
+ // Concrete example (nlist=1024, pq_m=384, pq_nbits=8, ksub=2^8=256):
+ // logical_elems = 1024 * 384 * 256 = 100,663,296
+ // logical_bytes = 100,663,296 * 4 = 402,653,184 (384 MiB)
+ // next_pow2 = 2^27 = 134,217,728
+ // actual_bytes = 134,217,728 * 4 = 536,870,912 (512 MiB per segment)
+ //
+ // With 146 segments loaded simultaneously:
+ // total = 146 * 512 MiB = 74,752 MiB (~73 GiB untracked RSS)
+ //
+ // FAISS has a per-table guard (precomputed_table_max_bytes = 2 GiB) but
+ // it checks only the single-table logical size (384 MiB << 2 GiB), so it
+ // cannot prevent the multi-segment accumulation.
+ //
+ // For IVF_ON_DISK the search bottleneck is disk/network I/O, not per-list
+ // CPU distance computation. Skipping the table makes search fall back to
+ // on-the-fly compute_residual() + compute_distance_table() per (query,
+ // inverted-list) pair, which is functionally identical and correct.
+ constexpr int kIVFOnDiskReadFlags =
+ faiss::IO_FLAG_SKIP_IVF_DATA | faiss::IO_FLAG_SKIP_PRECOMPUTE_TABLE;
+ faiss::Index* idx = faiss::read_index(reader.get(), kIVFOnDiskReadFlags);
+
+ // Replace OnDiskInvertedLists (metadata-only) with PreadInvertedLists
+ faiss::PreadInvertedLists* pread = faiss::replace_with_pread_invlists(idx);
+
+ // Open ann.ivfdata and bind the cached random access reader
+ lucene::store::IndexInput* ivfdata_input = nullptr;
+ try {
+ ivfdata_input = dir->openInput(faiss_ivfdata_file_name);
+ } catch (const CLuceneError& e) {
+ delete idx;
+ return doris::Status::Error<doris::ErrorCode::INVERTED_INDEX_CLUCENE_ERROR>(
+ "Failed to open ivfdata file: {}, error: {}", faiss_ivfdata_file_name, e.what());
+ }
+ const size_t ivfdata_size = static_cast<size_t>(ivfdata_input->length());
+ pread->set_reader(std::make_unique<CachedRandomAccessReader>(
+ ivfdata_input, _ivfdata_cache_key_prefix, ivfdata_size));
+
+ // Close the original input (CachedRandomAccessReader cloned it)
+ ivfdata_input->close();
+ _CLDELETE(ivfdata_input);
+
+ auto end_time = std::chrono::high_resolution_clock::now();
+ auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
+ VLOG_DEBUG << fmt::format("Load IVF_ON_DISK index from {} costs {} ms, rows {}", dir->getObjectName(), duration.count(), idx->ntotal);
+ _index.reset(idx);
+ DorisMetrics::instance()->ann_index_in_memory_rows_cnt->increment(_index->ntotal);
+ } else {
+ // HNSW / IVF: load the full index from ann.faiss
+ lucene::store::IndexInput* idx_input = nullptr;
+ try {
+ idx_input = dir->openInput(faiss_index_fila_name);
+ } catch (const CLuceneError& e) {
+ return doris::Status::Error<doris::ErrorCode::INVERTED_INDEX_CLUCENE_ERROR>(
+ "Failed to open index file: {}, error: {}", faiss_index_fila_name, e.what());
+ }
+
+ auto reader = std::make_unique<FaissIndexReader>(idx_input);
+ // IO_FLAG_SKIP_PRECOMPUTE_TABLE: skip rebuilding IVFPQ precomputed
+ // distance tables for in-memory IVF. Same rationale as IVF_ON_DISK –
+ // each segment's table can be hundreds of MiBs (see calculation above),
+ // and the per-list on-the-fly compute_residual() + compute_distance_table()
+ // is negligible compared to overall search cost.
+ // The flag is a no-op for HNSW (no IVF code path is reached).
+ const int read_flags =
+ (_index_type == AnnIndexType::IVF) ? faiss::IO_FLAG_SKIP_PRECOMPUTE_TABLE : 0;
+ faiss::Index* idx = faiss::read_index(reader.get(), read_flags);
+ auto end_time = std::chrono::high_resolution_clock::now();
+ auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
+ VLOG_DEBUG << fmt::format("Load index from {} costs {} ms, rows {}", dir->getObjectName(), duration.count(), idx->ntotal);
+ _index.reset(idx);
+ DorisMetrics::instance()->ann_index_in_memory_rows_cnt->increment(_index->ntotal);
}
- auto reader = std::make_unique<FaissIndexReader>(idx_input);
- faiss::Index* idx = faiss::read_index(reader.get());
- auto end_time = std::chrono::high_resolution_clock::now();
- auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
- VLOG_DEBUG << fmt::format("Load index from {} costs {} ms, rows {}", dir->getObjectName(), duration.count(), idx->ntotal);
- _index.reset(idx);
- DorisMetrics::instance()->ann_index_in_memory_rows_cnt->increment(_index->ntotal);
return doris::Status::OK();
}
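The precompute-table arithmetic in the load-path comment can be double-checked with a few lines of standalone code (hypothetical helpers mimicking the described AlignedTable rounding, not FAISS API):

```cpp
#include <cassert>
#include <cstddef>

// Round x up to the next power of two, as AlignedTable is described to do
// for the IVFPQ precomputed distance table.
size_t next_pow2(size_t x) {
    size_t p = 1;
    while (p < x) p <<= 1;
    return p;
}

// Actual resident bytes of one segment's precomputed table:
// nlist * pq_m * ksub floats, rounded up to a power-of-two element count.
size_t precomputed_table_bytes(size_t nlist, size_t pq_m, size_t pq_nbits) {
    size_t ksub = size_t(1) << pq_nbits;        // centroids per sub-quantizer
    size_t logical_elems = nlist * pq_m * ksub; // floats in the table
    return next_pow2(logical_elems) * sizeof(float);
}
```

For the values in the comment (nlist=1024, pq_m=384, pq_nbits=8) this reproduces the 512 MiB-per-segment figure, which is why the load path skips the table for both IVF and IVF_ON_DISK.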
diff --git a/be/src/storage/index/ann/faiss_ann_index.h b/be/src/storage/index/ann/faiss_ann_index.h
index 5560c6fc2d6..946810907d4 100644
--- a/be/src/storage/index/ann/faiss_ann_index.h
+++ b/be/src/storage/index/ann/faiss_ann_index.h
@@ -49,8 +49,9 @@ struct FaissBuildParameter {
* @brief Supported vector index types.
*/
enum class IndexType {
- HNSW, ///< Hierarchical Navigable Small World (HNSW) index for high performance
- IVF ///< Inverted File index
+ HNSW, ///< Hierarchical Navigable Small World (HNSW) index for high performance
+ IVF, ///< Inverted File index (in-memory)
+ IVF_ON_DISK ///< Inverted File index with on-disk inverted lists
};
/**
@@ -79,6 +80,8 @@ struct FaissBuildParameter {
return IndexType::HNSW;
} else if (type == "ivf") {
return IndexType::IVF;
+ } else if (type == "ivf_on_disk") {
+ return IndexType::IVF_ON_DISK;
} else {
throw doris::Exception(doris::ErrorCode::INVALID_ARGUMENT,
"Unsupported index type: {}",
type);
@@ -284,10 +287,21 @@ public:
*/
doris::Status load(lucene::store::Directory*) override;
+ /**
+ * @brief Sets a prefix string used as the cache key for ivfdata blocks.
+ *
+ * This must be called before load() when the index type is IVF_ON_DISK.
+ * The prefix should uniquely identify the index file (e.g. segment path).
+ */
+ void set_ivfdata_cache_key_prefix(std::string prefix) {
+ _ivfdata_cache_key_prefix = std::move(prefix);
+ }
+
private:
std::unique_ptr<faiss::Index> _index = nullptr; ///< Underlying FAISS index instance
std::unique_ptr<faiss::Index> _quantizer = nullptr;
- FaissBuildParameter _params; ///< Build parameters for the index
+ FaissBuildParameter _params; ///< Build parameters for the index
+ std::string _ivfdata_cache_key_prefix; ///< Cache key prefix for ivfdata blocks
};
#include "common/compile_check_end.h"
} // namespace doris::segment_v2
diff --git a/be/src/storage/index/index_file_reader.h b/be/src/storage/index/index_file_reader.h
index 72617446580..fb4ec2b9a62 100644
--- a/be/src/storage/index/index_file_reader.h
+++ b/be/src/storage/index/index_file_reader.h
@@ -42,6 +42,9 @@ namespace segment_v2 {
class ReaderFileEntry;
class DorisCompoundReader;
+// A singleton class responsible for reading index files, managing file entries, and providing interfaces to access index data.
+// The singleton object is at segment level, and it is shared by all threads that read the same segment (even across different queries).
+// It is created when the first index reader is initialized, and destroyed when the segment is closed.
class IndexFileReader {
public:
// Modern C++ using std::unordered_map with smart pointers for automatic memory management
diff --git a/be/src/storage/index/index_file_writer.cpp b/be/src/storage/index/index_file_writer.cpp
index 2d9c4014418..afd09c84620 100644
--- a/be/src/storage/index/index_file_writer.cpp
+++ b/be/src/storage/index/index_file_writer.cpp
@@ -145,7 +145,14 @@ Status IndexFileWriter::add_into_searcher_cache() {
 auto dir = DORIS_TRY(index_file_reader->_open(index_meta.first, index_meta.second));
 std::vector<std::string> file_names;
 dir->list(&file_names);
- if (file_names.size() == 1 && (file_names[0] == faiss_index_fila_name)) {
+ // Skip ANN indexes – they use FAISS files (ann.faiss, ann.ivfdata) instead of
+ // CLucene segments, so building an inverted-index searcher would fail.
+ // HNSW/IVF produces 1 file (ann.faiss); IVF_ON_DISK produces 2 (ann.faiss + ann.ivfdata).
+ bool is_ann_index =
+ std::any_of(file_names.begin(), file_names.end(), [](const std::string& f) {
+ return f == faiss_index_fila_name || f == faiss_ivfdata_file_name;
+ });
+ if (is_ann_index) {
 continue;
 }
 auto index_file_key = InvertedIndexDescriptor::get_index_file_cache_key(
diff --git a/be/src/storage/olap_common.h b/be/src/storage/olap_common.h
index d1df2059c75..e09146d0cde 100644
--- a/be/src/storage/olap_common.h
+++ b/be/src/storage/olap_common.h
@@ -380,6 +380,9 @@ struct OlapReaderStatistics {
int64_t ann_index_load_ns = 0;
int64_t ann_topn_search_ns = 0;
int64_t ann_index_topn_search_cnt = 0;
+ int64_t ann_ivf_on_disk_load_ns = 0;
+ int64_t ann_ivf_on_disk_cache_hit_cnt = 0;
+ int64_t ann_ivf_on_disk_cache_miss_cnt = 0;
// Detailed timing for ANN operations
int64_t ann_index_topn_engine_search_ns = 0; // time spent in engine for range search
diff --git a/be/src/storage/segment/segment_iterator.cpp b/be/src/storage/segment/segment_iterator.cpp
index 31eb4f07725..d1587feba9a 100644
--- a/be/src/storage/segment/segment_iterator.cpp
+++ b/be/src/storage/segment/segment_iterator.cpp
@@ -957,6 +957,10 @@ Status SegmentIterator::_apply_ann_topn_predicate() {
 _opts.stats->rows_ann_index_topn_filtered += rows_filterd;
 _opts.stats->ann_index_load_ns += ann_index_stats.load_index_costs_ns.value();
 _opts.stats->ann_topn_search_ns += ann_index_stats.search_costs_ns.value();
+ _opts.stats->ann_ivf_on_disk_load_ns += ann_index_stats.ivf_on_disk_load_costs_ns.value();
+ _opts.stats->ann_ivf_on_disk_cache_hit_cnt += ann_index_stats.ivf_on_disk_cache_hit_cnt.value();
+ _opts.stats->ann_ivf_on_disk_cache_miss_cnt += ann_index_stats.ivf_on_disk_cache_miss_cnt.value();
 _opts.stats->ann_index_topn_engine_search_ns += ann_index_stats.engine_search_ns.value();
 _opts.stats->ann_index_topn_result_process_ns += ann_index_stats.result_process_costs_ns.value();
@@ -1211,6 +1215,11 @@ Status SegmentIterator::_apply_index_expr() {
 _opts.stats->rows_ann_index_range_filtered += (origin_rows - _row_bitmap.cardinality());
 _opts.stats->ann_index_load_ns += ann_index_stats.load_index_costs_ns.value();
 _opts.stats->ann_index_range_search_ns += ann_index_stats.search_costs_ns.value();
+ _opts.stats->ann_ivf_on_disk_load_ns += ann_index_stats.ivf_on_disk_load_costs_ns.value();
+ _opts.stats->ann_ivf_on_disk_cache_hit_cnt += ann_index_stats.ivf_on_disk_cache_hit_cnt.value();
+ _opts.stats->ann_ivf_on_disk_cache_miss_cnt += ann_index_stats.ivf_on_disk_cache_miss_cnt.value();
 _opts.stats->ann_range_engine_search_ns += ann_index_stats.engine_search_ns.value();
 _opts.stats->ann_range_result_convert_ns += ann_index_stats.result_process_costs_ns.value();
 _opts.stats->ann_range_engine_convert_ns += ann_index_stats.engine_convert_ns.value();
diff --git a/conf/ubsan_ignorelist.txt b/conf/ubsan_ignorelist.txt
index 2b4998bad76..9fca0e59cdd 100644
--- a/conf/ubsan_ignorelist.txt
+++ b/conf/ubsan_ignorelist.txt
@@ -1,4 +1,10 @@
src:*/rapidjson/*
+# OpenBLAS blas_server_omp.c:exec_threads calls inner_thread through a
+# function pointer whose signature doesn't match the actual definition
+# (level3_thread.c:219). This is a known upstream issue in OpenBLAS and
+# is harmless on x86_64 (compatible calling convention), but triggers
+# UBSan's -fsanitize=function check.
+src:*/contrib/openblas/*
src:*/storage/key_coder.h
src:*/storage/types.h
src:*/storage/field.h
diff --git a/contrib/faiss b/contrib/faiss
index 0276f0bb7e4..032afe95f67 160000
--- a/contrib/faiss
+++ b/contrib/faiss
@@ -1 +1 @@
-Subproject commit 0276f0bb7e4fc0694fe44f450f4f5cfbcb4dd5cc
+Subproject commit 032afe95f671cd50b82d52d901345600776d7855
diff --git a/fe/fe-core/src/main/java/org/apache/doris/analysis/AnnIndexPropertiesChecker.java b/fe/fe-core/src/main/java/org/apache/doris/analysis/AnnIndexPropertiesChecker.java
index f03d04a7098..7a887df03ba 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/analysis/AnnIndexPropertiesChecker.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/analysis/AnnIndexPropertiesChecker.java
@@ -34,8 +34,9 @@ public class AnnIndexPropertiesChecker {
switch (key) {
case "index_type":
type = properties.get(key);
- if (!type.equals("hnsw") && !type.equals("ivf")) {
- throw new AnalysisException("only support ann index with type hnsw or ivf, got: " + type);
+ if (!type.equals("hnsw") && !type.equals("ivf") && !type.equals("ivf_on_disk")) {
+ throw new AnalysisException(
+ "only support ann index with type hnsw, ivf or ivf_on_disk, got: " + type);
}
break;
case "metric_type":
@@ -138,9 +139,9 @@ public class AnnIndexPropertiesChecker {
}
}
- if (type != null && type.equals("ivf")) {
+ if (type != null && (type.equals("ivf") || type.equals("ivf_on_disk"))) {
 if (nlist == 0) {
- throw new AnalysisException("nlist of ann index must be specified for ivf type");
+ throw new AnalysisException("nlist of ann index must be specified for ivf/ivf_on_disk type");
}
}
diff --git a/fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java b/fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java
index f5c80600178..8345f430c90 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java
@@ -3470,7 +3470,7 @@ public class SessionVariable implements Serializable, Writable {
 @VariableMgr.VarAttr(name = IVF_NPROBE, needForward = true,
 description = {"IVF 索引的 nprobe 参数,控制搜索时访问的聚类数量",
 "IVF index nprobe parameter, controls the number of clusters to search"})
- public int ivfNprobe = 1;
+ public int ivfNprobe = 32;
@VariableMgr.VarAttr(
name = DEFAULT_VARIANT_MAX_SUBCOLUMNS_COUNT,
diff --git a/gensrc/thrift/PaloInternalService.thrift
b/gensrc/thrift/PaloInternalService.thrift
index f2da2243ff5..c33c7565a27 100644
--- a/gensrc/thrift/PaloInternalService.thrift
+++ b/gensrc/thrift/PaloInternalService.thrift
@@ -418,7 +418,7 @@ struct TQueryOptions {
179: optional bool enable_parquet_filter_by_bloom_filter = true;
180: optional i32 max_file_scanners_concurrency = 0;
181: optional i32 min_file_scanners_concurrency = 0;
- 182: optional i32 ivf_nprobe = 1;
+ 182: optional i32 ivf_nprobe = 32;
 // Enable hybrid sorting: dynamically selects between PdqSort and TimSort based on
 // runtime profiling to choose the most efficient algorithm for the data pattern
183: optional bool enable_use_hybrid_sort = false;
diff --git a/regression-test/data/ann_index_p0/ivf_on_disk_index_test.out b/regression-test/data/ann_index_p0/ivf_on_disk_index_test.out
new file mode 100644
index 00000000000..ffeeb85e133
--- /dev/null
+++ b/regression-test/data/ann_index_p0/ivf_on_disk_index_test.out
@@ -0,0 +1,29 @@
+-- This file is automatically generated. You should know what you did if you want to edit this
+-- !sql --
+1 [1, 2, 3]
+2 [0.5, 2.1, 2.9]
+3 [10, 10, 10]
+4 [20, 20, 20]
+5 [50, 20, 20]
+6 [60, 20, 20]
+
+-- !sql --
+1 [1, 2, 3]
+2 [0.5, 2.1, 2.9]
+3 [10, 10, 10]
+4 [20, 20, 20]
+5 [50, 20, 20]
+6 [60, 20, 20]
+
+-- !sql --
+1 [1, 2, 3]
+2 [0.5, 2.1, 2.9]
+3 [10, 10, 10]
+4 [20, 20, 20]
+5 [50, 20, 20]
+6 [60, 20, 20]
+7 [5, 5, 5]
+8 [100, 100, 100]
+9 [0, 0, 0]
+10 [30, 30, 30]
+
diff --git a/regression-test/data/ann_index_p0/ivf_on_disk_stream_load.csv b/regression-test/data/ann_index_p0/ivf_on_disk_stream_load.csv
new file mode 100644
index 00000000000..df86b8021d1
--- /dev/null
+++ b/regression-test/data/ann_index_p0/ivf_on_disk_stream_load.csv
@@ -0,0 +1,6 @@
+1|[1.0,2.0,3.0]
+2|[0.5,2.1,2.9]
+3|[10.0,10.0,10.0]
+4|[20.0,20.0,20.0]
+5|[50.0,20.0,20.0]
+6|[60.0,20.0,20.0]
diff --git a/regression-test/data/ann_index_p0/ivf_on_disk_stream_load.json b/regression-test/data/ann_index_p0/ivf_on_disk_stream_load.json
new file mode 100644
index 00000000000..411fd5eadad
--- /dev/null
+++ b/regression-test/data/ann_index_p0/ivf_on_disk_stream_load.json
@@ -0,0 +1,8 @@
+[
+ {"id": 1, "embedding": [1.0, 2.0, 3.0]},
+ {"id": 2, "embedding": [0.5, 2.1, 2.9]},
+ {"id": 3, "embedding": [10.0, 10.0, 10.0]},
+ {"id": 4, "embedding": [20.0, 20.0, 20.0]},
+ {"id": 5, "embedding": [50.0, 20.0, 20.0]},
+ {"id": 6, "embedding": [60.0, 20.0, 20.0]}
+]
diff --git a/regression-test/suites/ann_index_p0/create_ann_index_test.groovy b/regression-test/suites/ann_index_p0/create_ann_index_test.groovy
index 2ceaf61f6c7..17f542de168 100644
--- a/regression-test/suites/ann_index_p0/create_ann_index_test.groovy
+++ b/regression-test/suites/ann_index_p0/create_ann_index_test.groovy
@@ -190,7 +190,7 @@ suite("create_ann_index_test") {
"replication_num" = "1"
);
"""
- exception "only support ann index with type hnsw or ivf"
+ exception "only support ann index with type hnsw, ivf or ivf_on_disk"
}
// metric_type is incorrect
@@ -386,4 +386,26 @@ suite("create_ann_index_test") {
exception "ANN index is not supported in index format V1"
}
+
+ // CREATE INDEX with ivf_on_disk type on existing table
+ sql "drop table if exists tbl_ann_ivf_on_disk_create_idx"
+ sql """
+ CREATE TABLE tbl_ann_ivf_on_disk_create_idx (
+ id INT NOT NULL COMMENT "",
+ embedding ARRAY<FLOAT> NOT NULL COMMENT ""
+ ) ENGINE=OLAP
+ DUPLICATE KEY(id) COMMENT "OLAP"
+ DISTRIBUTED BY HASH(id) BUCKETS 2
+ PROPERTIES (
+ "replication_num" = "1"
+ );
+ """
+ sql """
+ CREATE INDEX idx_ivf_on_disk ON tbl_ann_ivf_on_disk_create_idx (`embedding`) USING ANN PROPERTIES(
+ "index_type"="ivf_on_disk",
+ "metric_type"="l2_distance",
+ "dim"="128",
+ "nlist"="64"
+ );
+ """
}
diff --git a/regression-test/suites/ann_index_p0/create_tbl_with_ann_index_test.groovy b/regression-test/suites/ann_index_p0/create_tbl_with_ann_index_test.groovy
index 5677dc69789..35725cd4b46 100644
--- a/regression-test/suites/ann_index_p0/create_tbl_with_ann_index_test.groovy
+++ b/regression-test/suites/ann_index_p0/create_tbl_with_ann_index_test.groovy
@@ -294,4 +294,66 @@ suite("create_tbl_with_ann_index_test") {
);
"""
+ // Successfully create IVF_ON_DISK index - l2_distance
+ sql "drop table if exists ann_tbl15"
+ sql """
+ CREATE TABLE ann_tbl15 (
+ id INT NOT NULL COMMENT "",
+ vec ARRAY<FLOAT> NOT NULL COMMENT "",
+ INDEX ann_idx15 (vec) USING ANN PROPERTIES(
+ "index_type" = "ivf_on_disk",
+ "metric_type" = "l2_distance",
+ "dim" = "128",
+ "nlist" = "128"
+ )
+ ) ENGINE=OLAP
+ DUPLICATE KEY(id) COMMENT "OLAP"
+ DISTRIBUTED BY HASH(id) BUCKETS 2
+ PROPERTIES (
+ "replication_num" = "1"
+ );
+ """
+
+ // Successfully create IVF_ON_DISK index - inner_product
+ sql "drop table if exists ann_tbl16"
+ sql """
+ CREATE TABLE ann_tbl16 (
+ id INT NOT NULL COMMENT "",
+ vec ARRAY<FLOAT> NOT NULL COMMENT "",
+ INDEX ann_idx16 (vec) USING ANN PROPERTIES(
+ "index_type" = "ivf_on_disk",
+ "metric_type" = "inner_product",
+ "dim" = "128",
+ "nlist" = "64"
+ )
+ ) ENGINE=OLAP
+ DUPLICATE KEY(id) COMMENT "OLAP"
+ DISTRIBUTED BY HASH(id) BUCKETS 2
+ PROPERTIES (
+ "replication_num" = "1"
+ );
+ """
+
+ // IVF_ON_DISK missing nlist
+ sql "drop table if exists ann_tbl17"
+ test {
+ sql """
+ CREATE TABLE ann_tbl17 (
+ id INT NOT NULL COMMENT "",
+ vec ARRAY<FLOAT> NOT NULL COMMENT "",
+ INDEX ann_idx17 (vec) USING ANN PROPERTIES(
+ "index_type" = "ivf_on_disk",
+ "metric_type" = "l2_distance",
+ "dim" = "128"
+ )
+ ) ENGINE=OLAP
+ DUPLICATE KEY(id) COMMENT "OLAP"
+ DISTRIBUTED BY HASH(id) BUCKETS 2
+ PROPERTIES (
+ "replication_num" = "1"
+ );
+ """
+ exception "nlist of ann index must be specified for ivf/ivf_on_disk type"
+ }
+
}
diff --git a/regression-test/suites/ann_index_p0/ivf_index_test.groovy b/regression-test/suites/ann_index_p0/ivf_index_test.groovy
index 767cad27fcd..c806eed3068 100644
--- a/regression-test/suites/ann_index_p0/ivf_index_test.groovy
+++ b/regression-test/suites/ann_index_p0/ivf_index_test.groovy
@@ -66,7 +66,7 @@ suite ("ivf_index_test") {
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");
"""
- exception """nlist of ann index must be specified for ivf type"""
+ exception """nlist of ann index must be specified for ivf/ivf_on_disk type"""
}
sql """
diff --git a/regression-test/suites/ann_index_p0/ivf_on_disk_index_test.groovy b/regression-test/suites/ann_index_p0/ivf_on_disk_index_test.groovy
new file mode 100644
index 00000000000..ac77fa1c60b
--- /dev/null
+++ b/regression-test/suites/ann_index_p0/ivf_on_disk_index_test.groovy
@@ -0,0 +1,231 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+suite ("ivf_on_disk_index_test") {
+ sql "set enable_common_expr_pushdown=true;"
+
+ // ========== IVF_ON_DISK with L2 distance ==========
+ sql "drop table if exists tbl_ivf_on_disk_l2"
+ sql """
+ CREATE TABLE tbl_ivf_on_disk_l2 (
+ id INT NOT NULL,
+ embedding ARRAY<FLOAT> NOT NULL,
+ INDEX idx_emb (`embedding`) USING ANN PROPERTIES(
+ "index_type"="ivf_on_disk",
+ "metric_type"="l2_distance",
+ "nlist"="3",
+ "dim"="3"
+ )
+ ) ENGINE=OLAP
+ DUPLICATE KEY(id)
+ DISTRIBUTED BY HASH(id) BUCKETS 1
+ PROPERTIES ("replication_num" = "1");
+ """
+
+ sql """
+ INSERT INTO tbl_ivf_on_disk_l2 VALUES
+ (1, [1.0, 2.0, 3.0]),
+ (2, [0.5, 2.1, 2.9]),
+ (3, [10.0, 10.0, 10.0]),
+ (4, [20.0, 20.0, 20.0]),
+ (5, [50.0, 20.0, 20.0]),
+ (6, [60.0, 20.0, 20.0]);
+ """
+ qt_sql "select * from tbl_ivf_on_disk_l2;"
+ // approximate search with l2_distance
+ sql "select id, l2_distance_approximate(embedding, [1.0,2.0,3.0]) as dist from tbl_ivf_on_disk_l2 order by dist limit 2;"
+
+ // ========== Error: missing nlist for ivf_on_disk ==========
+ sql "drop table if exists tbl_ivf_on_disk_l2"
+ test {
+ sql """
+ CREATE TABLE tbl_ivf_on_disk_l2 (
+ id INT NOT NULL,
+ embedding ARRAY<FLOAT> NOT NULL,
+ INDEX idx_emb (`embedding`) USING ANN PROPERTIES(
+ "index_type"="ivf_on_disk",
+ "metric_type"="l2_distance",
+ "dim"="3"
+ )
+ ) ENGINE=OLAP
+ DUPLICATE KEY(id)
+ DISTRIBUTED BY HASH(id) BUCKETS 1
+ PROPERTIES ("replication_num" = "1");
+ """
+ exception """nlist of ann index must be specified for ivf/ivf_on_disk type"""
+ }
+
+ // ========== Error: not enough training points ==========
+ sql """
+ CREATE TABLE tbl_ivf_on_disk_l2 (
+ id INT NOT NULL,
+ embedding ARRAY<FLOAT> NOT NULL,
+ INDEX idx_emb (`embedding`) USING ANN PROPERTIES(
+ "index_type"="ivf_on_disk",
+ "metric_type"="l2_distance",
+ "nlist"="3",
+ "dim"="3"
+ )
+ ) ENGINE=OLAP
+ DUPLICATE KEY(id)
+ DISTRIBUTED BY HASH(id) BUCKETS 1
+ PROPERTIES ("replication_num" = "1");
+ """
+ test {
+ sql """
+ INSERT INTO tbl_ivf_on_disk_l2 VALUES
+ (1, [1.0, 2.0, 3.0]),
+ (2, [0.5, 2.1, 2.9]);
+ """
+ exception """exception occurred during training"""
+ }
+
+ // ========== IVF_ON_DISK with inner product ==========
+ sql "drop table if exists tbl_ivf_on_disk_ip"
+ sql """
+ CREATE TABLE tbl_ivf_on_disk_ip (
+ id INT NOT NULL,
+ embedding ARRAY<FLOAT> NOT NULL,
+ INDEX idx_emb (`embedding`) USING ANN PROPERTIES(
+ "index_type"="ivf_on_disk",
+ "metric_type"="inner_product",
+ "nlist"="3",
+ "dim"="3"
+ )
+ ) ENGINE=OLAP
+ DUPLICATE KEY(id)
+ DISTRIBUTED BY HASH(id) BUCKETS 1
+ PROPERTIES ("replication_num" = "1");
+ """
+
+ sql """
+ INSERT INTO tbl_ivf_on_disk_ip VALUES
+ (1, [1.0, 2.0, 3.0]),
+ (2, [0.5, 2.1, 2.9]),
+ (3, [10.0, 10.0, 10.0]),
+ (4, [20.0, 20.0, 20.0]),
+ (5, [50.0, 20.0, 20.0]),
+ (6, [60.0, 20.0, 20.0]);
+ """
+ qt_sql "select * from tbl_ivf_on_disk_ip;"
+ // approximate search with inner_product
+ sql "select id, inner_product_approximate(embedding, [1.0,2.0,3.0]) as dist from tbl_ivf_on_disk_ip order by dist desc limit 2;"
+
+ // ========== IVF_ON_DISK with stream load ==========
+ sql "drop table if exists tbl_ivf_on_disk_stream_load"
+ sql """
+ CREATE TABLE tbl_ivf_on_disk_stream_load (
+ id INT NOT NULL,
+ embedding ARRAY<FLOAT> NOT NULL,
+ INDEX idx_emb (`embedding`) USING ANN PROPERTIES(
+ "index_type"="ivf_on_disk",
+ "metric_type"="l2_distance",
+ "nlist"="3",
+ "dim"="3"
+ )
+ ) ENGINE=OLAP
+ DUPLICATE KEY(id)
+ DISTRIBUTED BY HASH(id) BUCKETS 1
+ PROPERTIES ("replication_num" = "1");
+ """
+
+ // Stream load to ivf_on_disk should succeed.
+ streamLoad {
+ table "tbl_ivf_on_disk_stream_load"
+ file "ivf_on_disk_stream_load.json"
+ set 'format', 'json'
+ set 'strip_outer_array', 'true'
+ time 10000
+
+ check { result, exception, startTime, endTime ->
+ if (exception != null) {
+ throw exception
+ }
+ log.info("Stream load result: ${result}".toString())
+ def json = parseJson(result)
+ assertEquals("success", json.Status.toLowerCase())
+ assertEquals(6, json.NumberTotalRows)
+ assertEquals(6, json.NumberLoadedRows)
+ assertEquals(0, json.NumberFilteredRows)
+ }
+ }
+
+ // ========== IVF_ON_DISK with larger dataset (more rows than nlist) ==========
+ sql "drop table if exists tbl_ivf_on_disk_large"
+ sql """
+ CREATE TABLE tbl_ivf_on_disk_large (
+ id INT NOT NULL,
+ embedding ARRAY<FLOAT> NOT NULL,
+ INDEX idx_emb (`embedding`) USING ANN PROPERTIES(
+ "index_type"="ivf_on_disk",
+ "metric_type"="l2_distance",
+ "nlist"="4",
+ "dim"="3"
+ )
+ ) ENGINE=OLAP
+ DUPLICATE KEY(id)
+ DISTRIBUTED BY HASH(id) BUCKETS 1
+ PROPERTIES ("replication_num" = "1");
+ """
+
+ sql """
+ INSERT INTO tbl_ivf_on_disk_large VALUES
+ (1, [1.0, 2.0, 3.0]),
+ (2, [0.5, 2.1, 2.9]),
+ (3, [10.0, 10.0, 10.0]),
+ (4, [20.0, 20.0, 20.0]),
+ (5, [50.0, 20.0, 20.0]),
+ (6, [60.0, 20.0, 20.0]),
+ (7, [5.0, 5.0, 5.0]),
+ (8, [100.0, 100.0, 100.0]),
+ (9, [0.0, 0.0, 0.0]),
+ (10, [30.0, 30.0, 30.0]);
+ """
+ qt_sql "select * from tbl_ivf_on_disk_large order by id;"
+ // approximate search on larger dataset
+ sql "select id, l2_distance_approximate(embedding, [1.0,2.0,3.0]) as dist from tbl_ivf_on_disk_large order by dist limit 3;"
+
+ // ========== IVF_ON_DISK range search with l2_distance ==========
+ sql "drop table if exists tbl_ivf_on_disk_range"
+ sql """
+ CREATE TABLE tbl_ivf_on_disk_range (
+ id INT NOT NULL,
+ embedding ARRAY<FLOAT> NOT NULL,
+ INDEX idx_emb (`embedding`) USING ANN PROPERTIES(
+ "index_type"="ivf_on_disk",
+ "metric_type"="l2_distance",
+ "nlist"="3",
+ "dim"="3"
+ )
+ ) ENGINE=OLAP
+ DUPLICATE KEY(id)
+ DISTRIBUTED BY HASH(id) BUCKETS 1
+ PROPERTIES ("replication_num" = "1");
+ """
+
+ sql """
+ INSERT INTO tbl_ivf_on_disk_range VALUES
+ (1, [1.0, 2.0, 3.0]),
+ (2, [0.5, 2.1, 2.9]),
+ (3, [10.0, 10.0, 10.0]),
+ (4, [20.0, 20.0, 20.0]),
+ (5, [50.0, 20.0, 20.0]),
+ (6, [60.0, 20.0, 20.0]);
+ """
+ // range search: find vectors within distance threshold
+ sql "select id from tbl_ivf_on_disk_range where l2_distance_approximate(embedding, [1.0, 2.0, 3.0]) < 20.0 order by id;"
+}
diff --git a/regression-test/suites/ann_index_p0/ivf_on_disk_stream_load.csv b/regression-test/suites/ann_index_p0/ivf_on_disk_stream_load.csv
new file mode 100644
index 00000000000..df86b8021d1
--- /dev/null
+++ b/regression-test/suites/ann_index_p0/ivf_on_disk_stream_load.csv
@@ -0,0 +1,6 @@
+1|[1.0,2.0,3.0]
+2|[0.5,2.1,2.9]
+3|[10.0,10.0,10.0]
+4|[20.0,20.0,20.0]
+5|[50.0,20.0,20.0]
+6|[60.0,20.0,20.0]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]