This is an automated email from the ASF dual-hosted git repository.
morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 1ad85edddee IVF in ann index (#3136)
1ad85edddee is described below
commit 1ad85edddeed8e769f1ec18f91ffc1a7f1d74a87
Author: ivin <[email protected]>
AuthorDate: Tue Dec 2 12:31:04 2025 +0800
IVF in ann index (#3136)
## Versions
- [x] dev
- [x] 4.x
- [ ] 3.x
- [ ] 2.1
## Languages
- [x] Chinese
- [x] English
## Docs Checklist
- [ ] Checked by AI
- [ ] Test Cases Built
---
docs/ai/vector-search/hnsw.md | 4 +-
docs/ai/vector-search/ivf.md | 366 +++++++++++++++++++++
.../current/ai/vector-search/hnsw.md | 4 +-
.../current/ai/vector-search/ivf.md | 349 ++++++++++++++++++++
.../version-4.x/ai/vector-search/hnsw.md | 4 +-
.../version-4.x/ai/vector-search/ivf.md | 349 ++++++++++++++++++++
.../dataset-points-query-clusters.png | Bin 0 -> 351201 bytes
.../version-4.x/ai/vector-search/hnsw.md | 4 +-
versioned_docs/version-4.x/ai/vector-search/ivf.md | 366 +++++++++++++++++++++
9 files changed, 1438 insertions(+), 8 deletions(-)
diff --git a/docs/ai/vector-search/hnsw.md b/docs/ai/vector-search/hnsw.md
index f865046e216..e433493205b 100644
--- a/docs/ai/vector-search/hnsw.md
+++ b/docs/ai/vector-search/hnsw.md
@@ -190,7 +190,7 @@ from doris_vector_search import DorisVectorClient,
AuthOptions
auth = AuthOptions(
host="localhost",
- query_port=8030,
+ query_port=9030,
user="root",
password="",
)
@@ -377,7 +377,7 @@ The load generator runs on another 16‑core machine.
Benchmark command:
```bash
-NUM_PER_BATCH=1000000 python3.11 -m vectordbbench doris --host 127.0.0.1 --port 9030 --case-type Performance768D1M --db-name Performance768D1M --search-concurrent --search-serial --num-concurrency 10,40,80 --stream-load-rows-per-batch 500000 --index-prop max_degree=128,ef_construction=512 --session-var ef_search=128
+NUM_PER_BATCH=1000000 python3.11 -m vectordbbench doris --host 127.0.0.1 --port 9030 --case-type Performance768D1M --db-name Performance768D1M --search-concurrent --search-serial --num-concurrency 10,40,80 --stream-load-rows-per-batch 500000 --index-prop max_degree=128,ef_construction=512 --session-var hnsw_ef_search=128
```
| | Doris (FE/BE separate) | Doris (FE/BE mixed) |
diff --git a/docs/ai/vector-search/ivf.md b/docs/ai/vector-search/ivf.md
new file mode 100644
index 00000000000..71a009e209f
--- /dev/null
+++ b/docs/ai/vector-search/ivf.md
@@ -0,0 +1,366 @@
+---
+{
+ "title": "IVF",
+ "language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# IVF and How to Use It in Apache Doris
+
+
+The IVF index is an efficient data structure for Approximate Nearest Neighbor (ANN) search. It narrows the set of vectors examined during a search, significantly improving search speed. Apache Doris supports an IVF-based ANN index starting from version 4.x. This document walks through the IVF algorithm, its key parameters, and engineering practices, and explains how to build and tune IVF-based ANN indexes in production Doris clusters.
+
+## What is IVF index?
+
+For completeness, here’s some historical context. The term IVF (inverted file)
originates from information retrieval.
+
+Consider a simple example of a few text documents. To search documents that
contain a given word, a **forward index** stores a list of words for each
document. You must read each document explicitly to find the relevant ones.
+
+
+|Document|Words|
+|---|---|
+|Document 1|the,cow,says,moo|
+|Document 2|the,cat,and,the,hat|
+|Document 3|the,dish,ran,away,with,the,spoon|
+
+In contrast, an **inverted index** would contain a dictionary of all the words
that you can search, and for each word, you have a list of document indices
where the word occurs. This is the inverted list (inverted file), and it
enables you to restrict the search to the selected lists.
+
+
+| Word | Documents                          |
+| ---- | ---------------------------------- |
+| the  | Document 1, Document 2, Document 3 |
+| cow  | Document 1                         |
+| says | Document 1                         |
+| moo  | Document 1                         |
+
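To make the contrast concrete, here is a minimal sketch (plain Python, not Doris code) of how an inverted index maps each word to the documents that contain it, built from the forward-index table above:

```python
# Toy inverted index over the documents from the forward-index table.
docs = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

inverted = {}
for doc_id, words in docs.items():
    for word in words:
        # Each word points at the set of documents that contain it.
        inverted.setdefault(word, set()).add(doc_id)

# A lookup only touches the posting list for the queried word.
print(sorted(inverted["cow"]))  # ['Document 1']
```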
+
+Today, text data is often represented as vector embeddings. The IVF method
defines cluster centers and these centers are analogous to the dictionary of
words in the preceding example. For each cluster center, you have a list of
vector indices that belong to the cluster, and search is accelerated because
you only have to inspect the selected clusters.
+
+
+## Using IVF indexes for efficient vector search
+
+As datasets grow to millions or even billions of vectors, an exhaustive exact k-nearest neighbor (kNN) search, which computes the distance between the query and every single vector in the database, becomes computationally prohibitive. This brute-force approach is equivalent to a large matrix multiplication and does not scale.
+
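For reference, the brute-force scan described above can be sketched in a few lines of NumPy (toy random data, not a Doris API); every query touches every vector:

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 16)).astype(np.float32)  # 1000 vectors, dim 16
query = rng.standard_normal(16).astype(np.float32)

# Exact kNN: squared L2 distance from the query to every vector in the dataset.
d2 = ((base - query) ** 2).sum(axis=1)
top10 = np.argsort(d2)[:10]  # indices of the 10 nearest vectors
```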
+Fortunately, many applications can trade a small amount of accuracy for a
massive gain in speed. This is the domain of Approximate Nearest Neighbor (ANN)
search, and the Inverted File (IVF) index is one of the most widely used and
effective ANN methods.
+
+The fundamental principle behind IVF is "partition and conquer." Instead of
searching the entire dataset, IVF intelligently narrows the search scope to a
few promising regions, drastically reducing the number of comparisons needed.
+
+IVF works by partitioning a large dataset of vectors into smaller, manageable
clusters, each represented by a central point called a "centroid." These
centroids act as anchors for their respective partitions. During a search, the
system quickly identifies the clusters whose centroids are closest to the query
vector and only searches within those, ignoring the rest of the dataset.
+
+
+
+
+
+## IVF in Apache Doris
+
+Apache Doris supports building IVF‑based ANN indexes starting from version 4.x.
+
+### Index Construction
+
+The index type used here is ANN. There are two ways to create an ANN index:
you can define it when creating the table, or you can use the `CREATE/BUILD
INDEX` syntax. The two approaches differ in how and when the index is built,
and therefore fit different scenarios.
+
+Approach 1: define an ANN index on a vector column when creating the table. As
data is loaded, an ANN index is built for each segment as it is created. The
advantage is that once data loading completes, the index is already built and
queries can immediately use it for acceleration. The downside is that
synchronous index building slows down data ingestion and may cause extra index
rebuilds during compaction, leading to some waste of resources.
+
+```sql
+CREATE TABLE sift_1M (
+ id int NOT NULL,
+ embedding array<float> NOT NULL COMMENT "",
+ INDEX ann_index (embedding) USING ANN PROPERTIES(
+ "index_type"="ivf",
+ "metric_type"="l2_distance",
+ "dim"="128",
+ "nlist"="1024"
+ )
+) ENGINE=OLAP
+DUPLICATE KEY(id) COMMENT "OLAP"
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES (
+ "replication_num" = "1"
+);
+
+INSERT INTO sift_1M
+SELECT *
+FROM S3(
+    "uri" = "https://selectdb-customers-tools-bj.oss-cn-beijing.aliyuncs.com/sift_database.tsv",
+ "format" = "csv");
+```
+
+#### CREATE/BUILD INDEX
+
+Approach 2: `CREATE/BUILD INDEX`.
+
+```sql
+CREATE TABLE sift_1M (
+ id int NOT NULL,
+ embedding array<float> NOT NULL COMMENT ""
+) ENGINE=OLAP
+DUPLICATE KEY(id) COMMENT "OLAP"
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES (
+ "replication_num" = "1"
+);
+
+INSERT INTO sift_1M
+SELECT *
+FROM S3(
+    "uri" = "https://selectdb-customers-tools-bj.oss-cn-beijing.aliyuncs.com/sift_database.tsv",
+ "format" = "csv");
+```
+
+After data is loaded, you can run `CREATE INDEX`. At this point the index is
defined on the table, but no index is yet built for the existing data.
+
+```sql
+CREATE INDEX idx_test_ann ON sift_1M (`embedding`) USING ANN PROPERTIES (
+ "index_type"="ivf",
+ "metric_type"="l2_distance",
+ "dim"="128",
+ "nlist"="1024"
+);
+
+SHOW DATA ALL FROM sift_1M;
+
+mysql> SHOW DATA ALL FROM sift_1M;
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| TableName | IndexName | ReplicaCount | RowCount | LocalTotalSize | LocalDataSize | LocalIndexSize | RemoteTotalSize | RemoteDataSize | RemoteIndexSize |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| sift_1M   | sift_1M   | 10           | 1000000  | 170.093 MB     | 170.093 MB    | 0.000          | 0.000           | 0.000          | 0.000           |
+|           | Total     | 10           |          | 170.093 MB     | 170.093 MB    | 0.000          | 0.000           | 0.000          | 0.000           |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+```
+
+Then you can build the index using the `BUILD INDEX` statement:
+
+```sql
+BUILD INDEX idx_test_ann ON sift_1M;
+```
+
+`BUILD INDEX` is executed asynchronously. You can use `SHOW BUILD INDEX` (in
some versions `SHOW ALTER`) to check the job status.
+
+
+```sql
+SHOW BUILD INDEX WHERE TableName = "sift_1M";
+
+mysql> SHOW BUILD INDEX WHERE TableName = "sift_1M";
++---------------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+| JobId         | TableName | PartitionName | AlterInvertedIndexes                                                                                                                                | CreateTime              | FinishTime              | TransactionId | State    | Msg  | Progress |
++---------------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+| 1764392359610 | sift_1M   | sift_1M       | [ADD INDEX idx_test_ann (`embedding`) USING ANN PROPERTIES("dim" = "128", "index_type" = "ivf", "metric_type" = "l2_distance", "nlist" = "1024")],  | 2025-12-01 14:18:22.360 | 2025-12-01 14:18:27.885 | 5036          | FINISHED |      | NULL     |
++---------------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+1 row in set (0.00 sec)
+
+mysql> SHOW DATA ALL FROM sift_1M;
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| TableName | IndexName | ReplicaCount | RowCount | LocalTotalSize | LocalDataSize | LocalIndexSize | RemoteTotalSize | RemoteDataSize | RemoteIndexSize |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| sift_1M   | sift_1M   | 10           | 1000000  | 671.084 MB     | 170.093 MB    | 500.991 MB     | 0.000           | 0.000          | 0.000           |
+|           | Total     | 10           |          | 671.084 MB     | 170.093 MB    | 500.991 MB     | 0.000           | 0.000          | 0.000           |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+2 rows in set (0.00 sec)
+```
+
+#### DROP INDEX
+
+You can drop an unsuitable ANN index with `ALTER TABLE sift_1M DROP INDEX
idx_test_ann`. Dropping and recreating indexes is common during hyperparameter
tuning, when you need to test different parameter combinations to achieve the
desired recall.
+
+
+### Querying
+
+ANN indexes support both Top‑N search and range search.
+
+When the vector column has high dimensionality, the literal representation of the query vector can incur noticeable parsing overhead. Therefore, embedding the full query vector directly into raw SQL is not recommended in production, especially under high concurrency. A better practice is to use prepared statements, which avoid repeated SQL parsing.
+
+We recommend the [doris-vector-search](https://github.com/uchenily/doris_vector_search) Python library, which wraps the operations needed for vector search in Doris on top of prepared statements, and includes data conversion utilities that map Doris query results into Pandas `DataFrame`s for convenient downstream AI application development.
+
+
+```python
+from doris_vector_search import DorisVectorClient, AuthOptions
+
+auth = AuthOptions(
+ host="127.0.0.1",
+ query_port=9030,
+ user="root",
+ password="",
+)
+
+client = DorisVectorClient(database="test", auth_options=auth)
+
+tbl = client.open_table("sift_1M")
+
+query = [0.1] * 128 # Example 128-dimensional vector
+
+# SELECT id FROM sift_1M ORDER BY l2_distance_approximate(embedding, query) LIMIT 10;
+result = tbl.search(query, metric_type="l2_distance").limit(10).select(["id"]).to_pandas()
+
+print(result)
+```
+
+
+Sample output:
+
+```text
+ id
+0 123911
+1 926855
+2 123739
+3 73311
+4 124493
+5 153178
+6 126138
+7 123740
+8 125741
+9 124048
+```
+
+
+### Recall Optimization
+
+
+In vector search, recall is the most important metric; performance numbers are only meaningful at a given recall level. The main factors that affect recall are:
+
+1. The index-time IVF parameter (`nlist`) and the query-time parameter (`nprobe`).
+2. Vector quantization.
+3. Segment size and the number of segments.
+
+This article focuses on the impact of (1) and (3) on recall. Vector
quantization will be covered in a separate document.
+
+
+#### Index Hyperparameters
+
+An IVF index organizes vectors into multiple clusters. During index
construction, vectors are partitioned into groups using clustering. The search
process then focuses only on the most relevant clusters. The workflow is
roughly as follows:
+
+At index time:
+
+1. **Clustering**: All vectors are partitioned into `nlist` clusters using a
clustering algorithm (e.g., k‑means). The centroid of each cluster is computed
and stored.
+2. **Vector assignment**: Each vector is assigned to the cluster whose
centroid is closest to it, and the vector is added to that cluster’s inverted
list.
+
+At query time:
+
+1. **Cluster selection using nprobe**: For a query vector, distances to all
`nlist` centroids are computed. Only the `nprobe` closest clusters are selected
for searching.
+2. **Exhaustive search within selected clusters**: The query is compared
against every vector in the selected nprobe clusters to find the nearest
neighbors.
+
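The index-time and query-time steps above can be sketched with NumPy. This is a toy illustration of the algorithm, not how Doris implements it internally, and the name `ivf_search` is made up for this example:

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal((10_000, 16)).astype(np.float32)
nlist = 64

def sq_dist(a, b):
    # Pairwise squared L2 distances between rows of a and rows of b.
    return (a ** 2).sum(1)[:, None] - 2.0 * a @ b.T + (b ** 2).sum(1)[None, :]

# Index time, step 1: k-means clustering into nlist clusters.
centroids = base[rng.choice(len(base), nlist, replace=False)].copy()
for _ in range(10):
    assign = sq_dist(base, centroids).argmin(axis=1)
    for c in range(nlist):
        members = base[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Index time, step 2: one inverted list of row ids per centroid.
inverted_lists = [np.flatnonzero(assign == c) for c in range(nlist)]

def ivf_search(query, k=10, nprobe=8):
    # Query time, step 1: pick the nprobe closest centroids.
    probes = np.argsort(((centroids - query) ** 2).sum(axis=1))[:nprobe]
    # Query time, step 2: exhaustive search within the selected clusters only.
    cand = np.concatenate([inverted_lists[c] for c in probes])
    d2 = ((base[cand] - query) ** 2).sum(axis=1)
    return cand[np.argsort(d2)[:k]]
```

With `nprobe = nlist` every cluster is probed and the result matches an exact scan; smaller `nprobe` values examine fewer vectors and trade recall for speed.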
+In summary:
+
+`nlist` defines the number of clusters (inverted lists). It affects recall,
memory overhead, and build time. A larger `nlist` creates finer‑grained
clusters, which can improve search speed if the query’s nearest neighbors are
well‑localized, but it also increases the cost of clustering and the risk of
neighbors being spread across multiple clusters.
+
+`nprobe` defines the number of clusters to search during a query. A larger `nprobe` increases recall but also query latency, since more vectors are examined. A smaller `nprobe` makes queries faster but may miss neighbors that reside in non-probed clusters.
+
+
+By default, Doris uses `nlist = 1024` and `nprobe = 64`.
+
+
+The above is a qualitative analysis of these two hyperparameters. The
following table shows empirical results on the SIFT_1M dataset:
+
+
+| nlist | nprobe | recall_at_100 |
+| ----- | ------ | ------------- |
+| 1024 | 64 | 0.9542 |
+| 1024 | 32 | 0.9034 |
+| 1024 | 16 | 0.8299 |
+| 1024 | 8 | 0.7337 |
+| 512 | 32 | 0.9384 |
+| 512 | 16 | 0.8763 |
+| 512 | 8 | 0.7869 |
+
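For reference, `recall_at_100` in the table is the standard recall measure: the fraction of the true 100 nearest neighbors that the approximate search actually returns. A minimal sketch:

```python
def recall_at_k(approx_ids, exact_ids, k=100):
    """Fraction of the exact top-k neighbors recovered by the approximate search."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# If the index returns 95 of the true top-100 ids, recall_at_100 is 0.95.
print(recall_at_k(list(range(95)) + [200, 201, 202, 203, 204], list(range(100))))
```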
+
+It is hard to give a single optimal setting in advance, but a practical workflow for hyperparameter selection is:
+
+1. Create a table `table_multi_index` without indexes. It can contain 2 or 3
vector columns.
+2. Load data into `table_multi_index` using Stream Load or other ingestion
methods.
+3. Use `CREATE INDEX` and `BUILD INDEX` to build ANN indexes on all vector
columns.
+4. Use different index parameter configurations on different columns. After
index building finishes, compute recall on each column and choose the best
parameter combination.
+
+For example:
+
+```sql
+ALTER TABLE tbl DROP INDEX idx_embedding;
+CREATE INDEX idx_embedding ON tbl (`embedding`) USING ANN PROPERTIES (
+ "index_type"="ivf",
+ "metric_type"="inner_product",
+ "dim"="768",
+ "nlist"="1024"
+);
+BUILD INDEX idx_embedding ON tbl;
+```
+
+
+#### Number of Rows Covered per Index
+
+
+Internally, Doris organizes data in multiple layers.
+
+- At the top is a **table**, which is partitioned into N **tablets** using a
distribution key. Tablets serve as units for data sharding, relocation, and
rebalance.
+- Each data ingestion or compaction produces a new **rowset** under a tablet.
A rowset is a versioned collection of data.
+- Data in a rowset is actually stored in **segment** files.
+
+Similar to inverted indexes, vector indexes are built at the **segment**
level. The segment size is determined by BE configuration options like
`write_buffer_size` and `vertical_compaction_max_segment_size`. During
ingestion and compaction, when the in‑memory memtable reaches a certain size,
it is flushed to disk as a segment file, and a vector index (or multiple
indexes for multiple vector columns) is built for that segment. The index only
covers the rows in that segment.
+
+Given a fixed set of IVF parameters, there is always a limit to the number of
vectors for which the index can still maintain high recall. Once the number of
vectors in a segment grows beyond that limit, recall starts to degrade.
+
+
+
+> You can use `SHOW TABLETS FROM table` to inspect the compaction status of a
table. By following the corresponding URL, you can see how many segments it has.
+
+#### Impact of Compaction on Recall
+
+Compaction can affect recall because it may create larger segments, which can
exceed the “coverage capacity” implied by the original hyperparameters. As a
result, the recall level achieved before compaction may no longer hold after
compaction.
+
+We recommend triggering a full compaction before running `BUILD INDEX`.
Building indexes on fully compacted segments stabilizes recall and also reduces
write amplification caused by index rebuilds.
+
+### Query Performance
+
+#### Cold Loading of Index Files
+
+The IVF ANN index in Doris is implemented using Meta’s open‑source library
[Faiss](https://github.com/facebookresearch/faiss). IVF indexes become
effective after being loaded into memory. Therefore, before running
high‑concurrency workloads, it is recommended to run some warm‑up queries to
make sure that all relevant segment indexes are loaded into memory; otherwise,
disk I/O overhead can significantly hurt query performance.
+
+#### Memory Footprint vs. Performance
+
+Without quantization or compression, the memory footprint of an IVF index is
roughly 1.02-1.1× the memory footprint of all vectors it indexes.
+
+For example, with 1 million 128‑dimensional vectors, an IVF-FLAT index
requires approximately:
+
+`128 * 4 * 1,000,000 * 1.02 ≈ 500 MB`.
+
+Some reference values:
+
+| dim | rows | estimated memory |
+|-----|------|------------------|
+| 128 | 1M | 496 MB |
+| 768 | 1M | 2.9 GB |
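The estimates above follow directly from the vector size: `dim * 4` bytes per float32 component, times the row count, times a small IVF bookkeeping overhead (assumed 1.02 here; the helper name is made up for this sketch):

```python
def ivf_flat_memory_bytes(dim, rows, overhead=1.02):
    # 4 bytes per float32 component, plus ~2% IVF bookkeeping overhead.
    return dim * 4 * rows * overhead

print(f"{ivf_flat_memory_bytes(128, 1_000_000) / 2**20:.0f} MiB")  # ~498 MiB
print(f"{ivf_flat_memory_bytes(768, 1_000_000) / 2**30:.2f} GiB")  # ~2.92 GiB
```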
+
+To maintain stable performance, ensure that each BE has enough memory;
otherwise, frequent swapping and I/O on index files will severely degrade query
latency.
+
+### Benchmark
+
+When benchmarking, the deployment should mirror the production setup, with FE and BE deployed separately and the client running on another independent machine.
+
+You can use [VectorDBBench](https://github.com/zilliztech/VectorDBBench) as the benchmark framework.
+
+#### Performance768D1M
+
+Benchmark command:
+
+```bash
+# load
+NUM_PER_BATCH=1000000 python3 -m vectordbbench doris --host 127.0.0.1 --port 9030 --case-type Performance768D1M --db-name Performance768D1M --stream-load-rows-per-batch 500000 --index-prop index_type=ivf,nlist=1024 --skip-search-serial --skip-search-concurrent
+
+# search
+NUM_PER_BATCH=1000000 python3 -m vectordbbench doris --host 127.0.0.1 --port 9030 --case-type Performance768D1M --db-name Performance768D1M --search-concurrent --search-serial --num-concurrency 10,40,80 --stream-load-rows-per-batch 500000 --index-prop index_type=ivf,nlist=1024 --session-var ivf_nprobe=64 --skip-load --skip-drop-old
+```
+
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/hnsw.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/hnsw.md
index bb810b8db3b..87a5481f60f 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/hnsw.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/hnsw.md
@@ -165,7 +165,7 @@ from doris_vector_search import DorisVectorClient,
AuthOptions
auth = AuthOptions(
host="localhost",
- query_port=8030,
+ query_port=9030,
user="root",
password="",
)
@@ -320,7 +320,7 @@ Doris 的 ANN 索引是基于 Meta 开源的
[faiss](https://github.com/facebook
#### Performance768D1M
测试命令
```bash
-NUM_PER_BATCH=1000000 python3.11 -m vectordbbench doris --host 127.0.0.1 --port 9030 --case-type Performance768D1M --db-name Performance768D1M --search-concurrent --search-serial --num-concurrency 10,40,80 --stream-load-rows-per-batch 500000 --index-prop max_degree=128,ef_construction=512 --session-var ef_search=128
+NUM_PER_BATCH=1000000 python3.11 -m vectordbbench doris --host 127.0.0.1 --port 9030 --case-type Performance768D1M --db-name Performance768D1M --search-concurrent --search-serial --num-concurrency 10,40,80 --stream-load-rows-per-batch 500000 --index-prop max_degree=128,ef_construction=512 --session-var hnsw_ef_search=128
```
| | Doris(FE/BE 分离) | Doris(FE/BE 混合) |
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/ivf.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/ivf.md
new file mode 100644
index 00000000000..a8f407d11d2
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/ivf.md
@@ -0,0 +1,349 @@
+---
+{
+ "title": "IVF",
+ "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# IVF 以及如何在 Apache Doris 中使用 IVF 算法的索引
+
+IVF 索引是一种用于近似最近邻(ANN)搜索的高效数据结构。它能在搜索过程中缩小向量搜索范围,显著提高搜索速度。自 Apache Doris 4.x
版本起,已支持基于 IVF 的 ANN 索引。本文档将详细介绍 IVF 算法、关键参数和工程实践,并解释如何在生产环境的 Doris 集群中构建和调优基于
IVF 的 ANN 索引。
+
+## 什么是 IVF 索引?
+
+为便于理解,先介绍一些历史背景。术语 IVF(Inverted File)起源于信息检索领域。
+
+考虑一个简单的文本文档例子。要搜索包含给定单词的文档,**正向索引** 会存储每个文档的单词列表。必须显式读取每个文档才能找到相关文档。
+
+|Document|Words|
+|---|---|
+|Document 1|the,cow,says,moo|
+|Document 2|the,cat,and,the,hat|
+|Document 3|the,dish,ran,away,with,the,spoon|
+
+反过来, **倒排索引**
将包含一个可以搜索的所有单词的字典,对于每个单词,都有一个包含该单词的文档索引列表。这就是倒排列表(倒排文件),它能够将搜索范围限制在选定的列表中。
+
+| Word | Documents                          |
+| ---- | ---------------------------------- |
+| the  | Document 1, Document 2, Document 3 |
+| cow  | Document 1                         |
+| says | Document 1                         |
+| moo  | Document 1                         |
+
+如今,文本数据通常表示为向量嵌入。IVF
方法定义了聚类中心,这些中心类似于前面例子中的单词字典。对于每个聚类中心,都有一个属于该聚类的向量索引列表,搜索速度得以提升,因为只需检查选定的聚类。
+
+
+## 使用 IVF 索引进行高效向量搜索
+
+随着数据集增长到数百万甚至数十亿向量,执行穷举式精确
k-最近邻(kNN)搜索(计算查询向量与数据库中每个向量之间的距离)在计算上变得不可行。这种暴力方法相当于大型矩阵乘法,无法扩展。
+
+幸运的是,许多应用程序可以用少量的准确度换取速度的巨大提升。这就是近似最近邻(ANN) 搜索领域,而倒排文件(IVF) 索引是最广泛使用且有效的 ANN
方法之一。
+
+IVF 的基本原理是"分而治之"。IVF 不是搜索整个数据集,而是智能地将搜索范围缩小到几个有希望的区域,从而大大减少所需的比较次数。
+
+IVF
的工作原理是将大型向量数据集划分为更小、更易管理的聚类,每个聚类由一个称为"质心"的中心点表示。这些质心作为其各自分区的锚点。在搜索过程中,系统快速识别出其质心最接近查询向量的聚类,并仅在这些聚类内进行搜索,而忽略数据集的其余部分。
+
+
+
+
+
+## IVF in Apache Doris
+
+Apache Doris 从 4.x 版本开始支持构建基于 IVF 的 ANN 索引。
+
+### 索引构建
+
+这里使用的索引类型是 ANN。创建 ANN 索引有两种方式:可以在创建表时定义索引,也可以使用 `CREATE/BUILD INDEX`
语法。这两种方法在索引构建的时机和方式上有所不同,因此适用于不同的场景。
+
+方式一:建表时指定在某个向量列上创建索引。随着数据加载,会在每个段创建时为其构建 ANN
索引。优点是数据加载完成后,索引已经构建完毕,查询可以立即使用它进行加速。缺点是由于同步构建索引会减慢数据摄入速度,并且可能在压缩过程中导致额外的索引重建,造成一定的资源浪费。
+
+
+
+```sql
+CREATE TABLE sift_1M (
+ id int NOT NULL,
+ embedding array<float> NOT NULL COMMENT "",
+ INDEX ann_index (embedding) USING ANN PROPERTIES(
+ "index_type"="ivf",
+ "metric_type"="l2_distance",
+ "dim"="128",
+ "nlist"="1024"
+ )
+) ENGINE=OLAP
+DUPLICATE KEY(id) COMMENT "OLAP"
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES (
+ "replication_num" = "1"
+);
+
+INSERT INTO sift_1M
+SELECT *
+FROM S3(
+    "uri" = "https://selectdb-customers-tools-bj.oss-cn-beijing.aliyuncs.com/sift_database.tsv",
+ "format" = "csv");
+```
+
+#### CREATE/BUILD INDEX
+
+方式二:`CREATE/BUILD INDEX`。
+
+```sql
+CREATE TABLE sift_1M (
+ id int NOT NULL,
+ embedding array<float> NOT NULL COMMENT ""
+) ENGINE=OLAP
+DUPLICATE KEY(id) COMMENT "OLAP"
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES (
+ "replication_num" = "1"
+);
+
+INSERT INTO sift_1M
+SELECT *
+FROM S3(
+    "uri" = "https://selectdb-customers-tools-bj.oss-cn-beijing.aliyuncs.com/sift_database.tsv",
+ "format" = "csv");
+```
+
+导入数据后 CREATE INDEX,此时 table 上已经有了 index 的定义,但是没有真正在存量数据上构建索引。
+
+
+```sql
+CREATE INDEX idx_test_ann ON sift_1M (`embedding`) USING ANN PROPERTIES (
+ "index_type"="ivf",
+ "metric_type"="l2_distance",
+ "dim"="128",
+ "nlist"="1024"
+);
+
+SHOW DATA ALL FROM sift_1M;
+
+mysql> SHOW DATA ALL FROM sift_1M;
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| TableName | IndexName | ReplicaCount | RowCount | LocalTotalSize | LocalDataSize | LocalIndexSize | RemoteTotalSize | RemoteDataSize | RemoteIndexSize |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| sift_1M   | sift_1M   | 10           | 1000000  | 170.093 MB     | 170.093 MB    | 0.000          | 0.000           | 0.000          | 0.000           |
+|           | Total     | 10           |          | 170.093 MB     | 170.093 MB    | 0.000          | 0.000           | 0.000          | 0.000           |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+```
+
+然后,就可以使用 `BUILD INDEX` 语句构建索引了:
+
+
+```sql
+BUILD INDEX idx_test_ann ON sift_1M;
+```
+
+BUILD INDEX 是异步执行的,可以通过 `SHOW BUILD INDEX`(部分版本为 `SHOW ALTER`)查看任务的执行状态。
+
+
+```sql
+SHOW BUILD INDEX WHERE TableName = "sift_1M";
+
+mysql> SHOW BUILD INDEX WHERE TableName = "sift_1M";
++---------------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+| JobId         | TableName | PartitionName | AlterInvertedIndexes                                                                                                                                | CreateTime              | FinishTime              | TransactionId | State    | Msg  | Progress |
++---------------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+| 1764392359610 | sift_1M   | sift_1M       | [ADD INDEX idx_test_ann (`embedding`) USING ANN PROPERTIES("dim" = "128", "index_type" = "ivf", "metric_type" = "l2_distance", "nlist" = "1024")],  | 2025-12-01 14:18:22.360 | 2025-12-01 14:18:27.885 | 5036          | FINISHED |      | NULL     |
++---------------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+1 row in set (0.00 sec)
+
+mysql> SHOW DATA ALL FROM sift_1M;
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| TableName | IndexName | ReplicaCount | RowCount | LocalTotalSize | LocalDataSize | LocalIndexSize | RemoteTotalSize | RemoteDataSize | RemoteIndexSize |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| sift_1M   | sift_1M   | 10           | 1000000  | 671.084 MB     | 170.093 MB    | 500.991 MB     | 0.000           | 0.000          | 0.000           |
+|           | Total     | 10           |          | 671.084 MB     | 170.093 MB    | 500.991 MB     | 0.000           | 0.000          | 0.000           |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+2 rows in set (0.00 sec)
+```
+
+#### DROP INDEX
+
+同样可以通过 `ALTER TABLE sift_1M DROP INDEX idx_test_ann` 来删除不合适的 ANN 索引。DROP INDEX 通常发生在索引超参数调优阶段:为了达到目标召回率,需要测试不同的参数组合,因此需要灵活的索引管理。
+
+
+### 进行查询
+
+ANN 索引支持对 Top-N search 和 range search 进行加速。
+
+当向量列是高维向量时,用于描述查询向量的字符串本身会引入额外的解析开销,因此不建议在生产环境中,尤其是高并发场景里,直接使用原始 SQL
执行向量搜索查询。使用 prepare statement 来提前对 sql 进行解析是一个能够提高查询性能的做法,所以建议使用 doris 的向量搜索
[python library](https://github.com/uchenily/doris_vector_search),在这个 python
library 里面封装了基于 prepare statement 对 doris 进行向量搜索的必要的操作,并且集成了相关的数据转化流程,可以直接将
doris 的查询结果转为 pandas 的 DataFrame,方便用户基于 doris 开发 AI 应用。
+
+
+```python
+from doris_vector_search import DorisVectorClient, AuthOptions
+
+auth = AuthOptions(
+ host="127.0.0.1",
+ query_port=9030,
+ user="root",
+ password="",
+)
+
+client = DorisVectorClient(database="test", auth_options=auth)
+
+tbl = client.open_table("sift_1M")
+
+query = [0.1] * 128 # Example 128-dimensional vector
+
+# SELECT id FROM sift_1M ORDER BY l2_distance_approximate(embedding, query) LIMIT 10;
+result = tbl.search(query, metric_type="l2_distance").limit(10).select(["id"]).to_pandas()
+
+print(result)
+```
+
+上面的 python 脚本执行结果为:
+
+
+```text
+ id
+0 123911
+1 926855
+2 123739
+3 73311
+4 124493
+5 153178
+6 126138
+7 123740
+8 125741
+9 124048
+```
+
+
+### 召回率优化
+
+向量搜索场景里面最重要的指标是召回率,一切性能数据只有在满足一定的召回率的前提下才有意义。影响召回率的因素主要包括:
+1. IVF 的索引阶段参数(nlist)和查询阶段参数(nprobe)
+2. 索引向量量化
+3. segment 的大小与数量
+
+这篇文章里我们将会讨论 1,3 对于召回率的影响,关于向量量化会在其他的文章里进行介绍。
+
+#### 索引超参数
+
+IVF 索引将向量组织到多个聚类中。在索引构建过程中,使用聚类算法将向量分组。然后,搜索过程仅聚焦于最相关的聚类。工作流程大致如下:
+
+
+索引构建阶段:
+
+1. 聚类:使用聚类算法(例如 k‑means)将所有向量划分为 `nlist` 个聚类。计算并存储每个聚类的质心。
+2. 向量分配:每个向量被分配到与其质心最接近的聚类,并将该向量添加到该聚类的倒排列表中。
+
+查询阶段:
+
+1. 使用 `nprobe` 选择聚类:对于查询向量,计算到所有 `nlist` 个质心的距离。仅选择 `nprobe` 个最近的聚类进行搜索。
+2. 在选定聚类内进行穷举搜索:将查询与选定 `nprobe` 个聚类中的每个向量进行比较,以找到最近邻。
+
+总之:
+
+`nlist` 定义了聚类的数量(倒排列表的数量)。它影响召回率、内存开销和构建时间。较大的 `nlist`
会创建更细粒度的聚类,这可以提高搜索速度,但同时也会增加聚类成本和邻居分散在多个聚类中的风险。
+
+`nprobe` 定义了查询阶段要搜索的聚类数量。较大的 `nprobe` 会提高召回率和查询延迟(需要检查更多的向量)。较小的 `nprobe`
使查询更快,但可能会遗漏位于未探测聚类中的邻居。
+
+Doris 默认的 `nlist` 为 1024, 默认的 `nprobe` 为 64。
+
+
+上述测试都是对这两个超参数定性的分析,通过实际实验,在 SIFT_1M 数据集上有如下的测试结果:
+
+
+| nlist | nprobe | recall_at_100 |
+| ----- | ------ | ------------- |
+| 1024 | 64 | 0.9542 |
+| 1024 | 32 | 0.9034 |
+| 1024 | 16 | 0.8299 |
+| 1024 | 8 | 0.7337 |
+| 512 | 32 | 0.9384 |
+| 512 | 16 | 0.8763 |
+| 512 | 8 | 0.7869 |
+
+
+虽然很难事先给出超参数的具体取值,但是我们可以给出一个关于如何选取超参数的实践方法:
+1. 建立一张无索引的表 table_multi_index,table_multi_index 可以有 2 或者 3 个向量列
+2. 通过 stream load 等方式将数据导入到无索引的 table_multi_index
+3. 通过 `CREATE INDEX` 和 `BUILD INDEX` 在所有的向量列上构建索引
+4. 不同的列选择不同的索引参数,等索引构建完成后在不同的列上计算召回率,找到最合适的超参数组合
+
+示例:
+
+```sql
+ALTER TABLE tbl DROP INDEX idx_embedding;
+CREATE INDEX idx_embedding ON tbl (`embedding`) USING ANN PROPERTIES (
+ "index_type"="ivf",
+ "metric_type"="inner_product",
+ "dim"="768",
+ "nlist"="1024"
+);
+BUILD INDEX idx_embedding ON tbl;
+```
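评估召回率时,可以用精确距离全量排序得到的 top-k 作为 ground truth,与 ANN 返回的 top-k 求交集。下面是一个示意性的 recall@k 计算脚本(其中 `approx` 在实际评估中应替换为 ANN 索引的查询结果,这里仅为演示而直接复用精确结果):

```python
import numpy as np

def recall_at_k(ann_ids, exact_ids, k):
    # recall@k = |ANN top-k 与精确 top-k 的交集| / k
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k

rng = np.random.default_rng(42)
base = rng.random((1000, 32), dtype=np.float32)
query = rng.random(32, dtype=np.float32)

# ground truth:暴力计算 L2 距离并取 top-10
exact = np.linalg.norm(base - query, axis=1).argsort()[:10]

# 实际评估中 approx 来自 ANN 查询,例如:
# SELECT id FROM tbl ORDER BY l2_distance_approximate(embedding, ...) LIMIT 10;
approx = exact.copy()

print(recall_at_k(approx, exact, 10))
```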
+
+#### 索引覆盖的行数
+
+Doris 内表的数据是分层组织的。最高层级的概念是 Table,Table 按照分桶键把原始数据尽可能均匀地分布到 N 个 tablet 里面,tablet 是用来进行数据迁移与 rebalance 的基本单位。每次导入或者 compaction 会在 tablet 下新增一个 rowset,rowset 是进行版本管理的单位,其本身只是代表一组具有版本号的数据,这组数据真正存储在 segment 文件里。
+
+
+与倒排索引一样,向量索引也作用在 segment 粒度上。segment 本身的大小取决于 BE 配置中的 write_buffer_size 和 vertical_compaction_max_segment_size。在导入和 compaction 过程中,当内存中 memtable 积累到一定大小后就会下刷生成一个 segment 文件,并为该 segment 构造一个向量索引(如果有多个索引列那么就有多个索引),该索引能够覆盖的范围就是这个 segment 中对应列的行数。根据前面对 IVF 算法搜索与构建过程的介绍,对于某组索引参数,其能够有效覆盖的数据范围是有限的,当数据量超过某个阈值后,召回率就无法满足要求。
+
+
+> 通过 `SHOW TABLETS FROM table` 可以看到某张表的 Compaction 状态,点开对应的 URL 可以看到这张表有多少个 segment。
+
+
+#### Compaction 对召回率的影响
+
+Compaction 之所以会影响召回率是因为 compaction 有时会生成更大的 segment,导致原先的索引超参数无法在新的更大的 segment
上保障覆盖率。因此建议在 `BUILD INDEX` 之前触发一次 FULL COMPACTION,在充分合并过的 segment
上构建索引不光可以保持召回率稳定,还可以减少索引构建引入的写放大。
+
+### 查询性能
+#### 索引文件的冷加载
+
+Doris 的 ANN 索引是基于 Meta 开源的 [faiss](https://github.com/facebookresearch/faiss) 实现的。IVF 索引需要完整加载进内存后才能进行查询加速,因此建议在高并发查询之前先进行一次冷查询,确保涉及到的 segment 的索引文件全部加载进了内存,否则查询性能会受到较大影响。
+
+#### 内存空间与性能
+
+**IVF 索引(无量化压缩)占用的内存空间近似等于其所能检索的向量的内存大小的 1.02 倍**。
+
+比如对于 128 维,1M 的数据集,IVF FLAT 索引需要的内存空间大约为 128 * 4 * 1000000 * 1.02 约等于 500 MB。
+
+一些参考值:
+
+| dim | rows | estimated memory |
+|-----|------|------------------|
+| 128 | 1M | 496 MB |
+| 768 | 1M | 2.9 GB |
+
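上述估算可以写成一个简单的辅助函数(假设向量为 float32,每维 4 字节,额外开销按 1.02 计,结果与上表的参考值接近):

```python
def ivf_flat_memory_bytes(dim, rows, overhead=1.02):
    # IVF-FLAT(无量化):原始 float32 向量(每维 4 字节)乘以约 1.02 的开销系数
    return int(dim * 4 * rows * overhead)

print(round(ivf_flat_memory_bytes(128, 1_000_000) / 1024**2))     # 约 498 MB
print(round(ivf_flat_memory_bytes(768, 1_000_000) / 1024**3, 1))  # 约 2.9 GB
```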
+为了保证查询性能,需要 BE 有足够的内存空间,否则索引的频繁 IO 会导致查询性能大幅衰减。
+
+
+### Benchmark
+
+进行 Benchmark 时应按照生产环境部署模式:FE 与 BE 分开部署,客户端在另一台独立的机器上运行。
+
+测试框架可以使用 [VectorDBBench](https://github.com/zilliztech/VectorDBBench)。
+
+#### Performance768D1M
+
+压测命令:
+
+```bash
+# load
+NUM_PER_BATCH=1000000 python3 -m vectordbbench doris --host 127.0.0.1 --port
9030 --case-type Performance768D1M --db-name Performance768D1M
--stream-load-rows-per-batch 500000 --index-prop index_type=ivf,nlist=1024
--skip-search-serial --skip-search-concurrent
+
+# search
+NUM_PER_BATCH=1000000 python3 -m vectordbbench doris --host 127.0.0.1 --port
9030 --case-type Performance768D1M --db-name Performance768D1M
--search-concurrent --search-serial --num-concurrency 10,40,80
--stream-load-rows-per-batch 500000 --index-prop index_type=ivf,nlist=1024
--session-var ivf_nprobe=64 --skip-load --skip-drop-old
+```
+
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/hnsw.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/hnsw.md
index bb810b8db3b..87a5481f60f 100644
---
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/hnsw.md
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/hnsw.md
@@ -165,7 +165,7 @@ from doris_vector_search import DorisVectorClient,
AuthOptions
auth = AuthOptions(
host="localhost",
- query_port=8030,
+ query_port=9030,
user="root",
password="",
)
@@ -320,7 +320,7 @@ Doris 的 ANN 索引是基于 Meta 开源的
[faiss](https://github.com/facebook
#### Performance768D1M
测试命令
```bash
-NUM_PER_BATCH=1000000 python3.11 -m vectordbbench doris --host 127.0.0.1
--port 9030 --case-type Performance768D1M --db-name Performance768D1M
--search-concurrent --search-serial --num-concurrency 10,40,80
--stream-load-rows-per-batch 500000 --index-prop
max_degree=128,ef_construction=512 --session-var ef_search=128
+NUM_PER_BATCH=1000000 python3.11 -m vectordbbench doris --host 127.0.0.1
--port 9030 --case-type Performance768D1M --db-name Performance768D1M
--search-concurrent --search-serial --num-concurrency 10,40,80
--stream-load-rows-per-batch 500000 --index-prop
max_degree=128,ef_construction=512 --session-var hnsw_ef_search=128
```
| | Doris(FE/BE 分离) | Doris(FE/BE 混合) |
diff --git
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/ivf.md
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/ivf.md
new file mode 100644
index 00000000000..a8f407d11d2
--- /dev/null
+++
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/ivf.md
@@ -0,0 +1,349 @@
+---
+{
+ "title": "IVF",
+ "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# IVF 以及如何在 Apache Doris 中使用 IVF 算法的索引
+
+IVF 索引是一种用于近似最近邻(ANN)搜索的高效数据结构。它能在搜索过程中缩小向量搜索范围,显著提高搜索速度。自 Apache Doris 4.x
版本起,已支持基于 IVF 的 ANN 索引。本文档将详细介绍 IVF 算法、关键参数和工程实践,并解释如何在生产环境的 Doris 集群中构建和调优基于
IVF 的 ANN 索引。
+
+## 什么是 IVF 索引?
+
+为便于理解,先介绍一些历史背景。术语 IVF(Inverted File)起源于信息检索领域。
+
+考虑一个简单的文本文档例子。要搜索包含给定单词的文档,**正向索引** 会存储每个文档的单词列表。必须显式读取每个文档才能找到相关文档。
+
+|Document|Words|
+|---|---|
+|Document 1|the,cow,says,moo|
+|Document 2|the,cat,and,the,hat|
+|Document 3|the,dish,ran,away,with,the,spoon|
+
+反过来,**倒排索引** 包含一个可供搜索的所有单词的字典;对于每个单词,都有一个包含该单词的文档索引列表。这就是倒排列表(倒排文件),它能够将搜索范围限制在选定的列表中。
+
+| Word | Documents                          |
+| ---- | ---------------------------------- |
+| the  | Document 1, Document 2, Document 3 |
+| cow  | Document 1                         |
+| says | Document 1                         |
+| moo  | Document 1                         |
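上面的正排与倒排结构可以用几行 Python 直观表示(仅为概念示意):

```python
from collections import defaultdict

# 正排索引:每个文档 -> 它包含的单词列表
docs = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

# 倒排索引:每个单词 -> 包含它的文档列表
inverted = defaultdict(list)
for doc, words in docs.items():
    for word in dict.fromkeys(words):  # 去重并保持出现顺序
        inverted[word].append(doc)

print(inverted["the"])  # ['Document 1', 'Document 2', 'Document 3']
print(inverted["moo"])  # ['Document 1']
```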
+
+如今,文本数据通常表示为向量嵌入。IVF
方法定义了聚类中心,这些中心类似于前面例子中的单词字典。对于每个聚类中心,都有一个属于该聚类的向量索引列表,搜索速度得以提升,因为只需检查选定的聚类。
+
+
+## 使用 IVF 索引进行高效向量搜索
+
+随着数据集增长到数百万甚至数十亿向量,执行穷举式精确
k-最近邻(kNN)搜索(计算查询向量与数据库中每个向量之间的距离)在计算上变得不可行。这种暴力方法相当于大型矩阵乘法,无法扩展。
+
+幸运的是,许多应用程序可以用少量的准确度换取速度的巨大提升。这就是近似最近邻(ANN) 搜索领域,而倒排文件(IVF) 索引是最广泛使用且有效的 ANN
方法之一。
+
+IVF 的基本原理是"分而治之"。IVF 不是搜索整个数据集,而是智能地将搜索范围缩小到几个有希望的区域,从而大大减少所需的比较次数。
+
+IVF
的工作原理是将大型向量数据集划分为更小、更易管理的聚类,每个聚类由一个称为"质心"的中心点表示。这些质心作为其各自分区的锚点。在搜索过程中,系统快速识别出其质心最接近查询向量的聚类,并仅在这些聚类内进行搜索,而忽略数据集的其余部分。
+
+
+
+
+
+## IVF in Apache Doris
+
+Apache Doris 从 4.x 版本开始支持构建基于 IVF 的 ANN 索引。
+
+### 索引构建
+
+这里使用的索引类型是 ANN。创建 ANN 索引有两种方式:可以在创建表时定义索引,也可以使用 `CREATE/BUILD INDEX`
语法。这两种方法在索引构建的时机和方式上有所不同,因此适用于不同的场景。
+
+方式一:建表时指定在某个向量列上创建索引。随着数据加载,会在每个段创建时为其构建 ANN
索引。优点是数据加载完成后,索引已经构建完毕,查询可以立即使用它进行加速。缺点是由于同步构建索引会减慢数据摄入速度,并且可能在压缩过程中导致额外的索引重建,造成一定的资源浪费。
+
+
+
+```sql
+CREATE TABLE sift_1M (
+ id int NOT NULL,
+ embedding array<float> NOT NULL COMMENT "",
+ INDEX ann_index (embedding) USING ANN PROPERTIES(
+ "index_type"="ivf",
+ "metric_type"="l2_distance",
+ "dim"="128",
+ "nlist"="1024"
+ )
+) ENGINE=OLAP
+DUPLICATE KEY(id) COMMENT "OLAP"
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES (
+ "replication_num" = "1"
+);
+
+INSERT INTO sift_1M
+SELECT *
+FROM S3(
+ "uri" =
"https://selectdb-customers-tools-bj.oss-cn-beijing.aliyuncs.com/sift_database.tsv",
+ "format" = "csv");
+```
+
+#### CREATE/BUILD INDEX
+
+方式二:`CREATE/BUILD INDEX`。
+
+```sql
+CREATE TABLE sift_1M (
+ id int NOT NULL,
+ embedding array<float> NOT NULL COMMENT ""
+) ENGINE=OLAP
+DUPLICATE KEY(id) COMMENT "OLAP"
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES (
+ "replication_num" = "1"
+);
+
+INSERT INTO sift_1M
+SELECT *
+FROM S3(
+ "uri" =
"https://selectdb-customers-tools-bj.oss-cn-beijing.aliyuncs.com/sift_database.tsv",
+ "format" = "csv");
+```
+
+导入数据后 CREATE INDEX,此时 table 上已经有了 index 的定义,但是没有真正在存量数据上构建索引。
+
+
+```sql
+CREATE INDEX idx_test_ann ON sift_1M (`embedding`) USING ANN PROPERTIES (
+ "index_type"="ivf",
+ "metric_type"="l2_distance",
+ "dim"="128",
+ "nlist"="1024"
+);
+
+SHOW DATA ALL FROM sift_1M;
+
+mysql> SHOW DATA ALL FROM sift_1M;
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| TableName | IndexName | ReplicaCount | RowCount | LocalTotalSize |
LocalDataSize | LocalIndexSize | RemoteTotalSize | RemoteDataSize |
RemoteIndexSize |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| sift_1M | sift_1M | 10 | 1000000 | 170.093 MB | 170.093
MB | 0.000 | 0.000 | 0.000 | 0.000 |
+| | Total | 10 | | 170.093 MB | 170.093
MB | 0.000 | 0.000 | 0.000 | 0.000 |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+```
+
+然后,就可以使用 `BUILD INDEX` 语句构建索引了:
+
+
+```sql
+BUILD INDEX idx_test_ann ON sift_1M;
+```
+
+BUILD INDEX 是异步执行的,需要通过 `SHOW BUILD INDEX`(部分版本为 `SHOW ALTER`)来查看任务的执行状态。
+
+
+```sql
+SHOW BUILD INDEX WHERE TableName = "sift_1M";
+
+mysql> SHOW BUILD INDEX WHERE TableName = "sift_1M";
++---------------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+| JobId | TableName | PartitionName | AlterInvertedIndexes
| CreateTime | FinishTime
| TransactionId | State | Msg | Progress |
++---------------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+| 1764392359610 | sift_1M | sift_1M | [ADD INDEX idx_test_ann
(`embedding`) USING ANN PROPERTIES("dim" = "128", "index_type" = "ivf",
"metric_type" = "l2_distance", "nlist" = "1024")], | 2025-12-01 14:18:22.360 |
2025-12-01 14:18:27.885 | 5036 | FINISHED | | NULL |
++---------------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+1 row in set (0.00 sec)
+
+mysql> SHOW DATA ALL FROM sift_1M;
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| TableName | IndexName | ReplicaCount | RowCount | LocalTotalSize |
LocalDataSize | LocalIndexSize | RemoteTotalSize | RemoteDataSize |
RemoteIndexSize |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| sift_1M | sift_1M | 10 | 1000000 | 671.084 MB | 170.093
MB | 500.991 MB | 0.000 | 0.000 | 0.000 |
+| | Total | 10 | | 671.084 MB | 170.093
MB | 500.991 MB | 0.000 | 0.000 | 0.000 |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+2 rows in set (0.00 sec)
+```
+
+#### DROP INDEX
+
+同样可以通过 `ALTER TABLE sift_1M DROP INDEX idx_test_ann` 来删除不合适的 Ann 索引。DROP INDEX
通常发生在索引的超参数调优阶段,为了确保足够的召回率需要测试不同的参数组合,需要灵活的索引管理。
+
+
+### 进行查询
+
+ANN 索引支持加速 Top-N search 和 range search。
+
+当向量列是高维向量时,用于描述查询向量的字符串本身会引入额外的解析开销,因此不建议在生产环境(尤其是高并发场景)中直接使用原始 SQL 执行向量搜索查询。使用 prepared statement 提前解析 SQL 可以有效提高查询性能。建议使用 Doris 的向量搜索 [python library](https://github.com/uchenily/doris_vector_search),它封装了基于 prepared statement 对 Doris 进行向量搜索的必要操作,并集成了相关的数据转换流程,可以直接将 Doris 的查询结果转为 pandas 的 DataFrame,方便用户基于 Doris 开发 AI 应用。
+
+
+```python
+from doris_vector_search import DorisVectorClient, AuthOptions
+
+auth = AuthOptions(
+ host="127.0.0.1",
+ query_port=9030,
+ user="root",
+ password="",
+)
+
+client = DorisVectorClient(database="test", auth_options=auth)
+
+tbl = client.open_table("sift_1M")
+
+query = [0.1] * 128 # Example 128-dimensional vector
+
+# SELECT id FROM sift_1M ORDER BY l2_distance_approximate(embedding, query)
LIMIT 10;
+result = tbl.search(query,
metric_type="l2_distance").limit(10).select(["id"]).to_pandas()
+
+print(result)
+```
+
+上面的 python 脚本执行结果为:
+
+
+```text
+ id
+0 123911
+1 926855
+2 123739
+3 73311
+4 124493
+5 153178
+6 126138
+7 123740
+8 125741
+9 124048
+```
+
+
+### 召回率优化
+
+向量搜索场景里面最重要的指标是召回率,一切性能数据只有在满足一定的召回率的前提下才有意义。影响召回率的因素主要包括:
+1. IVF 的索引阶段参数(nlist)和查询阶段参数(nprobe)
+2. 索引向量量化
+3. segment 的大小与数量
+
+这篇文章里我们将会讨论 1,3 对于召回率的影响,关于向量量化会在其他的文章里进行介绍。
+
+#### 索引超参数
+
+IVF 索引将向量组织到多个聚类中。在索引构建过程中,使用聚类算法将向量分组。然后,搜索过程仅聚焦于最相关的聚类。工作流程大致如下:
+
+
+索引构建阶段:
+
+1. 聚类:使用聚类算法(例如 k‑means)将所有向量划分为 `nlist` 个聚类。计算并存储每个聚类的质心。
+2. 向量分配:每个向量被分配到与其质心最接近的聚类,并将该向量添加到该聚类的倒排列表中。
+
+查询阶段:
+
+1. 使用 `nprobe` 选择聚类:对于查询向量,计算到所有 `nlist` 个质心的距离。仅选择 `nprobe` 个最近的聚类进行搜索。
+2. 在选定聚类内进行穷举搜索:将查询与选定 `nprobe` 个聚类中的每个向量进行比较,以找到最近邻。
+
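上述构建与查询流程可以用一段简化的 Python 草图来直观说明(仅为概念示意,用 numpy 实现了最小化的 k-means 与 `nprobe` 探测,数据与参数均为随意假设,并非 Doris/faiss 的实际实现):

```python
import numpy as np

rng = np.random.default_rng(0)

def assign_to_centroids(vectors, centroids):
    # 计算每个向量到所有质心的距离,返回最近质心的编号
    d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def build_ivf(vectors, nlist, iters=10):
    # 索引构建:k-means 聚类得到 nlist 个质心,并为每个聚类维护倒排列表
    centroids = vectors[rng.choice(len(vectors), nlist, replace=False)].copy()
    for _ in range(iters):
        assign = assign_to_centroids(vectors, centroids)
        for c in range(nlist):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = assign_to_centroids(vectors, centroids)
    inverted_lists = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, inverted_lists

def ivf_search(query, vectors, centroids, inverted_lists, nprobe, k):
    # 查询:仅在 nprobe 个最近的聚类内做穷举搜索
    order = np.linalg.norm(centroids - query, axis=1).argsort()
    cand = np.concatenate([inverted_lists[c] for c in order[:nprobe]])
    d = np.linalg.norm(vectors[cand] - query, axis=1)
    return cand[d.argsort()[:k]]

vectors = rng.random((2000, 16), dtype=np.float32)
query = vectors[0]
centroids, lists = build_ivf(vectors, nlist=32)
top = ivf_search(query, vectors, centroids, lists, nprobe=8, k=5)
print(top)  # top[0] == 0:查询向量本身是自己的最近邻
```

可以看到,`nprobe` 越大,纳入穷举搜索的候选向量越多,召回率越高,但需要检查的向量数也随之增加。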
+总之:
+
+`nlist` 定义了聚类的数量(倒排列表的数量)。它影响召回率、内存开销和构建时间。较大的 `nlist`
会创建更细粒度的聚类,这可以提高搜索速度,但同时也会增加聚类成本和邻居分散在多个聚类中的风险。
+
+`nprobe` 定义了查询阶段要搜索的聚类数量。较大的 `nprobe` 会提高召回率和查询延迟(需要检查更多的向量)。较小的 `nprobe`
使查询更快,但可能会遗漏位于未探测聚类中的邻居。
+
+Doris 默认的 `nlist` 为 1024,默认的 `nprobe` 为 64。
+
+以上是对这两个超参数的定性分析;通过实际实验,在 SIFT_1M 数据集上有如下测试结果:
+
+
+| nlist | nprobe | recall_at_100 |
+| ----- | ------ | ------------- |
+| 1024 | 64 | 0.9542 |
+| 1024 | 32 | 0.9034 |
+| 1024 | 16 | 0.8299 |
+| 1024 | 8 | 0.7337 |
+| 512 | 32 | 0.9384 |
+| 512 | 16 | 0.8763 |
+| 512 | 8 | 0.7869 |
+
+
+虽然很难事先给出超参数的具体取值,但是我们可以给出一个关于如何选取超参数的实践方法:
+1. 建立一张无索引的表 table_multi_index,table_multi_index 可以有 2 或者 3 个向量列
+2. 通过 stream load 等方式将数据导入到无索引的 table_multi_index
+3. 通过 `CREATE INDEX` 和 `BUILD INDEX` 在所有的向量列上构建索引
+4. 不同的列选择不同的索引参数,等索引构建完成后在不同的列上计算召回率,找到最合适的超参数组合
+
+示例:
+
+```sql
+ALTER TABLE tbl DROP INDEX idx_embedding;
+CREATE INDEX idx_embedding ON tbl (`embedding`) USING ANN PROPERTIES (
+ "index_type"="ivf",
+ "metric_type"="inner_product",
+ "dim"="768",
+ "nlist"="1024"
+);
+BUILD INDEX idx_embedding ON tbl;
+```
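评估召回率时,可以用精确距离全量排序得到的 top-k 作为 ground truth,与 ANN 返回的 top-k 求交集。下面是一个示意性的 recall@k 计算脚本(其中 `approx` 在实际评估中应替换为 ANN 索引的查询结果,这里仅为演示而直接复用精确结果):

```python
import numpy as np

def recall_at_k(ann_ids, exact_ids, k):
    # recall@k = |ANN top-k 与精确 top-k 的交集| / k
    return len(set(ann_ids[:k]) & set(exact_ids[:k])) / k

rng = np.random.default_rng(42)
base = rng.random((1000, 32), dtype=np.float32)
query = rng.random(32, dtype=np.float32)

# ground truth:暴力计算 L2 距离并取 top-10
exact = np.linalg.norm(base - query, axis=1).argsort()[:10]

# 实际评估中 approx 来自 ANN 查询,例如:
# SELECT id FROM tbl ORDER BY l2_distance_approximate(embedding, ...) LIMIT 10;
approx = exact.copy()

print(recall_at_k(approx, exact, 10))
```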
+
+#### 索引覆盖的行数
+
+Doris 内表的数据是分层组织的。最高层级的概念是 Table,Table 按照分桶键把原始数据尽可能均匀地分布到 N 个 tablet 里面,tablet 是用来进行数据迁移与 rebalance 的基本单位。每次导入或者 compaction 会在 tablet 下新增一个 rowset,rowset 是进行版本管理的单位,其本身只是代表一组具有版本号的数据,这组数据真正存储在 segment 文件里。
+
+
+与倒排索引一样,向量索引也作用在 segment 粒度上。segment 本身的大小取决于 BE 配置中的 write_buffer_size 和 vertical_compaction_max_segment_size。在导入和 compaction 过程中,当内存中 memtable 积累到一定大小后就会下刷生成一个 segment 文件,并为该 segment 构造一个向量索引(如果有多个索引列那么就有多个索引),该索引能够覆盖的范围就是这个 segment 中对应列的行数。根据前面对 IVF 算法搜索与构建过程的介绍,对于某组索引参数,其能够有效覆盖的数据范围是有限的,当数据量超过某个阈值后,召回率就无法满足要求。
+
+
+> 通过 `SHOW TABLETS FROM table` 可以看到某张表的 Compaction 状态,点开对应的 URL 可以看到这张表有多少个 segment。
+
+
+#### Compaction 对召回率的影响
+
+Compaction 之所以会影响召回率是因为 compaction 有时会生成更大的 segment,导致原先的索引超参数无法在新的更大的 segment
上保障覆盖率。因此建议在 `BUILD INDEX` 之前触发一次 FULL COMPACTION,在充分合并过的 segment
上构建索引不光可以保持召回率稳定,还可以减少索引构建引入的写放大。
+
+### 查询性能
+#### 索引文件的冷加载
+
+Doris 的 ANN 索引是基于 Meta 开源的 [faiss](https://github.com/facebookresearch/faiss) 实现的。IVF 索引需要完整加载进内存后才能进行查询加速,因此建议在高并发查询之前先进行一次冷查询,确保涉及到的 segment 的索引文件全部加载进了内存,否则查询性能会受到较大影响。
+
+#### 内存空间与性能
+
+**IVF 索引(无量化压缩)占用的内存空间近似等于其所能检索的向量的内存大小的 1.02 倍**。
+
+比如对于 128 维,1M 的数据集,IVF FLAT 索引需要的内存空间大约为 128 * 4 * 1000000 * 1.02 约等于 500 MB。
+
+一些参考值:
+
+| dim | rows | estimated memory |
+|-----|------|------------------|
+| 128 | 1M | 496 MB |
+| 768 | 1M | 2.9 GB |
+
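上述估算可以写成一个简单的辅助函数(假设向量为 float32,每维 4 字节,额外开销按 1.02 计,结果与上表的参考值接近):

```python
def ivf_flat_memory_bytes(dim, rows, overhead=1.02):
    # IVF-FLAT(无量化):原始 float32 向量(每维 4 字节)乘以约 1.02 的开销系数
    return int(dim * 4 * rows * overhead)

print(round(ivf_flat_memory_bytes(128, 1_000_000) / 1024**2))     # 约 498 MB
print(round(ivf_flat_memory_bytes(768, 1_000_000) / 1024**3, 1))  # 约 2.9 GB
```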
+为了保证查询性能,需要 BE 有足够的内存空间,否则索引的频繁 IO 会导致查询性能大幅衰减。
+
+
+### Benchmark
+
+进行 Benchmark 时应按照生产环境部署模式:FE 与 BE 分开部署,客户端在另一台独立的机器上运行。
+
+测试框架可以使用 [VectorDBBench](https://github.com/zilliztech/VectorDBBench)。
+
+#### Performance768D1M
+
+压测命令:
+
+```bash
+# load
+NUM_PER_BATCH=1000000 python3 -m vectordbbench doris --host 127.0.0.1 --port
9030 --case-type Performance768D1M --db-name Performance768D1M
--stream-load-rows-per-batch 500000 --index-prop index_type=ivf,nlist=1024
--skip-search-serial --skip-search-concurrent
+
+# search
+NUM_PER_BATCH=1000000 python3 -m vectordbbench doris --host 127.0.0.1 --port
9030 --case-type Performance768D1M --db-name Performance768D1M
--search-concurrent --search-serial --num-concurrency 10,40,80
--stream-load-rows-per-batch 500000 --index-prop index_type=ivf,nlist=1024
--session-var ivf_nprobe=64 --skip-load --skip-drop-old
+```
+
diff --git a/static/images/vector-search/dataset-points-query-clusters.png
b/static/images/vector-search/dataset-points-query-clusters.png
new file mode 100644
index 00000000000..e83a6e09f55
Binary files /dev/null and
b/static/images/vector-search/dataset-points-query-clusters.png differ
diff --git a/versioned_docs/version-4.x/ai/vector-search/hnsw.md
b/versioned_docs/version-4.x/ai/vector-search/hnsw.md
index 91e9c5266c3..4701bad4d29 100644
--- a/versioned_docs/version-4.x/ai/vector-search/hnsw.md
+++ b/versioned_docs/version-4.x/ai/vector-search/hnsw.md
@@ -189,7 +189,7 @@ from doris_vector_search import DorisVectorClient,
AuthOptions
auth = AuthOptions(
host="localhost",
- query_port=8030,
+ query_port=9030,
user="root",
password="",
)
@@ -376,7 +376,7 @@ The load generator runs on another 16‑core machine.
Benchmark command:
```bash
-NUM_PER_BATCH=1000000 python3.11 -m vectordbbench doris --host 127.0.0.1
--port 9030 --case-type Performance768D1M --db-name Performance768D1M
--search-concurrent --search-serial --num-concurrency 10,40,80
--stream-load-rows-per-batch 500000 --index-prop
max_degree=128,ef_construction=512 --session-var ef_search=128
+NUM_PER_BATCH=1000000 python3.11 -m vectordbbench doris --host 127.0.0.1
--port 9030 --case-type Performance768D1M --db-name Performance768D1M
--search-concurrent --search-serial --num-concurrency 10,40,80
--stream-load-rows-per-batch 500000 --index-prop
max_degree=128,ef_construction=512 --session-var hnsw_ef_search=128
```
| | Doris (FE/BE separate) | Doris (FE/BE mixed) |
diff --git a/versioned_docs/version-4.x/ai/vector-search/ivf.md
b/versioned_docs/version-4.x/ai/vector-search/ivf.md
new file mode 100644
index 00000000000..71a009e209f
--- /dev/null
+++ b/versioned_docs/version-4.x/ai/vector-search/ivf.md
@@ -0,0 +1,366 @@
+---
+{
+ "title": "IVF",
+ "language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# IVF and How to Use It in Apache Doris
+
+
+An IVF index is an efficient data structure for Approximate Nearest Neighbor (ANN) search. It narrows the set of vectors that must be examined during a search, significantly improving search speed. Apache Doris supports IVF-based ANN indexes starting from version 4.x. This document walks through the IVF algorithm, its key parameters, and engineering practices, and explains how to build and tune IVF-based ANN indexes in production Doris clusters.
+
+## What is an IVF index?
+
+For completeness, here’s some historical context. The term IVF (inverted file)
originates from information retrieval.
+
+Consider a simple example of a few text documents. To search documents that
contain a given word, a **forward index** stores a list of words for each
document. You must read each document explicitly to find the relevant ones.
+
+
+|Document|Words|
+|---|---|
+|Document 1|the,cow,says,moo|
+|Document 2|the,cat,and,the,hat|
+|Document 3|the,dish,ran,away,with,the,spoon|
+
+In contrast, an **inverted index** would contain a dictionary of all the words
that you can search, and for each word, you have a list of document indices
where the word occurs. This is the inverted list (inverted file), and it
enables you to restrict the search to the selected lists.
+
+
+| Word | Documents                          |
+| ---- | ---------------------------------- |
+| the  | Document 1, Document 2, Document 3 |
+| cow  | Document 1                         |
+| says | Document 1                         |
+| moo  | Document 1                         |
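The forward and inverted structures above can be illustrated in a few lines of Python (a conceptual sketch only):

```python
from collections import defaultdict

# forward index: each document -> the words it contains
docs = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

# inverted index: each word -> the documents that contain it
inverted = defaultdict(list)
for doc, words in docs.items():
    for word in dict.fromkeys(words):  # de-duplicate while keeping order
        inverted[word].append(doc)

print(inverted["the"])  # ['Document 1', 'Document 2', 'Document 3']
print(inverted["moo"])  # ['Document 1']
```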
+
+
+Today, text data is often represented as vector embeddings. The IVF method
defines cluster centers and these centers are analogous to the dictionary of
words in the preceding example. For each cluster center, you have a list of
vector indices that belong to the cluster, and search is accelerated because
you only have to inspect the selected clusters.
+
+
+## Using IVF indexes for efficient vector search
+
+As datasets grow to millions or even billions of vectors, performing an exhaustive exact k-nearest neighbor (kNN) search (computing the distance between the query and every single vector in the database) becomes computationally prohibitive. This brute-force approach, equivalent to a large matrix multiplication, doesn't scale.
+
+Fortunately, many applications can trade a small amount of accuracy for a
massive gain in speed. This is the domain of Approximate Nearest Neighbor (ANN)
search, and the Inverted File (IVF) index is one of the most widely used and
effective ANN methods.
+
+The fundamental principle behind IVF is "partition and conquer." Instead of
searching the entire dataset, IVF intelligently narrows the search scope to a
few promising regions, drastically reducing the number of comparisons needed.
+
+IVF works by partitioning a large dataset of vectors into smaller, manageable
clusters, each represented by a central point called a "centroid." These
centroids act as anchors for their respective partitions. During a search, the
system quickly identifies the clusters whose centroids are closest to the query
vector and only searches within those, ignoring the rest of the dataset.
+
+
+
+
+
+## IVF in Apache Doris
+
+Apache Doris supports building IVF‑based ANN indexes starting from version 4.x.
+
+### Index Construction
+
+The index type used here is ANN. There are two ways to create an ANN index:
you can define it when creating the table, or you can use the `CREATE/BUILD
INDEX` syntax. The two approaches differ in how and when the index is built,
and therefore fit different scenarios.
+
+Approach 1: define an ANN index on a vector column when creating the table. As
data is loaded, an ANN index is built for each segment as it is created. The
advantage is that once data loading completes, the index is already built and
queries can immediately use it for acceleration. The downside is that
synchronous index building slows down data ingestion and may cause extra index
rebuilds during compaction, leading to some waste of resources.
+
+```sql
+CREATE TABLE sift_1M (
+ id int NOT NULL,
+ embedding array<float> NOT NULL COMMENT "",
+ INDEX ann_index (embedding) USING ANN PROPERTIES(
+ "index_type"="ivf",
+ "metric_type"="l2_distance",
+ "dim"="128",
+ "nlist"="1024"
+ )
+) ENGINE=OLAP
+DUPLICATE KEY(id) COMMENT "OLAP"
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES (
+ "replication_num" = "1"
+);
+
+INSERT INTO sift_1M
+SELECT *
+FROM S3(
+ "uri" =
"https://selectdb-customers-tools-bj.oss-cn-beijing.aliyuncs.com/sift_database.tsv",
+ "format" = "csv");
+```
+
+#### CREATE/BUILD INDEX
+
+Approach 2: `CREATE/BUILD INDEX`.
+
+```sql
+CREATE TABLE sift_1M (
+ id int NOT NULL,
+ embedding array<float> NOT NULL COMMENT ""
+) ENGINE=OLAP
+DUPLICATE KEY(id) COMMENT "OLAP"
+DISTRIBUTED BY HASH(id) BUCKETS 1
+PROPERTIES (
+ "replication_num" = "1"
+);
+
+INSERT INTO sift_1M
+SELECT *
+FROM S3(
+ "uri" =
"https://selectdb-customers-tools-bj.oss-cn-beijing.aliyuncs.com/sift_database.tsv",
+ "format" = "csv");
+```
+
+After data is loaded, you can run `CREATE INDEX`. At this point the index is
defined on the table, but no index is yet built for the existing data.
+
+```sql
+CREATE INDEX idx_test_ann ON sift_1M (`embedding`) USING ANN PROPERTIES (
+ "index_type"="ivf",
+ "metric_type"="l2_distance",
+ "dim"="128",
+ "nlist"="1024"
+);
+
+SHOW DATA ALL FROM sift_1M;
+
+mysql> SHOW DATA ALL FROM sift_1M;
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| TableName | IndexName | ReplicaCount | RowCount | LocalTotalSize |
LocalDataSize | LocalIndexSize | RemoteTotalSize | RemoteDataSize |
RemoteIndexSize |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| sift_1M | sift_1M | 10 | 1000000 | 170.093 MB | 170.093
MB | 0.000 | 0.000 | 0.000 | 0.000 |
+| | Total | 10 | | 170.093 MB | 170.093
MB | 0.000 | 0.000 | 0.000 | 0.000 |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+```
+
+Then you can build the index using the `BUILD INDEX` statement:
+
+```sql
+BUILD INDEX idx_test_ann ON sift_1M;
+```
+
+`BUILD INDEX` is executed asynchronously. You can use `SHOW BUILD INDEX` (in
some versions `SHOW ALTER`) to check the job status.
+
+
+```sql
+SHOW BUILD INDEX WHERE TableName = "sift_1M";
+
+mysql> SHOW BUILD INDEX WHERE TableName = "sift_1M";
++---------------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+| JobId | TableName | PartitionName | AlterInvertedIndexes
| CreateTime | FinishTime
| TransactionId | State | Msg | Progress |
++---------------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+| 1764392359610 | sift_1M | sift_1M | [ADD INDEX idx_test_ann
(`embedding`) USING ANN PROPERTIES("dim" = "128", "index_type" = "ivf",
"metric_type" = "l2_distance", "nlist" = "1024")], | 2025-12-01 14:18:22.360 |
2025-12-01 14:18:27.885 | 5036 | FINISHED | | NULL |
++---------------+-----------+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+-------------------------+---------------+----------+------+----------+
+1 row in set (0.00 sec)
+
+mysql> SHOW DATA ALL FROM sift_1M;
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| TableName | IndexName | ReplicaCount | RowCount | LocalTotalSize |
LocalDataSize | LocalIndexSize | RemoteTotalSize | RemoteDataSize |
RemoteIndexSize |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+| sift_1M | sift_1M | 10 | 1000000 | 671.084 MB | 170.093
MB | 500.991 MB | 0.000 | 0.000 | 0.000 |
+| | Total | 10 | | 671.084 MB | 170.093
MB | 500.991 MB | 0.000 | 0.000 | 0.000 |
++-----------+-----------+--------------+----------+----------------+---------------+----------------+-----------------+----------------+-----------------+
+2 rows in set (0.00 sec)
+```
+
+#### DROP INDEX
+
+You can drop an unsuitable ANN index with `ALTER TABLE sift_1M DROP INDEX
idx_test_ann`. Dropping and recreating indexes is common during hyperparameter
tuning, when you need to test different parameter combinations to achieve the
desired recall.
+
+
+### Querying
+
+ANN indexes support both Top‑N search and range search.
+
+When the vector column has high dimensionality, the literal representation of
the query vector itself can incur extra parsing overhead. Therefore, directly
embedding the full query vector into raw SQL is not recommended in production,
especially under high concurrency. A better practice is to use prepared
statements, which avoid repetitive SQL parsing.
+
+We recommend using the
[doris-vector-search](https://github.com/uchenily/doris_vector_search) python
library, which wraps the necessary operations for vector search in Doris based
on prepared statements, and includes data conversion utilities that map Doris
query results into Pandas `DataFrame`s for convenient downstream AI application
development.
+
+
+```python
+from doris_vector_search import DorisVectorClient, AuthOptions
+
+auth = AuthOptions(
+ host="127.0.0.1",
+ query_port=9030,
+ user="root",
+ password="",
+)
+
+client = DorisVectorClient(database="test", auth_options=auth)
+
+tbl = client.open_table("sift_1M")
+
+query = [0.1] * 128 # Example 128-dimensional vector
+
+# SELECT id FROM sift_1M ORDER BY l2_distance_approximate(embedding, query)
LIMIT 10;
+result = tbl.search(query,
metric_type="l2_distance").limit(10).select(["id"]).to_pandas()
+
+print(result)
+```
+
+
+Sample output:
+
+```text
+ id
+0 123911
+1 926855
+2 123739
+3 73311
+4 124493
+5 153178
+6 126138
+7 123740
+8 125741
+9 124048
+```
+
+
+### Recall Optimization
+
+
+In vector search, recall is the most important metric; performance numbers
only make sense under a given recall level. The main factors that affect recall
are:
+
+1. Index‑time parameter of IVF (`nlist`) and query-time parameter (`nprobe`).
+2. Vector quantization.
+3. Segment size and the number of segments.
+
+This article focuses on the impact of (1) and (3) on recall. Vector
quantization will be covered in a separate document.
+
+
+#### Index Hyperparameters
+
+An IVF index organizes vectors into multiple clusters. During index
construction, vectors are partitioned into groups using clustering. The search
process then focuses only on the most relevant clusters. The workflow is
roughly as follows:
+
+At index time:
+
+1. **Clustering**: All vectors are partitioned into `nlist` clusters using a
clustering algorithm (e.g., k‑means). The centroid of each cluster is computed
and stored.
+2. **Vector assignment**: Each vector is assigned to the cluster whose
centroid is closest to it, and the vector is added to that cluster’s inverted
list.
+
+At query time:
+
+1. **Cluster selection using nprobe**: For a query vector, distances to all
`nlist` centroids are computed. Only the `nprobe` closest clusters are selected
for searching.
+2. **Exhaustive search within selected clusters**: The query is compared
against every vector in the selected nprobe clusters to find the nearest
neighbors.
+
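The workflow above can be sketched in a few lines of Python (an illustrative toy over random data, using a minimal k-means and `nprobe` probing with numpy; not the actual Doris/faiss implementation, and all data and parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def assign_to_centroids(vectors, centroids):
    # distance from every vector to every centroid; return the nearest centroid id
    d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def build_ivf(vectors, nlist, iters=10):
    # index time: cluster into nlist groups (k-means) and keep one inverted list per cluster
    centroids = vectors[rng.choice(len(vectors), nlist, replace=False)].copy()
    for _ in range(iters):
        assign = assign_to_centroids(vectors, centroids)
        for c in range(nlist):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    assign = assign_to_centroids(vectors, centroids)
    inverted_lists = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, inverted_lists

def ivf_search(query, vectors, centroids, inverted_lists, nprobe, k):
    # query time: exhaustive search only inside the nprobe nearest clusters
    order = np.linalg.norm(centroids - query, axis=1).argsort()
    cand = np.concatenate([inverted_lists[c] for c in order[:nprobe]])
    d = np.linalg.norm(vectors[cand] - query, axis=1)
    return cand[d.argsort()[:k]]

vectors = rng.random((2000, 16), dtype=np.float32)
query = vectors[0]
centroids, lists = build_ivf(vectors, nlist=32)
top = ivf_search(query, vectors, centroids, lists, nprobe=8, k=5)
print(top)  # top[0] == 0: the query vector is its own nearest neighbor
```

Increasing `nprobe` pulls more candidate vectors into the exhaustive phase, raising recall at the cost of examining more vectors.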
+In summary:
+
+`nlist` defines the number of clusters (inverted lists). It affects recall,
memory overhead, and build time. A larger `nlist` creates finer‑grained
clusters, which can improve search speed if the query’s nearest neighbors are
well‑localized, but it also increases the cost of clustering and the risk of
neighbors being spread across multiple clusters.
+
+`nprobe` defines the number of clusters to search during a query. A larger
`nprobe` increases recall and query latency (more vectors are examined). A
smaller nprobe makes queries faster but may miss neighbors that reside in
non‑probed clusters.
+
+
+By default, Doris uses `nlist = 1024` and `nprobe = 64`.
+
+
+The above is a qualitative analysis of these two hyperparameters. The
following table shows empirical results on the SIFT_1M dataset:
+
+
+| nlist | nprobe | recall_at_100 |
+| ----- | ------ | ------------- |
+| 1024 | 64 | 0.9542 |
+| 1024 | 32 | 0.9034 |
+| 1024 | 16 | 0.8299 |
+| 1024 | 8 | 0.7337 |
+| 512 | 32 | 0.9384 |
+| 512 | 16 | 0.8763 |
+| 512 | 8 | 0.7869 |
+
+
+It is hard to provide one single optimal setting in advance, but you can
follow a practical workflow for hyperparameter selection:
+
+1. Create a table `table_multi_index` without indexes. It can contain 2 or 3
vector columns.
+2. Load data into `table_multi_index` using Stream Load or other ingestion
methods.
+3. Use `CREATE INDEX` and `BUILD INDEX` to build ANN indexes on all vector
columns.
+4. Use different index parameter configurations on different columns. After
index building finishes, compute recall on each column and choose the best
parameter combination.
+
+For example:
+
+```sql
+ALTER TABLE tbl DROP INDEX idx_embedding;
+CREATE INDEX idx_embedding ON tbl (`embedding`) USING ANN PROPERTIES (
+ "index_type"="ivf",
+ "metric_type"="inner_product",
+ "dim"="768",
+ "nlist"="1024"
+);
+BUILD INDEX idx_embedding ON tbl;
+```
+
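To compare the candidate parameter combinations in step 4, recall can be computed by checking the approximate top-k against an exact brute-force top-k for a set of query vectors. A minimal sketch (the helper name is illustrative):

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k neighbors recovered by the approximate search."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# average over many queries to get a stable number, e.g. recall@100
def mean_recall(approx_results, exact_results, k):
    pairs = zip(approx_results, exact_results)
    return sum(recall_at_k(a, e, k) for a, e in pairs) / len(approx_results)
```

The `recall_at_100` column in the table above is this metric with k = 100, averaged over the query set.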
+
+#### Number of Rows Covered per Index
+
+
+Internally, Doris organizes data in multiple layers.
+
+- At the top is a **table**, which is partitioned into N **tablets** using a
distribution key. Tablets serve as units for data sharding, relocation, and
rebalance.
+- Each data ingestion or compaction produces a new **rowset** under a tablet.
A rowset is a versioned collection of data.
+- Data in a rowset is actually stored in **segment** files.
+
+Similar to inverted indexes, vector indexes are built at the **segment**
level. The segment size is determined by BE configuration options like
`write_buffer_size` and `vertical_compaction_max_segment_size`. During
ingestion and compaction, when the in‑memory memtable reaches a certain size,
it is flushed to disk as a segment file, and a vector index (or multiple
indexes for multiple vector columns) is built for that segment. The index only
covers the rows in that segment.
+
+Given a fixed set of IVF parameters, there is always a limit to the number of
vectors for which the index can still maintain high recall. Once the number of
vectors in a segment grows beyond that limit, recall starts to degrade.
+
+
+
+> You can use `SHOW TABLETS FROM table` to inspect the compaction status of a
table. By following the corresponding URL, you can see how many segments it has.
+
+#### Impact of Compaction on Recall
+
+Compaction can affect recall because it may create larger segments, which can
exceed the “coverage capacity” implied by the original hyperparameters. As a
result, the recall level achieved before compaction may no longer hold after
compaction.
+
+We recommend triggering a full compaction before running `BUILD INDEX`.
Building indexes on fully compacted segments stabilizes recall and also reduces
write amplification caused by index rebuilds.
+
+### Query Performance
+
+#### Cold Loading of Index Files
+
+The IVF ANN index in Doris is implemented using Meta’s open‑source library
[Faiss](https://github.com/facebookresearch/faiss). IVF indexes become
effective after being loaded into memory. Therefore, before running
high‑concurrency workloads, it is recommended to run some warm‑up queries to
make sure that all relevant segment indexes are loaded into memory; otherwise,
disk I/O overhead can significantly hurt query performance.
+
+#### Memory Footprint vs. Performance
+
+Without quantization or compression, the memory footprint of an IVF index is
roughly 1.02-1.1× the memory footprint of all vectors it indexes.
+
+For example, with 1 million 128‑dimensional vectors, an IVF-FLAT index
requires approximately:
+
+`128 * 4 * 1,000,000 * 1.02 ≈ 500 MB`.
+
+Some reference values:
+
+| dim | rows | estimated memory |
+|-----|------|------------------|
+| 128 | 1M | 496 MB |
+| 768 | 1M | 2.9 GB |
+
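The estimate behind these numbers is simple: 4 bytes per float32 component, times the number of vectors, times a small index overhead factor. A sketch of the back-of-the-envelope calculation (the 1.02 overhead factor is the low end of the range quoted above):

```python
def ivf_flat_memory_mb(dim, rows, overhead=1.02):
    """Estimated IVF-Flat footprint in MiB: 4 bytes per float32 component plus overhead."""
    return dim * 4 * rows * overhead / 2**20

# 1M 128-dim vectors come out to roughly 500 MB
print(f"{ivf_flat_memory_mb(128, 1_000_000):.0f} MB")
```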
+To maintain stable performance, ensure that each BE has enough memory;
otherwise, frequent swapping and I/O on index files will severely degrade query
latency.
+
+### Benchmark
+
+When benchmarking, the deployment should mirror the production setup: deploy
FE and BE on separate machines, and run the client on another independent
machine.
+
+You can use [VectorDBBench](https://github.com/zilliztech/VectorDBBench) as
the benchmark framework.
+
+#### Performance768D1M
+
+Benchmark command:
+
+```bash
+# load
+NUM_PER_BATCH=1000000 python3 -m vectordbbench doris --host 127.0.0.1 --port
9030 --case-type Performance768D1M --db-name Performance768D1M
--stream-load-rows-per-batch 500000 --index-prop index_type=ivf,nlist=1024
--skip-search-serial --skip-search-concurrent
+
+# search
+NUM_PER_BATCH=1000000 python3 -m vectordbbench doris --host 127.0.0.1 --port
9030 --case-type Performance768D1M --db-name Performance768D1M
--search-concurrent --search-serial --num-concurrency 10,40,80
--stream-load-rows-per-batch 500000 --index-prop index_type=ivf,nlist=1024
--session-var ivf_nprobe=64 --skip-load --skip-drop-old
+```
+
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]