This is an automated email from the ASF dual-hosted git repository.
morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new efc22760fd0 rm Brute-Force Search (#3238)
efc22760fd0 is described below
commit efc22760fd01faf1ffaf094ced7b7730bf945ce7
Author: zhiqiang <[email protected]>
AuthorDate: Wed Jan 7 15:32:59 2026 +0800
rm Brute-Force Search (#3238)
---
docs/ai/vector-search/overview.md | 13 -------------
.../current/ai/vector-search/overview.md | 11 -----------
.../version-4.x/ai/vector-search/overview.md | 11 -----------
versioned_docs/version-4.x/ai/vector-search/overview.md | 13 -------------
4 files changed, 48 deletions(-)
diff --git a/docs/ai/vector-search/overview.md b/docs/ai/vector-search/overview.md
index 92aaec324cd..539f0a56da8 100644
--- a/docs/ai/vector-search/overview.md
+++ b/docs/ai/vector-search/overview.md
@@ -34,19 +34,6 @@ To achieve this, we need a mechanism to measure semantic relatedness between a u
Vector retrieval in RAG is not limited to text; it naturally extends to multimodal scenarios. In a multimodal RAG system, images, audio, video, and other data types can also be encoded into vectors for retrieval and then supplied to the generative model as context. For example, if a user uploads an image, the system can first retrieve related descriptions or knowledge snippets, then generate explanatory content. In medical QA, RAG can retrieve patient records and literature to support mo [...]
-## Brute-Force Search
-
-Starting from version 2.0, Apache Doris supports nearest-neighbor search based on vector distance. Performing vector search with SQL is natural and simple:
-
-```sql
-SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
-FROM vector_table
-ORDER BY distance
-LIMIT 10;
-```
-
-When the dataset is small (under ~1 million rows), Doris’s exact K-Nearest Neighbor search performance is sufficient, providing 100% recall and precision. As the dataset grows, however, most users are willing to trade a small amount of recall/accuracy for significantly lower latency. The problem then becomes Approximate Nearest Neighbor (ANN) search.
-
## Approximate Nearest Neighbor Search
From version 4.0, Apache Doris officially supports ANN search. No additional data type is introduced: vectors are stored as fixed-length arrays. For distance-based indexing a new index type, ANN, is implemented based on Faiss.
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
index a37ae8a7b84..33debeba859 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/ai/vector-search/overview.md
@@ -27,17 +27,6 @@ under the License.
-->
在生成式 AI 的应用中,单纯依赖大模型自身的参数“记忆”存在明显局限:一方面,模型知识具有时效性,无法覆盖最新信息;另一方面,完全依赖模型直接“生成”容易产生幻觉(Hallucination)。因此,RAG(检索增强生成)应运而生。其核心目标不是让模型凭空构造答案,而是从外部知识库中检索出与用户查询最相关的 Top-K 信息片段,作为生成依据。为实现这一点,需要一种机制衡量“用户查询”与“知识库文档”之间的语义相关性。向量表示正是常用手段:将查询与文档统一编码为语义向量后,可通过向量相似度衡量相关程度。随着预训练模型的发展,生成高质量语义向量已成主流,RAG 的检索阶段也演化为一个标准的向量相似度搜索问题——从大规模向量集合中找出与查询最相似的 K 个向量(候选知识片段)。需要注意,RAG 的向量检索不限文本,也可扩展到多模态:图片、语音、视频等数据同样可以编码为向量供生成模型使用。例如,用户上传图片后,系统先检索相关描述或知识片段,再辅助生成解释性内容;在医学问答中,可检索病例资料与医学文献,生成更准确的诊断建议。
-## 暴力搜索
-Apache Doris 自 2.0 版本起支持基于向量距离的最近邻搜索,通过 SQL 实现向量搜索是一个自然且简单的过程。
-
-```sql
-SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
-FROM vector_table
-ORDER BY distance
-LIMIT 10;
-```
-
-当数据量不大(小于 100 万行)时,Apache Doris 的精确最近邻(K-Nearest Neighbor)搜索性能足以满足需求,可获得 100% 召回与 100% 精确。但随着数据进一步增长,用户通常愿意牺牲少量召回与精度以换取显著的查询加速,此时问题就转化为向量近似最近邻搜索(Approximate Nearest Neighbor,ANN)。
## 近似最近邻搜索
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
index b53f43ce802..e7d84f464e4 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/ai/vector-search/overview.md
@@ -27,17 +27,6 @@ under the License.
-->
在生成式 AI 的应用中,单纯依赖大模型自身的参数“记忆”存在明显局限:一方面,模型知识具有时效性,无法覆盖最新信息;另一方面,完全依赖模型直接“生成”容易产生幻觉(Hallucination)。因此,RAG(检索增强生成)应运而生。其核心目标不是让模型凭空构造答案,而是从外部知识库中检索出与用户查询最相关的 Top-K 信息片段,作为生成依据。为实现这一点,需要一种机制衡量“用户查询”与“知识库文档”之间的语义相关性。向量表示正是常用手段:将查询与文档统一编码为语义向量后,可通过向量相似度衡量相关程度。随着预训练模型的发展,生成高质量语义向量已成主流,RAG 的检索阶段也演化为一个标准的向量相似度搜索问题——从大规模向量集合中找出与查询最相似的 K 个向量(候选知识片段)。需要注意,RAG 的向量检索不限文本,也可扩展到多模态:图片、语音、视频等数据同样可以编码为向量供生成模型使用。例如,用户上传图片后,系统先检索相关描述或知识片段,再辅助生成解释性内容;在医学问答中,可检索病例资料与医学文献,生成更准确的诊断建议。
-## 暴力搜索
-Apache Doris 自 2.0 版本起支持基于向量距离的最近邻搜索,通过 SQL 实现向量搜索是一个自然且简单的过程。
-
-```sql
-SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
-FROM vector_table
-ORDER BY distance
-LIMIT 10;
-```
-
-当数据量不大(小于 100 万行)时,Apache Doris 的精确最近邻(K-Nearest Neighbor)搜索性能足以满足需求,可获得 100% 召回与 100% 精确。但随着数据进一步增长,用户通常愿意牺牲少量召回与精度以换取显著的查询加速,此时问题就转化为向量近似最近邻搜索(Approximate Nearest Neighbor,ANN)。
## 近似最近邻搜索
diff --git a/versioned_docs/version-4.x/ai/vector-search/overview.md b/versioned_docs/version-4.x/ai/vector-search/overview.md
index 0098e87b6a1..5d2cd669100 100644
--- a/versioned_docs/version-4.x/ai/vector-search/overview.md
+++ b/versioned_docs/version-4.x/ai/vector-search/overview.md
@@ -34,19 +34,6 @@ To achieve this, we need a mechanism to measure semantic relatedness between a u
Vector retrieval in RAG is not limited to text; it naturally extends to multimodal scenarios. In a multimodal RAG system, images, audio, video, and other data types can also be encoded into vectors for retrieval and then supplied to the generative model as context. For example, if a user uploads an image, the system can first retrieve related descriptions or knowledge snippets, then generate explanatory content. In medical QA, RAG can retrieve patient records and literature to support mo [...]
-## Brute-Force Search
-
-Starting from version 2.0, Apache Doris supports nearest-neighbor search based on vector distance. Performing vector search with SQL is natural and simple:
-
-```sql
-SELECT id, l2_distance(embedding, [1.0, 2.0, xxx, 10.0]) AS distance
-FROM vector_table
-ORDER BY distance
-LIMIT 10;
-```
-
-When the dataset is small (under ~1 million rows), Doris’s exact K-Nearest Neighbor search performance is sufficient, providing 100% recall and precision. As the dataset grows, however, most users are willing to trade a small amount of recall/accuracy for significantly lower latency. The problem then becomes Approximate Nearest Neighbor (ANN) search.
-
## Approximate Nearest Neighbor Search
From version 4.0, Apache Doris officially supports ANN search. No additional data type is introduced: vectors are stored as fixed-length arrays. For distance-based indexing a new index type, ANN, is implemented based on Faiss.