(doris-website) branch master updated: add spill disk doc (#2834)

yiguolei Thu, 04 Sep 2025 10:11:50 -0700

This is an automated email from the ASF dual-hosted git repository.

yiguolei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git



The following commit(s) were added to refs/heads/master by this push:
     new 88f26b8247d add spill disk doc (#2834)
88f26b8247d is described below

commit 88f26b8247d040b3bddeefc4cabd502f74317ecf
Author: yiguolei <[email protected]>
AuthorDate: Thu Sep 4 14:06:51 2025 +0800

    add spill disk doc (#2834)
    
    ## Versions
    
    - [x] dev
    - [ ] 3.0
    - [ ] 2.1
    - [ ] 2.0
    
    ## Languages
    
    - [x] Chinese
    - [x] English
    
    ## Docs Checklist
    
    - [x] Checked by AI
    - [ ] Test Cases Built
    
    Co-authored-by: yiguolei <[email protected]>
---
 .../admin-manual/workload-management/spill-disk.md | 240 ++++++-----
 .../admin-manual/workload-management/spill-disk.md | 437 +++++++++++----------
 sidebars.json                                      |   1 +
 3 files changed, 341 insertions(+), 337 deletions(-)

diff --git a/docs/admin-manual/workload-management/spill-disk.md 
b/docs/admin-manual/workload-management/spill-disk.md
index 92a90c10706..0201202e3b5 100644
--- a/docs/admin-manual/workload-management/spill-disk.md
+++ b/docs/admin-manual/workload-management/spill-disk.md
@@ -25,9 +25,6 @@ Currently, the operators that support spilling include:
 When a query triggers spilling, additional disk read/write operations may 
significantly increase query time. It is recommended to increase the FE Session 
variable query_timeout. Additionally, spilling can generate significant disk 
I/O, so it is advisable to configure a separate disk directory or use SSD disks 
to reduce the impact of query spilling on normal data ingestion or queries. The 
query spilling feature is currently disabled by default.
 
 ## Memory Management Mechanism
-Doris's memory management is divided into three levels: process level, 
Workload Group level, and Query level.
-
-![Memory Management Mechanism Spill Disk 
Memory](/images/workload-management/spill_disk_memory.png)
 
 ### BE Process Memory Configuration
 The memory of the entire BE process is controlled by the mem_limit parameter 
in be.conf. Once Doris's memory usage exceeds this threshold, Doris cancels the 
current query that is requesting memory. Additionally, a background task 
asynchronously kills some queries to release memory or cache. Therefore, 
Doris's internal management operations (such as spilling to disk, flushing 
memtable, etc.) need to run when approaching this threshold to avoid reaching 
it. Once the threshold is reached, t [...]
@@ -35,9 +32,10 @@ When Doris's BE is collocated with other processes (such as 
Doris FE, Kafka, HDF
 When the Doris process is deployed in K8S or managed by Cgroup, Doris 
automatically senses the memory configuration of the container.
 
 ### Workload Group Memory Configuration
-- memory_limit, default is 30%. Represents the percentage of memory allocated 
to the current workload group as a fraction of the entire process memory.
-- enable_memory_overcommit, default is true. Indicates whether the memory 
limit for the current workload group is a hard or soft limit. When this value 
is true, it means that the memory usage of all tasks within this workload group 
can exceed the memory_limit. However, when the memory of the entire process is 
insufficient, to ensure rapid memory reclamation, BE will prioritize canceling 
queries from workload groups that exceed their limits without waiting for 
spilling to disk. This is a  [...]
-- write_buffer_ratio, default is 20%. Represents the size of the write buffer 
within the current workload group. To speed up data ingestion, Doris first 
accumulates data in memory (i.e., constructs a Memtable), sorts it in its 
entirety when it reaches a certain size, and then writes it to disk. However, 
accumulating too many Memtables in memory can affect the memory available for 
normal queries, leading to query cancellation. Therefore, Doris allocates a 
separate write buffer for each wo [...]
+
+- MAX_MEMORY_PERCENT means that when requests are running in the group, their 
memory usage will never exceed this percentage of the total memory. Once 
exceeded, the query will either trigger disk spilling or be killed.
+- MIN_MEMORY_PERCENT sets the minimum memory value for a group. When resources 
are idle, memory exceeding MIN_MEMORY_PERCENT can be used. However, when memory 
is insufficient, the system will allocate memory according to 
MIN_MEMORY_PERCENT (minimum memory percentage). It may select some queries to 
kill, reducing the memory usage of the Workload Group to MIN_MEMORY_PERCENT to 
ensure that other Workload Groups have sufficient memory available.
+- The sum of MIN_MEMORY_PERCENT across all Workload Groups must not exceed 
100%, and MIN_MEMORY_PERCENT cannot be greater than MAX_MEMORY_PERCENT.
 - low watermark: Default is 75%.
 - high watermark: Default is 90%.
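
The two constraints stated in the added bullets (the sum of MIN_MEMORY_PERCENT across all Workload Groups must not exceed 100%, and MIN_MEMORY_PERCENT must not exceed MAX_MEMORY_PERCENT within a group) can be sketched as a small pre-flight check. This is only an illustration, not Doris code: the function name `validate_workload_groups` and the dictionary layout are hypothetical, and Doris enforces these rules itself when a Workload Group is created or altered.

```python
from typing import Dict, Tuple


def validate_workload_groups(groups: Dict[str, Tuple[float, float]]) -> None:
    """Check the two rules from the doc for a set of Workload Groups.

    `groups` maps a group name to (min_memory_percent, max_memory_percent).
    Both the function and this layout are hypothetical helpers.
    """
    total_min = 0.0
    for name, (min_pct, max_pct) in groups.items():
        # Rule 1: MIN_MEMORY_PERCENT cannot be greater than MAX_MEMORY_PERCENT.
        if min_pct > max_pct:
            raise ValueError(
                f"{name}: MIN_MEMORY_PERCENT {min_pct} exceeds "
                f"MAX_MEMORY_PERCENT {max_pct}"
            )
        total_min += min_pct
    # Rule 2: the minimums are guarantees, so together they cannot exceed 100%.
    if total_min > 100:
        raise ValueError(
            f"sum of MIN_MEMORY_PERCENT is {total_min}, which exceeds 100"
        )
```

For example, `validate_workload_groups({"normal": (20, 90), "etl": (30, 100)})` passes, while two groups with minimums of 60 and 50 would be rejected because the guaranteed shares add up to 110%.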
 
@@ -45,7 +43,6 @@ When the Doris process is deployed in K8S or managed by 
Cgroup, Doris automatica
 ### Static Memory Allocation
 The memory used by a query is controlled by the following two parameters:
 - exec_mem_limit, representing the maximum memory that a query can use, with a 
default value of 2GB.
-- enable_mem_overcommit, default is true. Indicates whether the memory used by 
a query can exceed the exec_mem_limit. The default value is true, meaning it 
can exceed this limit. When the process memory is insufficient, queries that 
exceed the memory limit will be killed. If set to false, the query's memory 
usage cannot exceed this limit. When exceeded, spilling to disk or query 
killing will occur based on user settings. These two parameters must be set by 
the user in the session variabl [...]
 
 ### Slot-Based Memory Allocation
 In practice, we found that with static memory allocation, users often do not 
know how much memory to allocate to a query. Therefore, exec_mem_limit is 
frequently set to half of the entire BE process memory, meaning that the memory 
used by all queries within the BE cannot exceed half of the process memory. In 
this scenario, this feature effectively becomes a kind of fuse. When we need to 
implement more granular policy control based on memory size, such as spilling 
to disk, this value is t [...]
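
The zh-CN side of this commit spells out the slot arithmetic: each slot is `max_memory_percent * mem_limit / max_concurrency`, the `fixed` policy gives a query `group limit * query_slot_count / max_concurrency`, and the `dynamic` policy gives it `group limit * query_slot_count / sum(running query slots)`. A minimal sketch of those three formulas follows. The function names are hypothetical, sizes are in bytes, and it is an assumption here that the running-slot list includes the query being sized.

```python
from typing import Sequence

GIB = 2 ** 30


def group_limit_bytes(mem_limit_bytes: int, max_memory_percent: int) -> int:
    """Memory available to a Workload Group: a percentage of the BE mem_limit."""
    return mem_limit_bytes * max_memory_percent // 100


def fixed_query_limit(group_limit: int, query_slot_count: int,
                      max_concurrency: int) -> int:
    """'fixed' policy: every slot is worth group_limit / max_concurrency."""
    return group_limit * query_slot_count // max_concurrency


def dynamic_query_limit(group_limit: int, query_slot_count: int,
                        running_slots: Sequence[int]) -> int:
    """'dynamic' policy: share the group limit among slots actually in use.

    Assumption: running_slots lists query_slot_count for every running
    query, including the one being sized.
    """
    return group_limit * query_slot_count // sum(running_slots)


# Worked example: 64 GiB BE limit, group capped at 75%, 8 concurrent slots.
limit = group_limit_bytes(64 * GIB, 75)              # 48 GiB for the group
fixed = fixed_query_limit(limit, 2, 8)               # 2 of 8 slots
dynamic = dynamic_query_limit(limit, 2, [2, 1, 1])   # 2 of 4 slots in use
```

With these numbers the same two-slot query is limited to 12 GiB under `fixed` but 24 GiB under `dynamic`, which illustrates the point made in the zh-CN text that `dynamic` avoids leaving unused slots idle.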
@@ -73,16 +70,14 @@ spill_storage_limit=100%
 ```
 set enable_spill=true;
 set exec_mem_limit = 10g
-set enable_mem_overcommit = false
 ```
 - enable_spill, indicates whether spilling is enabled for a query.
 - exec_mem_limit, represents the maximum memory size used by a query.
-- enable_mem_overcommit, indicates whether a query can use memory exceeding 
the exec_mem_limit.
 
 #### Workload Group
-The default memory_limit for workload groups is 30%, which can be adjusted 
based on the actual number of workload groups. If there is only one workload 
group, it can be adjusted to 90%.
+The default max_memory_percent for workload groups is 100%, which can be 
adjusted based on the actual number of workload groups. If there is only one 
workload group, it can be adjusted to 90%.
 ```
-alter workload group normal properties ( 'memory_limit'='90%' );
+alter workload group normal properties ( 'max_memory_percent'='90%' );
 ```
 
 ### Monitoring Spilling
@@ -142,18 +137,6 @@ mysql [information_schema]>select * from 
backend_active_tasks;
 2 rows in set (0.00 sec)
 ```
 
-##### workload_group_resource_usage
-The WRITE_BUFFER_USAGE_BYTES field has been added, representing the memory 
usage of Memtables for ingestion tasks within the workload group.
-```
-mysql [information_schema]>select * from workload_group_resource_usage;
-+-------+-------------------+--------------------+-------------------+-----------------------------+------------------------------+--------------------------+
-| BE_ID | WORKLOAD_GROUP_ID | MEMORY_USAGE_BYTES | CPU_USAGE_PERCENT | 
LOCAL_SCAN_BYTES_PER_SECOND | REMOTE_SCAN_BYTES_PER_SECOND | 
WRITE_BUFFER_USAGE_BYTES |
-+-------+-------------------+--------------------+-------------------+-----------------------------+------------------------------+--------------------------+
-| 10009 |                 1 |          102314948 |              0.69 |         
                  0 |                            0 |                 23404836 |
-+-------+-------------------+--------------------+-------------------+-----------------------------+------------------------------+--------------------------+
-1 row in set (0.01 sec)
-```
-
 ## Testing
 ### Test Environment
 #### Machine Configuration
@@ -169,7 +152,7 @@ The test used Alibaba Cloud servers with the following 
specific configurations:
 16 cores(vCPU) 64 GiB 0 Mbps ecs.g6.4xlarge
 ```
 
-#### 测试数据
+#### Dataset
 The test data used TPC-DS 10TB as input, sourced from Alibaba Cloud DLF, and 
mounted to Doris using the Catalog method. The SQL statement is as follows:
 ```
 CREATE CATALOG dlf PROPERTIES (
@@ -179,9 +162,7 @@ CREATE CATALOG dlf PROPERTIES (
 "dlf.endpoint" = "dlf-vpc.cn-beijing.aliyuncs.com",
 "dlf.region" = "cn-beijing",
 "dlf.uid" = "217316283625971977",
-"dlf.catalog.id" = "emr_dev",
-"dlf.access_key" = "fill in as applicable",
-"dlf.secret_key" = "fill in as applicable"
+"dlf.catalog.id" = "emr_dev"
 );
 ```
 
@@ -190,106 +171,105 @@ Reference website: 
https://doris.apache.org/zh-CN/docs/dev/benchmark/tpcds
 ### Test Results
 The dataset size was 10TB. The ratio of memory to dataset size was 1:52. The 
overall runtime was 32,000 seconds, and all 99 queries were successfully 
executed. In the future, we will provide spilling capabilities for more 
operators (such as window functions, Intersect, etc.) and continue to optimize 
performance under spilling conditions, reduce disk consumption, and improve 
query stability.
 
-| query   |Time(ms)|
-|---------|---------|
-| query1  |25590|
-| query2  |126445|
-| query3  |103859|
-| query4  |1174702|
-| query5  |266281|
-| query6  |62950|
-| query7  |212745|
-| query8  |67000|
-| query9  |602291|
-| query10 |70904|
-| query11 |544436|
-| query12 |25759|
-| query13 |229144|
-| query14 |1120895|
-| query15 |29409|
-| query16 |117287|
-| query17 |260122|
-| query18 |97453|
-| query19 |127384|
-| query20 |32749|
-| query21 |4471|
-| query22 |10162|
-| query23 |1772561|
-| query24 |535506|
-| query25 |272458|
-| query26 |83342|
-| query27 |175264|
-| query28 |887007|
-| query29 |427229|
-| query30 |13661|
-| query31 |108778|
-| query32 |37303|
-| query33 |181351|
-| query34 |84159|
-| query35 |81701|
-| query36 |152990|
-| query37 |36815|
-| query38 |172531|
-| query39 |20155|
-| query40 |75749|
-| query41 |527|
-| query42 |95910|
-| query43 |66821|
-| query44 |209947|
-| query45 |26946|
-| query46 |131490|
-| query47 |158011|
-| query48 |149482|
-| query49 |303515|
-| query50 |298089|
-| query51 |156487|
-| query52 |97440|
-| query53 |98258|
-| query54 |202583|
-| query55 |93268|
-| query56 |185255|
-| query57 |80308|
-| query58 |252746|
-| query59 |171545|
-| query60 |202915|
-| query61 |272184|
-| query62 |38749|
-| query63 |94327|
-| query64 |247074|
-| query65 |270705|
-| query66 |101465|
-| query67 |3744186|
-| query68 |151543|
-| query69 |15559|
-| query70 |132505|
-| query71 |180079|
-| query72 |3085373|
-| query73 |82623|
-| query74 |330087|
-| query75 |830993|
-| query76 |188805|
-| query77 |239730|
-| query78 |1895765|
-| query79 |144829|
-| query80 |463652|
-| query81 |15319|
-| query82 |76961|
-| query83 |32437|
-| query84 |22849|
-| query85 |58186|
-| query86 |33933|
-| query87 |185421|
-| query88 |434867|
-| query89 |108265|
-| query90 |31131|
-| query91 |18864|
-| query92 |24510|
-| query93 |281904|
-| query94 |67761|
-| query95 |3738968|
-| query96 |47245|
-| query97 |536702|
-| query98 |97800|
-| query99 |62210|
-| sum     |31797707|
-
+| Query    | Doris |
+| ---------- | ------- |
+| query1   | 29092 |
+| query2   | 130003 |
+| query3   | 96119 |
+| query4   | 1199097 |
+| query5   | 212719 |
+| query6   | 62259 |
+| query7   | 209154 |
+| query8   | 62433 |
+| query9   | 579371 |
+| query10  | 54260 |
+| query11  | 560169 |
+| query12  | 26084 |
+| query13  | 228756 |
+| query14  | 1137097 |
+| query15  | 27509 |
+| query16  | 84806 |
+| query17  | 288164 |
+| query18  | 94770 |
+| query19  | 124955 |
+| query20  | 30970 |
+| query21  | 4333 |
+| query22  | 9890 |
+| query23  | 1757755 |
+| query24  | 399553 |
+| query25  | 291474 |
+| query26  | 79832 |
+| query27  | 175894 |
+| query28  | 647497 |
+| query29  | 1299597 |
+| query30  | 11434 |
+| query31  | 106665 |
+| query32  | 33481 |
+| query33  | 146101 |
+| query34  | 84055 |
+| query35  | 69885 |
+| query36  | 148662 |
+| query37  | 21598 |
+| query38  | 164746 |
+| query39  | 5874 |
+| query40  | 51602 |
+| query41  | 563 |
+| query42  | 93005 |
+| query43  | 67769 |
+| query44  | 79527 |
+| query45  | 26575 |
+| query46  | 134991 |
+| query47  | 161873 |
+| query48  | 153657 |
+| query49  | 259387 |
+| query50  | 141421 |
+| query51  | 158056 |
+| query52  | 91392 |
+| query53  | 89497 |
+| query54  | 124118 |
+| query55  | 82584 |
+| query56  | 152110 |
+| query57  | 83417 |
+| query58  | 259580 |
+| query59  | 177125 |
+| query60  | 161729 |
+| query61  | 258058 |
+| query62  | 39619 |
+| query63  | 91258 |
+| query64  | 234882 |
+| query65  | 278610 |
+| query66  | 90246 |
+| query67  | 3939554 |
+| query68  | 183648 |
+| query69  | 11031 |
+| query70  | 137901 |
+| query71  | 166454 |
+| query72  | 2859001 |
+| query73  | 92015 |
+| query74  | 336694 |
+| query75  | 838989 |
+| query76  | 174235 |
+| query77  | 174525 |
+| query78  | 1956786 |
+| query79  | 162259 |
+| query80  | 602088 |
+| query81  | 16184 |
+| query82  | 56292 |
+| query83  | 26211 |
+| query84  | 11906 |
+| query85  | 57739 |
+| query86  | 34350 |
+| query87  | 173631 |
+| query88  | 449003 |
+| query89  | 113799 |
+| query90  | 30825 |
+| query91  | 12239 |
+| query92  | 26695 |
+| query93  | 275828 |
+| query94  | 56464 |
+| query95  | 64932 |
+| query96  | 48102 |
+| query97  | 597371 |
+| query98  | 112399 |
+| query99  | 64472 |
+| Sum      | 28102386 |
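
The new `Doris` column carries no unit. Assuming it is still milliseconds, as in the `Time(ms)` column it replaces, the `Sum` row can be converted to seconds to compare against the runtime quoted in the prose. The helper name below is hypothetical; the two totals are the `sum` rows of the removed and added tables.

```python
def ms_to_seconds(total_ms: int) -> float:
    """Convert a millisecond total to seconds."""
    return total_ms / 1000


# Totals taken from the 'sum' rows of the removed and added tables.
old_total_s = ms_to_seconds(31797707)  # removed table
new_total_s = ms_to_seconds(28102386)  # added table
```

Under that assumption the removed table totals about 31,798 s, which matches the "32,000 seconds" in the unchanged English paragraph, while the added table totals about 28,102 s, the figure the zh-CN text in this commit now states (28102.386s).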
diff --git 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/workload-management/spill-disk.md
 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/workload-management/spill-disk.md
index f014c26626c..208cbd6c128 100644
--- 
a/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/workload-management/spill-disk.md
+++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/admin-manual/workload-management/spill-disk.md
@@ -1,107 +1,136 @@
 ---
 {
-"title": "算子落盘",
+"title": "落盘",
 "language": "zh-CN"
 }
 ---
 
-## 概述
-Doris 的计算层是一个 MPP 的架构，所有的计算任务都是在 BE 的内存中完成的，各个 BE 
之间也是通过内存来完成数据交换，所以内存管理对查询的稳定性有至关重要的影响，从线上查询统计看，有一大部分的查询报错也是跟内存相关。当前越来越多的用户将 ETL 
数据加工，多表物化视图处理，复杂的 AdHoc 查询等任务迁移到 Doris 
上运行，所以，需要将中间操作结果卸载到磁盘上，使那些所需内存量超出每个查询或每个节点限制的查询能够得以执行。具体来说，当处理大型数据集或执行复杂查询时，内存消耗可能会迅速增加，超出单个节点或整个查询处理过程中可用的内存限制。Doris
 通过将其中的中间结果（如聚合的中间状态、排序的临时数据等）写入磁盘，而不是完全依赖内存来存储这些数据，从而缓解了内存压力。这样做有几个好处：
-- 扩展性：允许 Doris 处理比单个节点内存限制大得多的数据集。
-- 稳定性：减少因内存不足导致的查询失败或系统崩溃的风险。
-- 灵活性：使得用户能够在不增加硬件资源的情况下，执行更复杂的查询。
-  
-为了避免申请内存时触发 OOM，Doris 引入了 reserve memory 机制，这个机制的工作流程如下：
-- Doris 在执行过程中，会预估每次处理一个 Block 需要的内存大小，然后到一个统一的内存管理器中去申请；
-- 全局的内存分配器会判断当前内存申请，是否超过了 Query 的内存限制或者超过了整个进程的内存限制，如果超过了，那么就返回失败；
-- Doris 在收到失败消息时，会将当前 Query 挂起，然后选择最大的算子进行落盘，等到落盘结束后，Query 再继续执行。
-  
+# 概述
+
+Doris 
的计算层是一个MPP的架构，所有的计算任务都是在BE的内存中完成的，各个BE之间也是通过内存来完成数据交换，所以内存管理对查询的稳定性有至关重要的影响，从线上查询统计看，有一大部分的查询报错也是跟内存相关。**当前越来越多的用户将ETL
 数据加工，多表物化视图处理，复杂的AdHoc查询**等任务迁移到Doris 
上运行，所以，需要将中间操作结果卸载到磁盘上，使那些所需内存量超出每个查询或每个节点限制的查询能够得以执行。具体来说，当处理大型数据集或执行复杂查询时，内存消耗可能会迅速增加，超出单个节点或整个查询处理过程中可用的内存限制。Doris通过将其中的中间结果（如聚合的中间状态、排序的临时数据等）写入磁盘，而不是完全依赖内存来存储这些数据，从而缓解了内存压力。这样做有几个好处：
+
+* 扩展性：允许Doris处理比单个节点内存限制大得多的数据集。
+* 稳定性：减少因内存不足导致的查询失败或系统崩溃的风险。
+* 灵活性：使得用户能够在不增加硬件资源的情况下，执行更复杂的查询。
+
+为了避免申请内存时触发OOM，Doris 引入了reserve memory机制，这个机制的工作流程如下：
+
+* Doris在执行过程中，会预估每次处理一个Block 需要的内存大小，然后到一个统一的内存管理器中去申请；
+* 全局的内存分配器会判断当前内存申请，是否超过了Query、Workload Group或者整个进程的内存限制，如果超过了，那么就返回失败；
+* Doris 在收到失败消息时，会将当前Query 挂起，然后选择最大的算子进行落盘，等到落盘结束后，Query再继续执行。
+
 目前支持落盘的算子有：
-- Hash Join 算子
-- 聚合算子
-- 排序算子
-- CTE
 
-当查询触发落盘时，由于会有额外的硬盘读写操作，查询时间可能会显著增长，建议调大 FE Session 变量 
query_timeout。同时落盘会有比较大的磁盘 IO，建议单独配置一个磁盘目录或者使用 SSD 
磁盘降低查询落盘对正常的导入或者查询的影响。目前查询落盘功能默认关闭。
+* Hash Join算子
+* 聚合算子
+* 排序算子
+* CTE
+
+当查询触发落盘时，由于会有额外的硬盘读写操作，查询时间可能会显著增长，建议调大FE 
Session变量`query_timeout`。同时落盘会有比较大的磁盘IO，建议单独配置一个磁盘目录或者使用SSD磁盘降低查询落盘对正常的导入或者查询的影响。目前查询落盘功能默认关闭。
 
-## 内存管理机制
-Doris 的内存管理分为三个级别：进程级别、WorkloadGroup 级别、Query 级别。
-![spill_disk_memory](/images/workload-management/spill_disk_memory.png)
+# 内存管理机制
 
-### BE 进程内存配置
-整个 BE 进程的内存由 be.conf 中的参数 mem_limit 控制，一旦 Doris 使用的内存超过这个阈值，Doris 就会把当前正在申请内存的 
Query 取消，同时后台也会有一个定时任务，异步的 Kill 一部分 Query 来释放内存 或者 释放一些 Cache。所以 Doris 
内部的各种管理操作（比如 spill disk，flush memtable 
等）需要在快接近这个阈值的时候，就需要运行，尽可能的避免内存达到这个阈值，一旦到达这个阈值，为了避免整个进程 OOM，Doris 
会采取一些非常暴力的自我保护措施。
-当 Doris 的 BE 跟其他的进程混部（比如 Doris FE、Kafka、HDFS）的时候，会导致 Doris BE 实际可用的内存远小于用户设置的 
mem_limit 导致内部的释放内存机制失效，然后导致 Doris 进程被操作系统的 OOM Killer 杀死。
-当 Doris 进程部署在 K8S 里或者用 Cgroup 管理的时候，Doris 会自动感知容器的内存配置。
+Doris 的内存管理分为三个级别： 进程级别、WorkloadGroup 级别、Query 级别。
 
-### Workload Group 内存配置
-- memory_limit，默认是 30%。表示当前 Workload Group 分配的内存占整个进程内存的百分比。
-- enable_memory_overcommit，默认是 true。表示当前 Workload Group 的内存限制，是硬限还是软限。当这个值为 
true 时，表示这个 Workload Group 内所有的任务使用的内存的大小可以超过 memory_limit 
的限制。但是当整个进程的内存不足时，为了保证能够快速的回收内存，BE 会优先从那些超过自身限制的 Workload Group 中挑选 Query 去 
cancel，此时并不会等待 Spill Disk。当用户不知道如何给多个 Workload Group 设置多少内存时，这种方式是一个比较易用的配置策略。
-- write_buffer_ratio，默认是 20%。表示当前 Workload Group 内 write buffer 的大小。Doris 
为了加快导入速度，数据首先会在内存里攒批（也就是构建 Memtable），然后到一定大小的时候，再整体排序，然后写入硬盘。但是如果内存里积攒太多的 
Memtable 又会影响正常 Query 可用的内存，导致 Query 被 Cancel。所以 Doris 在每个 Workload Group 
内都单独划分了一个 write buffer。对于写入比较大的 Workload Group，可以设置比较大的 write 
buffer，可以有效的提升写入的吞吐；对于查询比较多的 Workload Group 可以调小这个值。
-- low watermark: 默认是 75%。
-- high watermark：默认是 90%.
+## BE 进程内存配置
+
+整个BE 进程的内存由be.conf中的参数`mem_limit` 控制，一旦Doris 使用的内存超过这个阈值，Doris 
就会把当前正在申请内存的Query 取消，同时后台也会有一个定时任务，异步的Kill 一部分 Query来释放内存 或者 释放一些Cache。所以Doris 
内部的各种管理操作（比如spill disk ， flush 
memtable等）需要在快接近这个阈值的时候，就需要运行，尽可能的避免内存达到这个阈值，一旦到达这个阈值，为了避免整个进程OOM，Doris 
会采取一些非常暴力的自我保护措施。
+
+当Doris的BE跟其他的进程混部（比如Doris FE 、Kafka、HDFS）的时候，会导致Doris BE 
实际可用的内存远小于用户设置的`mem_limit` 导致内部的释放内存机制失效，然后导致Doris 进程被操作系统的OOM Killer 杀死。
+
+当Doris 进程部署在K8S里或者用Cgroup 管理的时候，Doris 会自动感知容器的内存配置。
+
+## Workload group 内存配置
+
+* max\_memory\_percent，意味着当请求在该池中运行时，它们占用的内存绝不会超过总内存的这一百分比，一旦超过那么Query 
将会触发落盘或者被Kill。
+* 
min\_memory\_percent，为某个池设置最小内存值，当资源空闲时，可以使用超过MIN\_MEMORY\_PERCENT的内存，但是当内存不足时，系统将按照min\_memory\_percent（最小内存百分比）分配内存，可能会选取一些Query
 Kill，将Workload Group 的内存使用量降低到min\_memory\_percent，以确保其他Workload Group有足够的内存可用。
+* 所有的Workload Group的 MIN\_MEMORY\_PERCENT 之和不能超过 100%，并且 MIN\_MEMORY\_PERCENT 
不能大于 MAX\_MEMORY\_PERCENT。
+* memory\_low\_watermark: 默认是80%。表示当前workload group的内存使用率低水位线。
+* memory\_high\_watermark：默认是95%。表示当前workload group的内存使用率高水位线。workload 
group的内存使用率大于此值时，reserve memory会失败，触发查询落盘。
 
 ## Query 内存管理
+
 ### 静态内存分配
-Query 运行的内存受以下 2 个参数控制：
-- exec_mem_limit，表示一个 query 最大可以使用的内存，默认值 2G；
-- enable_mem_overcommit，默认是 true。表示一个 query 使用的内存是否可以超过 exec_mem_limit 
的限制，默认值是 true，表示是可以超过这个限制的，此时当进程内存不足的时候，会去杀死那些超过内存限制的 query；false 表示 query 
使用的内存不能超过这个限制，当超过的时候，会根据用户的设置选择落盘或者 kill。
-  这两个参数是 query 运行之前用户就需要在 session variable 里设置好，运行期间不能够动态修改。
-
-### 基于 Slot 的内存分配
-静态内存分配方式，在使用过程中我们发现，很多时候用户不知道一个 query 应该分配多少内存，所以经常把 exec_mem_limit 设置为整个 BE 
进程内存的一半，也就是整个 BE 内所有的 query 
使用的内存都不允许超过整个进程内存的一半，这个功能在这种场景下实际变成了一个类似熔断的功能。当我们要根据内存的大小做一些更精细的策略控制，比如 spill 
disk 时，由于这个值太大了，所以不能依赖它来做一些控制。
-所以我们基于 Workload Group 实现了一个新的基于 slot 的内存限制方式，这个策略的原理如下：
-- 每个 Workload Group 用户都配置了 2 个参数，memory_limit 和 max_concurrency，那么就认为整个 be 
的内存被划分为 max_concurrency 个 slot，每个 slot 占用的内存是 memory_limit / max_concurrency。
-- 默认情况下，每个 query 运行占用 1 个 slot，如果用户想让一个 query 使用更多的内存，那么就需要修改 query_slot_count 
的值。
-- 由于 Workload Group 的 slot 的总数是固定的，假如用户调大 query_slot_count，相当于每个 query 占用了更多的 
slot，那么整个 Workload Group 可同时运行的 query 的数量就动态减少了，新来的 query 就自动排队。
-  
-Workload Group 的 slot_memory_policy，这个参数可以有 3 个可选的值：
-- disabled，默认值，表示不启用，使用静态内存分配方式；
-- fixed，每个 query 可以使用的的内存 = Workload Group 的 mem_limit * query_slot_count/ 
max_concurrency.
-- dynamic，每个 query 可以使用的的内存 = Workload Group 的 mem_limit * query_slot_count/ 
sum(running query slots)，它主要是克服了 fixed 模式下，会存在有一些 slot 没有使用的情况。
-  fixed 或者 dynamic 都是设置的 query 的硬限，一旦超过，就会落盘或者 kill；而且会覆盖用户设置的静态内存分配的参数。所以当要设置 
slot_memory_policy 时，一定要设置好 Workload Group 的 max_concurrency，否则会出现内存不足的问题。
-
-## 落盘
-### 开启查询中间结果落盘
-#### BE 配置项
-```sql
+
+Query 运行的内存受exec\_mem\_limit这个参数控制，在query 运行之前用户就需要在session variable 
里设置好，运行期间不能够动态修改。
+
+* exec\_mem\_limit，表示一个query 最大可以使用的内存，默认值100G；这个值在3.1 
版本之前，默认值是2G，实际偏小，大部分查询都需要超过2G的内存，由于这个参数并没有真正在BE端生效，所以对查询并没有影响；在3.1 
版本之后，当查询使用的内存达到这个限制时，查询会被Cancel或者触发落盘，所以在升级之前用户需要把这个默认值改为100G。
+
+### 基于Slot的内存分配
+
+静态内存分配方式，在使用过程中我们发现，很多时候用户不知道一个query 应该分配多少内存，所以经常把exec\_mem\_limit 设置为整个BE 
进程内存的一半，也就是整个BE内所有的query 
使用的内存都不允许超过整个进程内存的一半，这个功能在这种场景下实际变成了一个类似熔断的功能。当我们要根据内存的大小做一些更精细的策略控制，比如spill 
disk时，由于这个值太大了，所以不能依赖它来做一些控制。
+
+所以我们基于workload group 实现了一个新的基于slot的内存限制方式，这个策略的原理如下：
+
+* 每个workload group 用户都配置了2个参数，max\_memory\_percent和max\_concurrency，那么就认为整个be 
的内存被划分为 max\_concurrency 个slot，每个slot 占用的内存是max\_memory\_percent \* mem\_limit 
/ max\_concurrency。
+* 默认情况下，每个query 运行占用1个slot，如果用户想让一个query 使用更多的内存，那么就需要修改`query_slot_count` 的值。
+* 由于workload group 的slot的总数是固定的，假如用户调大query\_slot\_count，相当于每个query 
占用了更多的slot，那么整个workload group 可同时运行的query的数量就动态减少了，新来的query 就自动排队。
+
+Workload group的slot\_memory\_policy，这个参数可以有3个可选的值：
+
+* none，默认值，表示不启用；在这种方式下，Query 就尽量的使用内存，但是一旦达到Workload 
Group的上限，就会触发落盘；此时不会根据查询的大小选择。
+* fixed，每个query 可以使用的的内存 = `workload group的mem_limit * query_slot_count/ 
max_concurrency`；这种内存分配策略实际是按照并发，给每个Query 分配了固定的内存。
+* dynamic，每个query 可以使用的的内存 = `workload group的mem_limit * query_slot_count/ 
``sum(running query slots)`，它主要是克服了fixed 模式下，会存在有一些slot 没有使用的情况；实际就是把大的查询先落盘。
+
+fixed或者dynamic 都是设置的query的硬限，一旦超过，就会落盘或者kill；而且会覆盖用户设置的静态内存分配的参数。 
所以当要设置slot\_memory\_policy时，一定要设置好workload group的max\_concurrency，否则会出现内存不足的问题。
+
+# 落盘
+
+## 开启查询中间结果落盘
+
+### BE配置项
+
+```JavaScript
 
spill_storage_root_path=/mnt/disk1/spilltest/doris/be/storage;/mnt/disk2/doris-spill;/mnt/disk3/doris-spill
 spill_storage_limit=100%
 ```
-- spill_storage_root_path：查询中间结果落盘文件存储路径，默认和 storage_root_path 一样。
-- spill_storage_limit: 落盘文件占用磁盘空间限制。可以配置具体的空间大小（比如 100G, 1T）或者百分比，默认是 20%。如果 
spill_storage_root_path 配置单独的磁盘，可以设置为 100%。这个参数主要是防止落盘占用太多的磁盘空间，导致无法进行正常的数据存储。
-  修改配置项之后，需要重启 BE 才能生效。
 
-#### FE Session Variable
-```sql
+* spill\_storage\_root\_path：查询中间结果落盘文件存储路径，默认和storage\_root\_path一样。
+* spill\_storage\_limit: 落盘文件占用磁盘空间限制。可以配置具体的空间大小（比如100G, 
1T）或者百分比，默认是20%。如果spill\_storage\_root\_path配置单独的磁盘，可以设置为100%。这个参数主要是防止落盘占用太多的磁盘空间，导致无法进行正常的数据存储。
+
+修改配置项之后，需要重启BE才能生效。
+
+### FE Session Variable
+
+```JavaScript
 set enable_spill=true;
-set exec_mem_limit = 10g
-set enable_mem_overcommit = false
+set exec_mem_limit = 10g;
+set query_timeout = 3600;
 ```
-- enable_spill 表示一个 query 是否开启落盘；
-- exec_mem_limit 表示一个 query 使用的最大的内存大小；
-- enable_mem_overcommit query 是否可以使用超过 exec_mem_limit 大小的内存限制
 
-#### Workload Group
+* enable\_spill 表示一个query 是否开启落盘，默认关闭；如果开启，在内存紧张的情况下，会触发查询落盘；
+* exec\_mem\_limit 表示一个query 使用的最大的内存大小；
+* query\_timeout 开启落盘，查询时间可能会显著增加，query\_timeout需要进行调整。
+
+### Workload Group
 
-默认 Workload Group 的 memory_limit 默认是 30%，可按实际的 Workload Group 的数量合理修改。如果只有一个 
Workload Group，可以调整为 90%。
+* `max_``memory_`percent 默认workload group 
的`max_memory_`percent默认值是100%，可按实际的workload group的数量合理修改。如果只有一个workload 
group，可以调整为90%。
 
-```sql
-alter Workload Group normal properties ( 'memory_limit'='90%' );
+```Bash
+alter workload group normal properties ( 'max_memory_percent'='90%' );
 ```
 
-### 监测落盘
-#### 审计日志
+* `slot_memory_policy` 设置为`fixed`或者`dynamic`。具体含义参见`基于Slot的内存分配`章节。
 
-FE Audit Log 中增加了 SpillWriteBytesToLocalStorage 和 
SpillReadBytesFromLocalStorage 字段，分别表示落盘时写盘和读盘数据总量。
+```C++
+alter workload group normal properties ('slot_memory_policy'='dynamic');
+```
+
+## 监测落盘
+
+### 审计日志
 
-```sql
+FE audit 
log中增加了`SpillWriteBytesToLocalStorage`和`SpillReadBytesFromLocalStorage`字段，分别表示落盘时写盘和读盘数据总量。
+
+```Plain
 
SpillWriteBytesToLocalStorage=503412182|SpillReadBytesFromLocalStorage=503412182
 ```
 
-#### Profile
-如果查询过程中触发了落盘，在 Query Profile 中增加了 Spill 前缀的一些 Counter 进行标记和落盘相关 counter。以 
HashJoin 时 Build HashTable 为例，可以看到下面的 Counter：
+### Profile
+
+如果查询过程中触发了落盘，在Query Profile中增加了`Spill` 
前缀的一些Counter进行标记和落盘相关counter。以HashJoin时Build HashTable为例，可以看到下面的Counter：
 
-```sql
+```Bash
 PARTITIONED_HASH_JOIN_SINK_OPERATOR  (id=4  ,  nereids_id=179):(ExecTime:  
6sec351ms)
       -  Spilled:  true
       -  CloseTime:  528ns
@@ -133,13 +162,13 @@ PARTITIONED_HASH_JOIN_SINK_OPERATOR  (id=4  ,  
nereids_id=179):(ExecTime:  6sec3
       -  SpillWriteTime:  5sec549ms
 ```
 
-#### 系统表
+### 系统表
 
-##### backend_active_tasks
+#### backend\_active\_tasks
 
-增加了 `SPILL_WRITE_BYTES_TO_LOCAL_STORAGE` 和 
`SPILL_READ_BYTES_FROM_LOCAL_STORAGE` 字段，分别表示一个查询目前落盘中间结果写盘数据和读盘数据总量。
+增加了`SPILL_WRITE_BYTES_TO_LOCAL_STORAGE`和`SPILL_READ_BYTES_FROM_LOCAL_STORAGE`字段，分别表示一个查询目前落盘中间结果写盘数据和读盘数据总量。
 
-```sql
+```Bash
 mysql [information_schema]>select * from backend_active_tasks;
 
+-------+------------+-------------------+-----------------------------------+--------------+------------------+-----------+------------+----------------------+---------------------------+--------------------+-------------------+------------+------------------------------------+-------------------------------------+
 | BE_ID | FE_HOST    | WORKLOAD_GROUP_ID | QUERY_ID                          | 
TASK_TIME_MS | TASK_CPU_TIME_MS | SCAN_ROWS | SCAN_BYTES | BE_PEAK_MEMORY_BYTES 
| CURRENT_USED_MEMORY_BYTES | SHUFFLE_SEND_BYTES | SHUFFLE_SEND_ROWS | 
QUERY_TYPE | SPILL_WRITE_BYTES_TO_LOCAL_STORAGE | 
SPILL_READ_BYTES_FROM_LOCAL_STORAGE |
@@ -150,37 +179,31 @@ mysql [information_schema]>select * from 
backend_active_tasks;
 2 rows in set (0.00 sec)
 ```
 
-##### workload_group_resource_usage
-增加了 WRITE_BUFFER_USAGE_BYTES 字段，表示 Workload Group 中的导入任务 Memtable 内存占用。
-
-```sql
-mysql [information_schema]>select * from workload_group_resource_usage;
-+-------+-------------------+--------------------+-------------------+-----------------------------+------------------------------+--------------------------+
-| BE_ID | WORKLOAD_GROUP_ID | MEMORY_USAGE_BYTES | CPU_USAGE_PERCENT | 
LOCAL_SCAN_BYTES_PER_SECOND | REMOTE_SCAN_BYTES_PER_SECOND | 
WRITE_BUFFER_USAGE_BYTES |
-+-------+-------------------+--------------------+-------------------+-----------------------------+------------------------------+--------------------------+
-| 10009 |                 1 |          102314948 |              0.69 |         
                  0 |                            0 |                 23404836 |
-+-------+-------------------+--------------------+-------------------+-----------------------------+------------------------------+--------------------------+
-1 row in set (0.01 sec)
-```
+# 测试
+
+## 测试环境
+
+### 机器配置
 
-## 测试
-### 测试环境
-#### 机器配置
 测试使用阿里云服务器，具体配置如下。
 
 1FE:
-```
+
+```Bash
 16核(vCPU) 32 GiB 200 Mbps ecs.c6.4xlarge
 ```
 
 3BE:
-```
+
+```Bash
 16核(vCPU) 64 GiB 0 Mbps ecs.g6.4xlarge
 ```
 
-#### 测试数据
-测试数据使用 TPC-DS 10TB 作为数据输入，使用阿里云 DLF 数据源，使用 Catalog 的方式挂载到 Doris 内，SQL 语句如下：
-```sql
+### 测试数据
+
+测试数据使用TPC-DS 10TB作为数据输入，使用阿里云DLF数据源，使用Catalog的方式挂载到Doris 内，SQL 语句如下：
+
+```Bash
 CREATE CATALOG dlf PROPERTIES (
 "type"="hms",
 "hive.metastore.type" = "dlf",
@@ -188,117 +211,117 @@ CREATE CATALOG dlf PROPERTIES (
 "dlf.endpoint" = "dlf-vpc.cn-beijing.aliyuncs.com",
 "dlf.region" = "cn-beijing",
 "dlf.uid" = "217316283625971977",
-"dlf.catalog.id" = "emr_dev",
-"dlf.access_key" = "按情况填写",
-"dlf.secret_key" = "按情况填写"
+"dlf.catalog.id" = "emr_dev"
 );
 ```
 
-参考官网链接：https://doris.apache.org/zh-CN/docs/dev/benchmark/tpcds
-
-### 测试结果
-数据的规模是 10TB。内存和数据规模的比例是 1:52，整体运行时间 32000s，能够跑出所有的 99 条 
query。未来我们将对更多的算子提供落盘能力（如 window function，Intersect 
等），同时继续优化落盘情况下的性能，降低对磁盘的消耗，提升查询的稳定性。
-
-| query   |Time(ms)|
-|---------|---------|
-| query1  |25590|
-| query2  |126445|
-| query3  |103859|
-| query4  |1174702|
-| query5  |266281|
-| query6  |62950|
-| query7  |212745|
-| query8  |67000|
-| query9  |602291|
-| query10 |70904|
-| query11 |544436|
-| query12 |25759|
-| query13 |229144|
-| query14 |1120895|
-| query15 |29409|
-| query16 |117287|
-| query17 |260122|
-| query18 |97453|
-| query19 |127384|
-| query20 |32749|
-| query21 |4471|
-| query22 |10162|
-| query23 |1772561|
-| query24 |535506|
-| query25 |272458|
-| query26 |83342|
-| query27 |175264|
-| query28 |887007|
-| query29 |427229|
-| query30 |13661|
-| query31 |108778|
-| query32 |37303|
-| query33 |181351|
-| query34 |84159|
-| query35 |81701|
-| query36 |152990|
-| query37 |36815|
-| query38 |172531|
-| query39 |20155|
-| query40 |75749|
-| query41 |527|
-| query42 |95910|
-| query43 |66821|
-| query44 |209947|
-| query45 |26946|
-| query46 |131490|
-| query47 |158011|
-| query48 |149482|
-| query49 |303515|
-| query50 |298089|
-| query51 |156487|
-| query52 |97440|
-| query53 |98258|
-| query54 |202583|
-| query55 |93268|
-| query56 |185255|
-| query57 |80308|
-| query58 |252746|
-| query59 |171545|
-| query60 |202915|
-| query61 |272184|
-| query62 |38749|
-| query63 |94327|
-| query64 |247074|
-| query65 |270705|
-| query66 |101465|
-| query67 |3744186|
-| query68 |151543|
-| query69 |15559|
-| query70 |132505|
-| query71 |180079|
-| query72 |3085373|
-| query73 |82623|
-| query74 |330087|
-| query75 |830993|
-| query76 |188805|
-| query77 |239730|
-| query78 |1895765|
-| query79 |144829|
-| query80 |463652|
-| query81 |15319|
-| query82 |76961|
-| query83 |32437|
-| query84 |22849|
-| query85 |58186|
-| query86 |33933|
-| query87 |185421|
-| query88 |434867|
-| query89 |108265|
-| query90 |31131|
-| query91 |18864|
-| query92 |24510|
-| query93 |281904|
-| query94 |67761|
-| query95 |3738968|
-| query96 |47245|
-| query97 |536702|
-| query98 |97800|
-| query99 |62210|
-| sum     |31797707|
+参考官网链接: https://doris.apache.org/zh-CN/docs/dev/benchmark/tpcds
+
+## 测试结果
+
+### 单并发
+
+数据的规模是10TB。内存和数据规模的比例是1:52，整体运行时间28102.386s，能够跑出所有的99条query。未来我们将对更多的算子提供落盘能力（如window
 function， Intersect等），同时继续优化落盘情况下的性能，降低对磁盘的消耗，提升查询的稳定性。
 
+| Query    | Doris |
+| ---------- | ------- |
+| query1   | 29092 |
+| query2   | 130003 |
+| query3   | 96119 |
+| query4   | 1199097 |
+| query5   | 212719 |
+| query6   | 62259 |
+| query7   | 209154 |
+| query8   | 62433 |
+| query9   | 579371 |
+| query10  | 54260 |
+| query11  | 560169 |
+| query12  | 26084 |
+| query13  | 228756 |
+| query14  | 1137097 |
+| query15  | 27509 |
+| query16  | 84806 |
+| query17  | 288164 |
+| query18  | 94770 |
+| query19  | 124955 |
+| query20  | 30970 |
+| query21  | 4333 |
+| query22  | 9890 |
+| query23  | 1757755 |
+| query24  | 399553 |
+| query25  | 291474 |
+| query26  | 79832 |
+| query27  | 175894 |
+| query28  | 647497 |
+| query29  | 1299597 |
+| query30  | 11434 |
+| query31  | 106665 |
+| query32  | 33481 |
+| query33  | 146101 |
+| query34  | 84055 |
+| query35  | 69885 |
+| query36  | 148662 |
+| query37  | 21598 |
+| query38  | 164746 |
+| query39  | 5874 |
+| query40  | 51602 |
+| query41  | 563 |
+| query42  | 93005 |
+| query43  | 67769 |
+| query44  | 79527 |
+| query45  | 26575 |
+| query46  | 134991 |
+| query47  | 161873 |
+| query48  | 153657 |
+| query49  | 259387 |
+| query50  | 141421 |
+| query51  | 158056 |
+| query52  | 91392 |
+| query53  | 89497 |
+| query54  | 124118 |
+| query55  | 82584 |
+| query56  | 152110 |
+| query57  | 83417 |
+| query58  | 259580 |
+| query59  | 177125 |
+| query60  | 161729 |
+| query61  | 258058 |
+| query62  | 39619 |
+| query63  | 91258 |
+| query64  | 234882 |
+| query65  | 278610 |
+| query66  | 90246 |
+| query67  | 3939554 |
+| query68  | 183648 |
+| query69  | 11031 |
+| query70  | 137901 |
+| query71  | 166454 |
+| query72  | 2859001 |
+| query73  | 92015 |
+| query74  | 336694 |
+| query75  | 838989 |
+| query76  | 174235 |
+| query77  | 174525 |
+| query78  | 1956786 |
+| query79  | 162259 |
+| query80  | 602088 |
+| query81  | 16184 |
+| query82  | 56292 |
+| query83  | 26211 |
+| query84  | 11906 |
+| query85  | 57739 |
+| query86  | 34350 |
+| query87  | 173631 |
+| query88  | 449003 |
+| query89  | 113799 |
+| query90  | 30825 |
+| query91  | 12239 |
+| query92  | 26695 |
+| query93  | 275828 |
+| query94  | 56464 |
+| query95  | 64932 |
+| query96  | 48102 |
+| query97  | 597371 |
+| query98  | 112399 |
+| query99  | 64472 |
+| Sum      | 28102386 |
\ No newline at end of file
diff --git a/sidebars.json b/sidebars.json
index 5201a39c217..8221db02f7e 100644
--- a/sidebars.json
+++ b/sidebars.json
@@ -624,6 +624,7 @@
                         },
                         "admin-manual/workload-management/analysis-diagnosis",
                         
"admin-manual/workload-management/concurrency-control-and-queuing",
+                        "admin-manual/workload-management/spill-disk",
                         "admin-manual/workload-management/sql-blocking",
                         "admin-manual/workload-management/kill-query",
                         "admin-manual/workload-management/job-scheduler"


