[doris] branch master updated: [enhancement][docs] add docs for newly added two compaction method (#16529) (#16530)

morningman Wed, 08 Feb 2023 17:07:45 -0800

This is an automated email from the ASF dual-hosted git repository.

morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git



The following commit(s) were added to refs/heads/master by this push:
     new e6b0d94459 [enhancement][docs] add docs for newly added two compaction 
method (#16529) (#16530)
e6b0d94459 is described below

commit e6b0d94459eebf83dbf9053599c04aedd9b7781b
Author: zhengyu <[email protected]>
AuthorDate: Thu Feb 9 09:07:33 2023 +0800

    [enhancement][docs] add docs for newly added two compaction method (#16529) 
(#16530)
    
    Co-authored-by: yixiutt <[email protected]>
    Co-authored-by: zhengyu <[email protected]>
    
    Signed-off-by: freemandealer <[email protected]>
---
 docs/en/docs/advanced/best-practice/compaction.md  | 71 ++++++++++++++++++++++
 docs/sidebars.json                                 |  3 +-
 .../docs/advanced/best-practice/compaction.md      | 67 ++++++++++++++++++++
 3 files changed, 140 insertions(+), 1 deletion(-)

diff --git a/docs/en/docs/advanced/best-practice/compaction.md 
b/docs/en/docs/advanced/best-practice/compaction.md
new file mode 100644
index 0000000000..b36e37c290
--- /dev/null
+++ b/docs/en/docs/advanced/best-practice/compaction.md
@@ -0,0 +1,71 @@
+---
+{
+    "title": "Compaction",
+    "language": "en"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+
+# Compaction
+
+Doris writes data through a structure similar to an LSM-Tree and continuously merges small files into large ordered files through background compaction. Compaction also handles operations such as deletions and updates.
+
+Appropriately adjusting the compaction strategy can greatly improve load and 
query efficiency. Doris provides the following two compaction strategies for 
tuning:
+
+
+## Vertical compaction
+
+Vertical compaction is a new compaction algorithm implemented in Doris 2.0. It optimizes compaction execution efficiency and resource overhead in large-scale, wide-table scenarios, and can effectively reduce the memory overhead of compaction and improve its execution speed. Test results show that vertical compaction uses only 1/10 of the memory of the original compaction algorithm and that the compaction rate increases by 15%.
+
+In vertical compaction, merging by row is changed to merging by column group. Each merge operates on one column group, which reduces the amount of data involved in a single compaction and reduces memory usage during compaction.
+
+BE configuration:
+- `enable_vertical_compaction = true` turns on vertical compaction.
+- `vertical_compaction_num_columns_per_group = 5` sets the number of columns in each column group. In testing, the default of 5 columns per group gave a good balance of efficiency and memory usage.
+- `vertical_compaction_max_segment_size` configures the size of the disk file after vertical compaction. The default value is 268435456 (bytes).
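+
+As a minimal sketch, the three options above could be set together in the BE configuration file. The file name `be.conf` is an assumption about the deployment, not something stated in this section, and the values are simply the ones quoted above.
+
+```
+# be.conf (assumed file name): vertical compaction settings
+enable_vertical_compaction = true
+# number of columns merged together in one column group
+vertical_compaction_num_columns_per_group = 5
+# size of the disk file after vertical compaction, in bytes (268435456 bytes = 256 MiB)
+vertical_compaction_max_segment_size = 268435456
+```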
+
+
+## Segment compaction
+
+Segment compaction mainly deals with large-scale data loads. Unlike normal compaction and vertical compaction, segment compaction operates during the load process and compacts segments inside the load job. This mechanism can effectively reduce the number of generated segments and avoid the -238 (OLAP_ERR_TOO_MANY_SEGMENTS) error.
+
+The following features are provided by segment compaction:
+- reduce the number of segments generated by load
+- the compacting process is parallel to the load process, which will not 
increase the load time
+- memory consumption and computing resources will increase during loading, but 
the increase is relatively low because it is evenly distributed throughout the 
long load process.
+- data after segment compaction will have resource and performance advantages 
in subsequent queries and normal compaction.
+
+BE configuration:
+- `enable_segcompaction=true` turns it on.
+- `segcompaction_threshold_segment_num` configures the interval for merging. The default value of 10 means that every 10 segment files trigger a segment compaction. A value between 10 and 30 is recommended; larger values increase the memory usage of segment compaction.
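+
+A minimal sketch of the corresponding settings, assuming they are placed in the BE configuration file (`be.conf` is an assumed file name) and using the default threshold described above:
+
+```
+# be.conf (assumed file name): segment compaction settings
+enable_segcompaction = true
+# every 10 segment files trigger a segment compaction; 10 to 30 is the recommended range
+segcompaction_threshold_segment_num = 10
+```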
+
+Situations where segment compaction is recommended:
+
+- Loading a large amount of data fails with the OLAP_ERR_TOO_MANY_SEGMENTS (errcode -238) error. In this case, it is recommended to turn on segment compaction to reduce the number of segments during the load process.
+- Too many small files are generated during the load process: even when the amount of loaded data is reasonable, low-cardinality data or memory constraints can cause the memtable to be flushed early, which generates a large number of small segment files and may also fail the load job. In this case, it is recommended to turn on this function.
+- Querying immediately after loading: when the load has just finished and standard compaction has not yet finished, a large number of segment files will reduce the efficiency of subsequent queries. If the user needs to query immediately after loading, it is recommended to turn on this function.
+- The pressure of normal compaction is high after loading: segment compaction moves part of the pressure of normal compaction onto the loading process, where it is spread evenly. In this case, it is recommended to enable this function.
+
+Situations where segment compaction is not recommended:
+- When the load operation itself has already exhausted memory resources, segment compaction is not recommended, because it would further increase memory pressure and could cause the load job to fail.
+
+Refer to this [link](https://github.com/apache/doris/pull/12866) for more 
information about implementation and test results.
diff --git a/docs/sidebars.json b/docs/sidebars.json
index 446fc0a8b1..565f5e46d0 100644
--- a/docs/sidebars.json
+++ b/docs/sidebars.json
@@ -166,7 +166,8 @@
                     "items": [
                         "advanced/best-practice/query-analysis",
                         "advanced/best-practice/import-analysis",
-                        "advanced/best-practice/debug-log"
+                        "advanced/best-practice/debug-log",
+                        "advanced/best-practice/compaction"
                     ]
                 },
                 "advanced/resource",
diff --git a/docs/zh-CN/docs/advanced/best-practice/compaction.md 
b/docs/zh-CN/docs/advanced/best-practice/compaction.md
new file mode 100644
index 0000000000..b6cf8e4788
--- /dev/null
+++ b/docs/zh-CN/docs/advanced/best-practice/compaction.md
@@ -0,0 +1,67 @@
+---
+{
+    "title": "Compaction",
+    "language": "zh-CN"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Compaction
+
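+Doris writes data through a structure similar to an LSM-Tree. In the background, the compaction mechanism continuously merges small files into large ordered files and also handles operations such as data deletion and updates. Tuning the compaction strategy appropriately can greatly improve load and query efficiency.
+Doris provides the following two compaction methods for tuning: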
+
+
+## Vertical compaction
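+Vertical compaction is a new compaction algorithm implemented in Doris 2.0. It addresses compaction execution efficiency and resource overhead in large wide-table scenarios, and can effectively reduce the memory overhead of compaction and improve its execution speed.
+In actual tests, vertical compaction used only 1/10 of the memory of the original compaction algorithm, and the compaction rate improved by 15%.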
+
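+Vertical compaction changes merging by row into merging by column group. Each merge operates on one column group, which reduces the amount of data involved in a single compaction and reduces memory usage during compaction.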
+
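+BE configuration:
+- `enable_vertical_compaction = true` enables this feature.
+- `vertical_compaction_num_columns_per_group = 5` sets the number of columns in each column group. In testing, the default of 5 columns per group gave a good balance of compaction efficiency and memory usage.
+- `vertical_compaction_max_segment_size` configures the size of the files written to disk after vertical compaction. The default value is 268435456 (bytes).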
+
+
+## Segment compaction
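+Segment compaction mainly targets loads that carry a large amount of data in a single batch. Its trigger mechanism differs from that of vertical compaction: segment compaction runs during the load process and merges multiple segments within one batch of data. This mechanism can effectively reduce the number of segments finally generated and avoid the -238 (OLAP_ERR_TOO_MANY_SEGMENTS) error.
+Segment compaction has the following characteristics: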
+
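+- It reduces the number of segments generated by a load.
+- The merge process runs in parallel with the load process, so it does not add to the load time.
+- Memory and compute usage during the load increases, but the increase is small because it is spread across the whole load process.
+- Data that has gone through segment compaction has resource and performance advantages in subsequent queries and standard compaction.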
+
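+How to enable and configure it (BE configuration):
+- `enable_segcompaction = true` enables this feature.
+- `segcompaction_threshold_segment_num` configures the merge interval. The default value of 10 means that a segment compaction runs each time 10 segment files have been generated. A value between 10 and 30 is typical; larger values increase the memory usage of segment compaction.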
+
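+Enabling this feature is recommended in the following scenarios:
+- Loading a large amount of data triggers the OLAP_ERR_TOO_MANY_SEGMENTS (errcode -238) error and the load fails. In this case, it is recommended to turn on segment compaction, which merges segments during the load to limit the final segment count.
+- A large number of small files is generated during the load: even though the amount of loaded data is not large, low-cardinality data or memory pressure that triggers early memtable flushes can produce many small segment files, which may also trigger OLAP_ERR_TOO_MANY_SEGMENTS and fail the load. In this case, it is recommended to turn on this feature.
+- Querying immediately after loading a large amount of data: right after the load completes and before standard compaction has finished its work, too many segment files will hurt the efficiency of subsequent queries. If users need to query immediately after loading, it is recommended to turn on this feature.
+- Standard compaction is under heavy pressure after loading: segment compaction essentially moves part of the standard compaction workload into the load process. In this case, it is recommended to turn on this feature.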
+
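+Situations where it is not recommended:
+- When the load operation itself has already exhausted memory resources, segment compaction is not recommended, because it would further increase memory pressure and could cause the load to fail.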
+
+For the implementation and test results of segment compaction, see [this link](https://github.com/apache/doris/pull/12866).

