This is an automated email from the ASF dual-hosted git repository.
morrysnow pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
new 4bda1650e12 [docs & fix](stats) Fix tablesample init failed and some
outdated contents in docs (#25603)
4bda1650e12 is described below
commit 4bda1650e12b6dd8b36db97493aa4b4b181306f1
Author: AKIRA <[email protected]>
AuthorDate: Wed Oct 25 17:38:00 2023 +0800
[docs & fix](stats) Fix tablesample init failed and some outdated contents
in docs (#25603)
---
docs/en/docs/query-acceleration/statistics.md | 8 ++++----
docs/zh-CN/docs/query-acceleration/statistics.md | 8 ++++----
.../main/java/org/apache/doris/statistics/BaseAnalysisTask.java | 2 +-
3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/docs/en/docs/query-acceleration/statistics.md
b/docs/en/docs/query-acceleration/statistics.md
index 4bb3d941b07..069c25fb1a8 100644
--- a/docs/en/docs/query-acceleration/statistics.md
+++ b/docs/en/docs/query-acceleration/statistics.md
@@ -75,7 +75,7 @@ This feature collects statistics only for tables and columns
that either have no
For tables with a large amount of data (default is 5GiB), Doris will
automatically use sampling to collect statistics, reducing the impact on the
system and completing the collection job as quickly as possible. Users can
adjust this behavior by setting the `huge_table_lower_bound_size_in_bytes` FE
parameter. If you want to collect statistics for all tables in full, you can
set the `enable_auto_sample` FE parameter to false. For tables with data size
greater than `huge_table_lower_bound_s [...]
-The default sample size for automatic sampling is 200,000 rows, but the actual
sample size may be larger due to implementation reasons. If you want to sample
more rows to obtain more accurate data distribution information, you can
configure the `auto_analyze_job_record_count` FE parameter.
+The default sample size for automatic sampling is 4194304 (2^22) rows, but the
actual sample size may be larger due to implementation reasons. If you want to
sample more rows to obtain more accurate data distribution information, you can
configure the `huge_table_default_sample_rows` FE parameter.
### Task Management
@@ -234,8 +234,6 @@ Automatic collection tasks do not support viewing the
completion status and fail
| statistics_simultaneously_running_task_num | After submitting
asynchronous jobs using `ANALYZE TABLE[DATABASE]`, this parameter limits the
number of columns that can be analyzed simultaneously. All asynchronous tasks
are collectively constrained by this parameter.
[...]
| analyze_task_timeout_in_minutes | Timeout for
AnalyzeTask execution.
| 12 hours
|
| stats_cache_size| The actual memory usage of statistics cache depends
heavily on the characteristics of the data because the average size of
maximum/minimum values and the number of buckets in histograms can vary greatly
in different datasets and scenarios. Additionally, factors like JVM versions
can also affect it. Below is the memory size occupied by statistics cache with
100,000 items. The average length of maximum/minimum values per item is 32, the
average length of column names is [...]
-|full_auto_analyze_start_time|Start time for automatic statistics
collection|00:00:00|
-|full_auto_analyze_end_time|End time for automatic statistics
collection|02:00:00|
|enable_auto_sample|Enable automatic sampling for large tables. When enabled,
statistics will be automatically collected through sampling for tables larger
than the `huge_table_lower_bound_size_in_bytes` threshold.| false|
|auto_analyze_job_record_count|Controls the persistence of records for
automatically triggered statistics collection jobs.|20000|
|huge_table_default_sample_rows|Defines the number of sample rows for large
tables when automatic sampling is enabled.|4194304|
@@ -249,7 +247,7 @@ Automatic collection tasks do not support viewing the
completion status and fail
|full_auto_analyze_end_time|End time for automatic statistics
collection|02:00:00|
|enable_full_auto_analyze|Enable automatic collection functionality|true|
-Please note that when both FE configuration and global session variables are
configured for the same parameter, the value of the global session variable
takes precedence.
+ATTENTION: The session variables listed above must be set globally using `SET
GLOBAL`.
## Usage Recommendations
@@ -273,6 +271,8 @@ The SQL execution time is controlled by the `query_timeout`
session variable, wh
When ANALYZE is executed, statistics data is written to the internal table
`__internal_schema.column_statistics`. FE checks the tablet status of this
table before executing ANALYZE. If there are unavailable tablets, the task is
rejected. Please check the BE cluster status if this error occurs.
+Users can use `SHOW BACKENDS\G` to verify the BE (Backend) status. If the BE
status is normal, you can use the command `ADMIN SHOW REPLICA STATUS FROM
__internal_schema.[tbl_in_this_db]` to check the tablet status within this
database, ensuring that the tablet status is also normal.
+
### Failure of ANALYZE on Large Tables
Due to resource limitations, ANALYZE on some large tables may timeout or
exceed BE memory limits. In such cases, it is recommended to use `ANALYZE ...
WITH SAMPLE...`.
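The sampling behavior documented above can be exercised directly. A brief sketch (table name is hypothetical; the parameter name comes from this commit, and the exact `ANALYZE` syntax may vary between Doris versions):

```sql
-- Collect statistics with explicit sampling, as recommended for large tables
-- by the doc's `ANALYZE ... WITH SAMPLE ...` guidance.
ANALYZE TABLE my_db.my_table WITH SAMPLE ROWS 4194304;

-- Raise the row count used by automatic sampling (the FE parameter
-- `huge_table_default_sample_rows` referenced in this change).
ADMIN SET FRONTEND CONFIG ("huge_table_default_sample_rows" = "8388608");
```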
diff --git a/docs/zh-CN/docs/query-acceleration/statistics.md
b/docs/zh-CN/docs/query-acceleration/statistics.md
index fd3066995e6..d9aac9b6780 100644
--- a/docs/zh-CN/docs/query-acceleration/statistics.md
+++ b/docs/zh-CN/docs/query-acceleration/statistics.md
@@ -75,7 +75,7 @@ ANALYZE < TABLE | DATABASE table_name | db_name >
对于数据量较大(默认为5GiB)的表,Doris会自动采取采样的方式去收集,以尽可能降低对系统造成的负担并尽快完成收集作业,用户可通过设置FE参数`huge_table_lower_bound_size_in_bytes`来调节此行为。如果希望对所有的表都采取全量收集,可配置FE参数`enable_auto_sample`为false。同时对于数据量大于`huge_table_lower_bound_size_in_bytes`的表,Doris保证其收集时间间隔不小于12小时(该时间可通过FE参数`huge_table_auto_analyze_interval_in_millis`控制)。
-自动采样默认采样200000行,但由于实现方式的原因实际采样数可能大于该值。如果希望采样更多的行以获得更准确的数据分布信息,可通过FE参数`auto_analyze_job_record_count`配置。
+自动采样默认采样4194304(2^22)行,但由于实现方式的原因实际采样数可能大于该值。如果希望采样更多的行以获得更准确的数据分布信息,可通过FE参数`huge_table_default_sample_rows`配置。
### 任务管理
@@ -246,8 +246,6 @@ SHOW AUTO ANALYZE [ptable_name]
| statistics_simultaneously_running_task_num | 通过`ANALYZE
TABLE[DATABASE]`提交异步作业后,可同时analyze的列的数量,所有异步任务共同受到该参数约束
| 5 |
| analyze_task_timeout_in_minutes | AnalyzeTask执行超时时间
| 12 hours |
|stats_cache_size|
统计信息缓存的实际内存占用大小高度依赖于数据的特性,因为在不同的数据集和场景中,最大/最小值的平均大小和直方图的桶数量会有很大的差异。此外,JVM版本等因素也会对其产生影响。下面给出统计信息缓存在包含100000个项目时所占用的内存大小。每个项目的最大/最小值的平均长度为32,列名的平均长度为16,统计信息缓存总共占用了61.2777404785MiB的内存。强烈不建议分析具有非常大字符串值的列,因为这可能导致FE内存溢出。
| 100000 |
-|full_auto_analyze_start_time|自动统计信息收集开始时间|00:00:00|
-|full_auto_analyze_end_time|自动统计信息收集结束时间|02:00:00|
|enable_auto_sample|是否开启大表自动sample,开启后对于大小超过huge_table_lower_bound_size_in_bytes会自动通过采样收集|
false|
|auto_analyze_job_record_count|控制统计信息的自动触发作业执行记录的持久化行数|20000|
|huge_table_default_sample_rows|定义开启大表自动sample后,对大表的采样行数|4194304|
@@ -261,7 +259,7 @@ SHOW AUTO ANALYZE [ptable_name]
|full_auto_analyze_end_time|自动统计信息收集结束时间|02:00:00|
|enable_full_auto_analyze|开启自动收集功能|true|
-注意,对于fe配置和全局会话变量中均可配置的参数都设置的情况下,优先使用全局会话变量参数值。
+注意:上面列出的会话变量必须通过`SET GLOBAL`全局设置。
## 使用建议
@@ -285,6 +283,8 @@ SQL执行时间受`query_timeout`会话变量控制,该变量默认值为300
执行ANALYZE时统计数据会被写入到内部表`__internal_schema.column_statistics`中,FE会在执行ANALYZE前检查该表tablet状态,如果存在不可用的tablet则拒绝执行任务。出现该报错请检查BE集群状态。
+用户可通过`SHOW BACKENDS\G`,确定BE状态是否正常。如果BE状态正常,可使用命令`ADMIN SHOW REPLICA STATUS
FROM __internal_schema.[tbl_in_this_db]`,检查该库下tablet状态,确保tablet状态正常。
+
### 大表ANALYZE失败
由于ANALYZE能够使用的资源受到比较严格的限制,对一些大表的ANALYZE操作有可能超时或者超出BE内存限制。这些情况下,建议使用 `ANALYZE
... WITH SAMPLE...`。
diff --git
a/fe/fe-core/src/main/java/org/apache/doris/statistics/BaseAnalysisTask.java
b/fe/fe-core/src/main/java/org/apache/doris/statistics/BaseAnalysisTask.java
index 2eac86bd917..ad74266a7c3 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/statistics/BaseAnalysisTask.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/statistics/BaseAnalysisTask.java
@@ -146,11 +146,11 @@ public abstract class BaseAnalysisTask {
}
protected void init(AnalysisInfo info) {
- tableSample = getTableSample();
DBObjects dbObjects =
StatisticsUtil.convertIdToObjects(info.catalogId, info.dbId, info.tblId);
catalog = dbObjects.catalog;
db = dbObjects.db;
tbl = dbObjects.table;
+ tableSample = getTableSample();
// External Table level task doesn't contain a column. Don't need to
do the column related analyze.
if (info.externalTableLevelTask) {
return;
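The Java change above moves `tableSample = getTableSample()` to after `tbl` is resolved, since computing the sample presumably depends on the table object; calling it first is what made tablesample init fail. A minimal, self-contained sketch of that ordering bug (hypothetical names, not the actual Doris classes):

```java
// Hypothetical reproduction of the init-order bug fixed in BaseAnalysisTask.
class Table {
    long rowCount() { return 10_000_000L; }
}

class AnalysisTask {
    Table tbl;
    Long sampleRows;

    // Buggy order: getTableSample() dereferences tbl before it is assigned.
    void initBuggy(Table t) {
        try {
            sampleRows = getTableSample(); // NPE: tbl is still null here
        } catch (NullPointerException e) {
            sampleRows = null; // sampling init fails
        }
        tbl = t;
    }

    // Fixed order, mirroring the commit: resolve tbl first, then the sample.
    void initFixed(Table t) {
        tbl = t;
        sampleRows = getTableSample();
    }

    Long getTableSample() {
        // Sample only tables above a size threshold (hypothetical logic).
        return tbl.rowCount() > 1_000_000L ? 4_194_304L : null;
    }
}
```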
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]