This is an automated email from the ASF dual-hosted git repository.
morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
new c78349a4c6 [Docs](statistics)Add external table statistic docs (#21567)
c78349a4c6 is described below
commit c78349a4c6dca41d4af29a7c497b2def09ab955b
Author: Jibing-Li <[email protected]>
AuthorDate: Thu Jul 13 17:54:34 2023 +0800
[Docs](statistics)Add external table statistic docs (#21567)
---
docs/en/docs/lakehouse/external_statistics.md | 277 +++++++++++++++++++++++
docs/en/docs/query-acceleration/statistics.md | 6 +-
docs/sidebars.json | 1 +
docs/zh-CN/docs/lakehouse/external_statistics.md | 274 ++++++++++++++++++++++
docs/zh-CN/docs/query-acceleration/statistics.md | 2 +
5 files changed, 559 insertions(+), 1 deletion(-)
diff --git a/docs/en/docs/lakehouse/external_statistics.md
b/docs/en/docs/lakehouse/external_statistics.md
new file mode 100644
index 0000000000..7329310f0b
--- /dev/null
+++ b/docs/en/docs/lakehouse/external_statistics.md
@@ -0,0 +1,277 @@
+---
+{
+ "title": "External Table Statistics",
+ "language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# External Table Statistics
+
+The collection method and content of external table statistics are basically the same as for internal tables. For details, please refer to [Internal table statistics](../query-acceleration/statistics.md). Collection is currently supported for Hive, Iceberg, and Hudi external tables.
+
+Features not yet supported for external tables include:
+
+1. Histogram collection
+2. Incremental collection and update of partitions
+3. Automatic collection (with auto); users can use periodic collection (with period) instead
+4. Sampling collection
+
+The following introduces usage examples and the implementation principles of external table statistics collection.
+
+## Usage examples
+
+Here are examples of collecting external table statistics by executing the analyze command in Doris. Apart from the four unsupported features mentioned above, usage is the same as for internal tables. The following uses the hive.tpch100 database as an example; the tpch100 database contains 8 tables including lineitem, orders, and region.
+
+### Collection of statistics
+
+Doris supports two collection methods for external tables: one-time manual collection and periodic collection.
+
+#### One-time manual collection
+
+- Collect the row count of the `lineitem` table and the statistics of all its columns:
+```
+mysql> ANALYZE TABLE hive.tpch100.lineitem;
++--------------+-------------------------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+| Catalog_Name | DB_Name | Table_Name | Columns
| Job_Id |
++--------------+-------------------------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+| hive | default_cluster:tpch100 | lineitem |
[l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
| 126039 |
++--------------+-------------------------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+1 row in set (0.03 sec)
+```
+This operation is performed asynchronously: a collection job is created in the background, and its progress can be viewed using the job_id.
+```
+mysql> SHOW ANALYZE 126039;
++--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+---------+---------------+
+| job_id | catalog_name | db_name | tbl_name | col_name
| job_type | analysis_type | message | last_exec_time_in_ms |
state | schedule_type |
++--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+---------+---------------+
+| 126039 | hive | default_cluster:tpch100 | lineitem |
[l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
| MANUAL | FUNDAMENTALS | | 2023-07-13 10:33:44 | PENDING | ONCE
|
++--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+---------+---------------+
+1 row in set (0.00 sec)
+```
+You can also view the task status for each column.
+```
+mysql> SHOW ANALYZE TASK STATUS 126039;
++---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+----------------------+----------+
+| task_id | col_name
| message | last_exec_time_in_ms |
state |
++---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+----------------------+----------+
+| 126040 | l_receiptdate
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126041 | l_returnflag
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126042 | l_tax
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126043 | l_shipmode
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126044 | l_suppkey
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126045 | l_shipdate
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126046 | l_commitdate
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126047 | l_partkey
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126048 | l_quantity
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126049 | l_orderkey
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126050 | l_comment
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126051 | l_linestatus
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126052 | l_extendedprice
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126053 | l_linenumber
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126054 | l_shipinstruct
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126055 | l_discount
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126056 |
[l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
| | 2023-07-13 10:33:56 | FINISHED |
++---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+----------------------+----------+
+17 rows in set (0.01 sec)
+```
+
+- Collect statistics about all tables in the tpch100 database
+
+```
+mysql> ANALYZE DATABASE hive.tpch100;
++--------------+---------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+| Catalog_Name | DB_Name | Table_Name | Columns
| Job_Id
|
++--------------+---------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+| hive | tpch100 | partsupp |
[ps_suppkey,ps_availqty,ps_comment,ps_partkey,ps_supplycost]
| 124192 |
+| hive | tpch100 | orders |
[o_orderstatus,o_clerk,o_orderdate,o_shippriority,o_custkey,o_totalprice,o_orderkey,o_comment,o_orderpriority]
| 124199 |
+| hive | tpch100 | lineitem |
[l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
| 124210 |
+| hive | tpch100 | part |
[p_partkey,p_container,p_name,p_comment,p_brand,p_type,p_retailprice,p_mfgr,p_size]
| 124228 |
+| hive | tpch100 | customer |
[c_custkey,c_phone,c_acctbal,c_mktsegment,c_address,c_nationkey,c_name,c_comment]
| 124239 |
+| hive | tpch100 | supplier |
[s_comment,s_phone,s_nationkey,s_name,s_address,s_acctbal,s_suppkey]
| 124249 |
+| hive | tpch100 | nation |
[n_comment,n_nationkey,n_regionkey,n_name]
| 124258 |
+| hive | tpch100 | region | [r_regionkey,r_comment,r_name]
| 124264
|
++--------------+---------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+8 rows in set (0.29 sec)
+```
+This operation submits collection jobs for all tables under the tpch100 database in batches and is also executed asynchronously. A job_id is created for each table, and each table's job and task progress can be viewed through its job_id.
+
+- Synchronous collection
+
+You can use `with sync` to collect table or database statistics synchronously. In this case, no background task is created; the client blocks until the collection job is completed and then returns.
+```
+mysql> analyze table hive.tpch100.orders with sync;
+Query OK, 0 rows affected (33.19 sec)
+```
+Note that synchronous collection is affected by the query_timeout session variable. If collection fails because of a timeout, increase the variable value and try again, for example:
+`set query_timeout=3600` (this sets the timeout to 1 hour)
+
+#### Periodic collection
+
+Use `with period` to create a periodically executed collection job.
+
+`analyze table hive.tpch100.orders with period 86400;`
+
+This statement creates a periodic collection job with a cycle of 1 day (86400 seconds): the statistics of the `orders` table are automatically collected and updated every 24 hours.
+
+### Job management
+
+Job management works the same as for internal tables, including viewing jobs, viewing tasks, and deleting jobs. Please refer to the job management section of [Internal table statistics](../query-acceleration/statistics.md).
+
+- Show all job status
+
+```
+mysql> SHOW ANALYZE;
++--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+----------+---------------+
+| job_id | catalog_name | db_name | tbl_name | col_name
| job_type | analysis_type | message | last_exec_time_in_ms |
state | schedule_type |
++--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+----------+---------------+
+| 12152 | hive | tpch100 | partsupp |
[ps_suppkey,ps_availqty,ps_comment,ps_partkey,ps_supplycost]
| MANUAL | FUNDAMENTALS | | 2023-07-11
15:57:16 | FINISHED | ONCE |
+| 12159 | hive | tpch100 | orders |
[o_orderstatus,o_clerk,o_orderdate,o_shippriority,o_custkey,o_totalprice,o_orderkey,o_comment,o_orderpriority]
| MANUAL | FUNDAMENTALS | | 2023-07-11 15:57:24 | FINISHED | ONCE
|
++--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+----------+---------------+
+```
+
+- Show all tasks status of a job
+
+```
+mysql> SHOW ANALYZE TASK STATUS 12152;
++---------+--------------------------------------------------------------+---------+----------------------+----------+
+| task_id | col_name |
message | last_exec_time_in_ms | state |
++---------+--------------------------------------------------------------+---------+----------------------+----------+
+| 12153 | ps_availqty |
| 2023-07-11 15:56:58 | FINISHED |
+| 12154 | ps_suppkey |
| 2023-07-11 15:56:57 | FINISHED |
+| 12155 | ps_comment |
| 2023-07-11 15:57:16 | FINISHED |
+| 12156 | ps_supplycost |
| 2023-07-11 15:56:57 | FINISHED |
+| 12157 | ps_partkey |
| 2023-07-11 15:56:58 | FINISHED |
+| 12158 | [ps_suppkey,ps_availqty,ps_comment,ps_partkey,ps_supplycost] |
| 2023-07-11 15:56:57 | FINISHED |
++---------+--------------------------------------------------------------+---------+----------------------+----------+
+```
+
+- Terminate unfinished jobs
+
+```
+KILL ANALYZE [job_id]
+```
+
+- Delete periodic collection job
+
+```
+DROP ANALYZE JOB [JOB_ID]
+```
+
+### Show statistics
+
+Doris supports showing table statistics (row count) and column statistics. Please refer to the view statistics section in [Internal Table Statistics](../query-acceleration/statistics.md).
+
+#### Table statistics
+
+```
+mysql> SHOW TABLE STATS hive.tpch100.orders;
++-----------+---------------------+---------------------+
+| row_count | update_time | last_analyze_time |
++-----------+---------------------+---------------------+
+| 150000000 | 2023-07-11 23:01:49 | 2023-07-11 23:01:44 |
++-----------+---------------------+---------------------+
+```
+
+#### Column statistics
+```
+SHOW COLUMN [cached] stats hive.tpch100.orders;
+```
+
+This shows the column statistics of a table. If the `cached` keyword is specified, the column statistics of the specified table that have been loaded into the cache are displayed.
+
+```
+mysql> SHOW COLUMN stats hive.tpch100.orders;
++-----------------+-------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
+| column_name | count | ndv | num_null | data_size |
avg_size_byte | min | max |
++-----------------+-------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
+| o_orderstatus | 1.5E8 | 3.0 | 0.0 | 1.50000001E8 |
1.0 | 'F' | 'P' |
+| o_clerk | 1.5E8 | 100836.0 | 0.0 | 2.250000015E9 |
15.0 | 'Clerk#000000001' | 'Clerk#000100000' |
+| o_orderdate | 1.5E8 | 2417.0 | 0.0 | 6.00000004E8 |
4.0 | '1992-01-01' | '1998-08-02' |
+| o_shippriority | 1.5E8 | 1.0 | 0.0 | 6.00000004E8 |
4.0 | 0 | 0 |
+| o_custkey | 1.5E8 | 1.0023982E7 | 0.0 | 6.00000004E8 |
4.0 | 1 | 14999999 |
+| o_totalprice | 1.5E8 | 3.4424096E7 | 0.0 | 1.200000008E9 |
8.0 | 811.73 | 591036.15 |
+| o_orderkey | 1.5E8 | 1.51621184E8 | 0.0 | 1.200000008E9 |
8.0 | 1 | 600000000 |
+| o_comment | 1.5E8 | 1.10204136E8 | 0.0 | 7.275038757500258E9 |
48.50025806 | ' Tiresias about the' | 'zzle? unusual requests w' |
+| o_orderpriority | 1.5E8 | 5.0 | 0.0 | 1.2600248124001656E9 |
8.40016536 | '1-URGENT' | '5-LOW' |
++-----------------+-------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
+```
+
+### Modify the statistics
+
+Users can manually modify column statistics, including the row_count, ndv, num_nulls, min_value, max_value, and data_size of a specified column.
+
+Please refer to Modify the statistics in [Internal Table
Statistics](../query-acceleration/statistics.md)
+
+```
+mysql> ALTER TABLE hive.tpch100.orders MODIFY COLUMN o_orderstatus SET STATS
('row_count'='6001215');
+Query OK, 0 rows affected (0.03 sec)
+
+mysql> SHOW COLUMN stats hive.tpch100.orders;
++-----------------+-----------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
+| column_name | count | ndv | num_null | data_size
| avg_size_byte | min | max |
++-----------------+-----------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
+| o_orderstatus | 6001215.0 | 0.0 | 0.0 | 0.0
| 0.0 | 'NULL' | 'NULL' |
+| o_clerk | 1.5E8 | 100836.0 | 0.0 | 2.250000015E9
| 15.0 | 'Clerk#000000001' | 'Clerk#000100000' |
+| o_orderdate | 1.5E8 | 2417.0 | 0.0 | 6.00000004E8
| 4.0 | '1992-01-01' | '1998-08-02' |
+| o_shippriority | 1.5E8 | 1.0 | 0.0 | 6.00000004E8
| 4.0 | 0 | 0 |
+| o_custkey | 1.5E8 | 1.0023982E7 | 0.0 | 6.00000004E8
| 4.0 | 1 | 14999999 |
+| o_totalprice | 1.5E8 | 3.4424096E7 | 0.0 | 1.200000008E9
| 8.0 | 811.73 | 591036.15 |
+| o_orderkey | 1.5E8 | 1.51621184E8 | 0.0 | 1.200000008E9
| 8.0 | 1 | 600000000 |
+| o_comment | 1.5E8 | 1.10204136E8 | 0.0 | 7.275038757500258E9
| 48.50025806 | ' Tiresias about the' | 'zzle? unusual requests w' |
+| o_orderpriority | 1.5E8 | 5.0 | 0.0 | 1.2600248124001656E9
| 8.40016536 | '1-URGENT' | '5-LOW' |
++-----------------+-----------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
+```
+
+### Delete statistics
+
+Users can delete the row count and column statistics of a table. If column names are specified, only the statistics of those columns are deleted; otherwise, the statistics of all columns of the table, as well as the table's row count, are deleted.
+
+Please refer to Delete statistics in [Internal Table
Statistics](../query-acceleration/statistics.md)
+
+- Delete statistics of the entire table
+
+```
+DROP STATS hive.tpch100.orders
+```
+
+- Delete the statistics of certain columns in the table
+
+```
+DROP STATS hive.tpch100.orders (o_orderkey, o_orderdate)
+```
+
+## Implementation principle
+### Statistics Data Sources
+
+The optimizer (Nereids) reads statistical information through the cache, and
there are two data sources for the cache to load data from.
+
+The first data source is the internal statistics table, which is populated when the user executes an analyze statement. This part works the same as for internal tables: users can execute the analyze statement on an external table to collect statistics just as they would for internal tables.
+
+Unlike internal tables, the statistics cache for external tables has a second data source: the stats collector. The stats collector defines interfaces for obtaining existing statistics from external catalogs, such as the Hive Metastore and Iceberg metadata. Take Hive as an example: if the user has already performed the analyze operation in Hive, then when querying in Doris, Doris can directly load the existing statistics.
+
+The stats collector runs automatically when there is no data in the statistics table. It is transparent to the user; no commands or settings are required.
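As a hypothetical sketch (illustrative Python, not Doris's actual Java internals; all class and method names here are invented for illustration), the stats collector can be thought of as one interface with an implementation per external catalog:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ColumnStats:
    """Minimal container for the statistics a collector may return."""
    row_count: float
    ndv: float        # number of distinct values
    num_nulls: float


class StatsCollector(ABC):
    """Interface for reading pre-existing statistics from an external catalog."""

    @abstractmethod
    def fetch(self, table: str, column: str) -> Optional[ColumnStats]:
        ...


class HiveMetastoreCollector(StatsCollector):
    """Returns statistics previously produced by running ANALYZE in Hive itself."""

    def __init__(self, hms_stats: Dict[Tuple[str, str], ColumnStats]):
        # A plain dict stands in for a real Hive Metastore client here.
        self.hms_stats = hms_stats

    def fetch(self, table: str, column: str) -> Optional[ColumnStats]:
        return self.hms_stats.get((table, column))
```

An Iceberg-backed collector would implement the same `fetch` interface against Iceberg table metadata; `fetch` returns `None` when the external catalog has no statistics for the column.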
+
+### Cache loading
+
+The cache loads first from the statistics table. If information is present there, the user has performed the analyze operation in Doris; statistics collected through analyze are the most accurate, so the statistics table is prioritized. If no information about the required table is found in the statistics table, Doris tries to obtain it from the external data source through the stats collector.
+
+Since the cache is loaded asynchronously, statistics may not be available for the first query on a table, because cache loading has only just been triggered at that point. In general, by the time a table is queried a second time, the optimizer can obtain its statistics from the cache.
diff --git a/docs/en/docs/query-acceleration/statistics.md
b/docs/en/docs/query-acceleration/statistics.md
index 30dd42b150..3280bfca15 100644
--- a/docs/en/docs/query-acceleration/statistics.md
+++ b/docs/en/docs/query-acceleration/statistics.md
@@ -80,12 +80,16 @@ Column statistics collection syntax:
```SQL
ANALYZE TABLE | DATABASE table_name | db_name
- [ (column_name [, ...]) ] [ [ WITH SYNC ] [ WITH INCREMENTAL ] [ WITH
SAMPLE PERCENT | ROWS ] [ WITH PERIOD ] [WITH HISTOGRAM]] [ PROPERTIES
("key" = "value", ...) ];
+ [ PARTITIONS (partition_name [, ...]) ]
+ [ (column_name [, ...]) ]
+ [ [ WITH SYNC ] [ WITH INCREMENTAL ] [ WITH SAMPLE PERCENT | ROWS ] [ WITH
PERIOD ] [WITH HISTOGRAM]]
+ [ PROPERTIES ("key" = "value", ...) ];
```
Explanation:
- Table_name: The specified target table. It can be in `db_name.table_name` form.
+- partition_name: The specified target partitions (for Hive external tables only). Must be partitions that exist in `table_name`. Multiple partition names are separated by commas, e.g. (nation=US/city=Washington).
- Column_name: The specified target column. Must be a column that exists in `table_name`. Multiple column names are separated by commas.
- Sync: Synchronizes the collection of statistics. Return after collection. If
not specified, it will be executed asynchronously and the job ID will be
returned.
- Incremental: Incrementally gather statistics. Incremental collection of
histogram statistics is not supported.
diff --git a/docs/sidebars.json b/docs/sidebars.json
index fc8323f689..3c0920b94e 100644
--- a/docs/sidebars.json
+++ b/docs/sidebars.json
@@ -214,6 +214,7 @@
},
"lakehouse/file",
"lakehouse/filecache",
+ "lakehouse/external_statistics",
"lakehouse/faq",
"lakehouse/fs_benchmark_tool"
]
diff --git a/docs/zh-CN/docs/lakehouse/external_statistics.md
b/docs/zh-CN/docs/lakehouse/external_statistics.md
new file mode 100644
index 0000000000..586571f5b6
--- /dev/null
+++ b/docs/zh-CN/docs/lakehouse/external_statistics.md
@@ -0,0 +1,274 @@
+---
+{
+ "title": "External Table Statistics",
+ "language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# External Table Statistics
+
+The collection method and content of external table statistics are basically the same as for internal tables. For details, please refer to [Internal table statistics](../query-acceleration/statistics.md). Collection is currently supported for Hive, Iceberg, and Hudi external tables.
+
+Features not yet supported for external tables include:
+
+1. Histogram collection
+2. Incremental collection and update of partitions
+3. Automatic collection (with auto); users can use periodic collection (with period) instead
+4. Sampling collection
+
+The following introduces usage examples and the implementation principles of external table statistics collection.
+
+## Usage examples
+
+Here are examples of collecting external table statistics by executing the analyze command in Doris. Apart from the four unsupported features mentioned above, usage is the same as for internal tables. The following uses the hive.tpch100 database as an example; the tpch100 database contains 8 tables including lineitem, orders, and region.
+
+### Collection of statistics
+
+External tables support two collection methods: one-time manual collection and periodic collection.
+
+#### One-time manual collection
+
+- Collect the row count of the lineitem table and the statistics of all its columns:
+```
+mysql> ANALYZE TABLE hive.tpch100.lineitem;
++--------------+-------------------------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+| Catalog_Name | DB_Name | Table_Name | Columns
| Job_Id |
++--------------+-------------------------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+| hive | default_cluster:tpch100 | lineitem |
[l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
| 126039 |
++--------------+-------------------------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+1 row in set (0.03 sec)
+```
+This operation is executed asynchronously: a collection job is created in the background, and its progress can be viewed using the job_id.
+```
+mysql> SHOW ANALYZE 126039;
++--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+---------+---------------+
+| job_id | catalog_name | db_name | tbl_name | col_name
| job_type | analysis_type | message | last_exec_time_in_ms |
state | schedule_type |
++--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+---------+---------------+
+| 126039 | hive | default_cluster:tpch100 | lineitem |
[l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
| MANUAL | FUNDAMENTALS | | 2023-07-13 10:33:44 | PENDING | ONCE
|
++--------+--------------+-------------------------+----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+---------+---------------+
+1 row in set (0.00 sec)
+```
+You can also view the task status for each column.
+```
+mysql> SHOW ANALYZE TASK STATUS 126039;
++---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+----------------------+----------+
+| task_id | col_name
| message | last_exec_time_in_ms |
state |
++---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+----------------------+----------+
+| 126040 | l_receiptdate
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126041 | l_returnflag
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126042 | l_tax
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126043 | l_shipmode
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126044 | l_suppkey
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126045 | l_shipdate
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126046 | l_commitdate
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126047 | l_partkey
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126048 | l_quantity
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126049 | l_orderkey
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126050 | l_comment
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126051 | l_linestatus
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126052 | l_extendedprice
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126053 | l_linenumber
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126054 | l_shipinstruct
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126055 | l_discount
| | 2023-07-13 10:33:44 |
RUNNING |
+| 126056 |
[l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
| | 2023-07-13 10:33:56 | FINISHED |
++---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+----------------------+----------+
+17 rows in set (0.01 sec)
+```
+
+- Collect statistics for all tables in the tpch100 database
+
+```
+mysql> ANALYZE DATABASE hive.tpch100;
++--------------+---------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+| Catalog_Name | DB_Name | Table_Name | Columns
| Job_Id
|
++--------------+---------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+| hive | tpch100 | partsupp |
[ps_suppkey,ps_availqty,ps_comment,ps_partkey,ps_supplycost]
| 124192 |
+| hive | tpch100 | orders |
[o_orderstatus,o_clerk,o_orderdate,o_shippriority,o_custkey,o_totalprice,o_orderkey,o_comment,o_orderpriority]
| 124199 |
+| hive | tpch100 | lineitem |
[l_returnflag,l_receiptdate,l_tax,l_shipmode,l_suppkey,l_shipdate,l_commitdate,l_partkey,l_orderkey,l_quantity,l_linestatus,l_comment,l_extendedprice,l_linenumber,l_discount,l_shipinstruct]
| 124210 |
+| hive | tpch100 | part |
[p_partkey,p_container,p_name,p_comment,p_brand,p_type,p_retailprice,p_mfgr,p_size]
| 124228 |
+| hive | tpch100 | customer |
[c_custkey,c_phone,c_acctbal,c_mktsegment,c_address,c_nationkey,c_name,c_comment]
| 124239 |
+| hive | tpch100 | supplier |
[s_comment,s_phone,s_nationkey,s_name,s_address,s_acctbal,s_suppkey]
| 124249 |
+| hive | tpch100 | nation |
[n_comment,n_nationkey,n_regionkey,n_name]
| 124258 |
+| hive | tpch100 | region | [r_regionkey,r_comment,r_name]
| 124264
|
++--------------+---------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
+8 rows in set (0.29 sec)
+```
+This operation submits collection jobs for all tables under the tpch100 database in batches and is also executed asynchronously. A job_id is created for each table, and each table's task progress can be viewed through its job_id.
+
+- Synchronous collection
+
+You can use `with sync` to collect table or database statistics synchronously. In this case, no background task is created; the client blocks until the collection job is completed and then returns.
+```
+mysql> analyze table hive.tpch100.orders with sync;
+Query OK, 0 rows affected (33.19 sec)
+```
+Note that synchronous collection is affected by the query_timeout session variable. If collection fails because of a timeout, increase the variable value and try again, for example:
+`set query_timeout=3600` (this sets the timeout to 1 hour)
+
+#### Periodic collection
+
+Use `with period` to create a periodically executed collection job:
+
+`analyze table hive.tpch100.orders with period 86400;`
+
+This statement creates a periodic collection job with a cycle of 1 day: the statistics of the orders table are automatically collected and updated every 24 hours.
+
+### Job Management
+
+Jobs are managed the same way as for internal tables, mainly viewing jobs, viewing tasks, and deleting jobs. See the job management section of [internal table statistics](../query-acceleration/statistics.md).
+
+- View the status of all jobs
+
+```
+mysql> SHOW ANALYZE;
++--------+--------------+---------+----------+----------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+----------+---------------+
+| job_id | catalog_name | db_name | tbl_name | col_name                                                                                                       | job_type | analysis_type | message | last_exec_time_in_ms | state    | schedule_type |
++--------+--------------+---------+----------+----------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+----------+---------------+
+| 12152  | hive         | tpch100 | partsupp | [ps_suppkey,ps_availqty,ps_comment,ps_partkey,ps_supplycost]                                                   | MANUAL   | FUNDAMENTALS  |         | 2023-07-11 15:57:16  | FINISHED | ONCE          |
+| 12159  | hive         | tpch100 | orders   | [o_orderstatus,o_clerk,o_orderdate,o_shippriority,o_custkey,o_totalprice,o_orderkey,o_comment,o_orderpriority] | MANUAL   | FUNDAMENTALS  |         | 2023-07-11 15:57:24  | FINISHED | ONCE          |
++--------+--------------+---------+----------+----------------------------------------------------------------------------------------------------------------+----------+---------------+---------+----------------------+----------+---------------+
+```
+
+- View the status of all tasks within a job
+
+```
+mysql> SHOW ANALYZE TASK STATUS 12152;
++---------+--------------------------------------------------------------+---------+----------------------+----------+
+| task_id | col_name                                                     | message | last_exec_time_in_ms | state    |
++---------+--------------------------------------------------------------+---------+----------------------+----------+
+| 12153   | ps_availqty                                                  |         | 2023-07-11 15:56:58  | FINISHED |
+| 12154   | ps_suppkey                                                   |         | 2023-07-11 15:56:57  | FINISHED |
+| 12155   | ps_comment                                                   |         | 2023-07-11 15:57:16  | FINISHED |
+| 12156   | ps_supplycost                                                |         | 2023-07-11 15:56:57  | FINISHED |
+| 12157   | ps_partkey                                                   |         | 2023-07-11 15:56:58  | FINISHED |
+| 12158   | [ps_suppkey,ps_availqty,ps_comment,ps_partkey,ps_supplycost] |         | 2023-07-11 15:56:57  | FINISHED |
++---------+--------------------------------------------------------------+---------+----------------------+----------+
+```
+
+- Kill an unfinished job
+
+```
+KILL ANALYZE [job_id]
+```
+
+- Drop a periodic collection job
+
+```
+DROP ANALYZE JOB [JOB_ID]
+```
+
+### Viewing Statistics
+
+You can view both table statistics (the table row count) and column statistics. See the viewing statistics section of [internal table statistics](../query-acceleration/statistics.md).
+
+#### Table statistics
+
+```
+mysql> SHOW TABLE STATS hive.tpch100.orders;
++-----------+---------------------+---------------------+
+| row_count | update_time | last_analyze_time |
++-----------+---------------------+---------------------+
+| 150000000 | 2023-07-11 23:01:49 | 2023-07-11 23:01:44 |
++-----------+---------------------+---------------------+
+```
+
+#### Column statistics
+```
+SHOW COLUMN [cached] stats hive.tpch100.orders;
+```
+
+This shows the column statistics of the specified table stored in the statistics table. If the `cached` keyword is given, it instead shows the column statistics of the table that have been loaded into the cache.
+
+```
+mysql> SHOW COLUMN stats hive.tpch100.orders;
++-----------------+-------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
+| column_name     | count | ndv          | num_null | data_size            | avg_size_byte | min                   | max                        |
++-----------------+-------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
+| o_orderstatus   | 1.5E8 | 3.0          | 0.0      | 1.50000001E8         | 1.0           | 'F'                   | 'P'                        |
+| o_clerk         | 1.5E8 | 100836.0     | 0.0      | 2.250000015E9        | 15.0          | 'Clerk#000000001'     | 'Clerk#000100000'          |
+| o_orderdate     | 1.5E8 | 2417.0       | 0.0      | 6.00000004E8         | 4.0           | '1992-01-01'          | '1998-08-02'               |
+| o_shippriority  | 1.5E8 | 1.0          | 0.0      | 6.00000004E8         | 4.0           | 0                     | 0                          |
+| o_custkey       | 1.5E8 | 1.0023982E7  | 0.0      | 6.00000004E8         | 4.0           | 1                     | 14999999                   |
+| o_totalprice    | 1.5E8 | 3.4424096E7  | 0.0      | 1.200000008E9        | 8.0           | 811.73                | 591036.15                  |
+| o_orderkey      | 1.5E8 | 1.51621184E8 | 0.0      | 1.200000008E9        | 8.0           | 1                     | 600000000                  |
+| o_comment       | 1.5E8 | 1.10204136E8 | 0.0      | 7.275038757500258E9  | 48.50025806   | ' Tiresias about the' | 'zzle? unusual requests w' |
+| o_orderpriority | 1.5E8 | 5.0          | 0.0      | 1.2600248124001656E9 | 8.40016536    | '1-URGENT'            | '5-LOW'                    |
++-----------------+-------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
+```
+
+### Modifying Statistics
+
+Users can manually modify column statistics, including the row_count, ndv, num_nulls, min_value, max_value, and data_size of a given column.
+See the modifying statistics section of [internal table statistics](../query-acceleration/statistics.md).
+
+```
+mysql> ALTER TABLE hive.tpch100.orders MODIFY COLUMN o_orderstatus SET STATS ('row_count'='6001215');
+Query OK, 0 rows affected (0.03 sec)
+
+mysql> SHOW COLUMN stats hive.tpch100.orders;
++-----------------+-----------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
+| column_name     | count     | ndv          | num_null | data_size            | avg_size_byte | min                   | max                        |
++-----------------+-----------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
+| o_orderstatus   | 6001215.0 | 0.0          | 0.0      | 0.0                  | 0.0           | 'NULL'                | 'NULL'                     |
+| o_clerk         | 1.5E8     | 100836.0     | 0.0      | 2.250000015E9        | 15.0          | 'Clerk#000000001'     | 'Clerk#000100000'          |
+| o_orderdate     | 1.5E8     | 2417.0       | 0.0      | 6.00000004E8         | 4.0           | '1992-01-01'          | '1998-08-02'               |
+| o_shippriority  | 1.5E8     | 1.0          | 0.0      | 6.00000004E8         | 4.0           | 0                     | 0                          |
+| o_custkey       | 1.5E8     | 1.0023982E7  | 0.0      | 6.00000004E8         | 4.0           | 1                     | 14999999                   |
+| o_totalprice    | 1.5E8     | 3.4424096E7  | 0.0      | 1.200000008E9        | 8.0           | 811.73                | 591036.15                  |
+| o_orderkey      | 1.5E8     | 1.51621184E8 | 0.0      | 1.200000008E9        | 8.0           | 1                     | 600000000                  |
+| o_comment       | 1.5E8     | 1.10204136E8 | 0.0      | 7.275038757500258E9  | 48.50025806   | ' Tiresias about the' | 'zzle? unusual requests w' |
+| o_orderpriority | 1.5E8     | 5.0          | 0.0      | 1.2600248124001656E9 | 8.40016536    | '1-URGENT'            | '5-LOW'                    |
++-----------------+-----------+--------------+----------+----------------------+---------------+-----------------------+----------------------------+
+```
+
+### Dropping Statistics
+
+Dropping external table statistics removes a table's row count and column statistics. If column names are specified, only the statistics of those columns are dropped; otherwise the statistics of all columns, along with the table row count, are dropped.
+See the dropping statistics section of [internal table statistics](../query-acceleration/statistics.md).
+
+- Drop all statistics of a table
+
+```
+DROP STATS hive.tpch100.orders
+```
+
+- Drop statistics of specific columns
+
+```
+DROP STATS hive.tpch100.orders (o_orderkey, o_orderdate)
+```
+
+## How It Works
+### Sources of statistics
+
+The optimizer (Nereids) reads statistics through a cache, which is populated from two sources.
+
+The first is the internal statistics table, whose data is collected by user-issued analyze statements. This part of the architecture is the same as for internal tables: users can run analyze on an external table to collect statistics, just as they would on an internal one.
+
+Unlike internal tables, the external table cache has a second data source: the stats collector. The stats collector defines a set of interfaces for fetching statistics from external data sources; Hive Metastore and Iceberg are currently supported. Through these interfaces, statistics that already exist in the external source can be retrieved. Taking Hive as an example, if the user has run analyze in Hive, then at query time Doris can load the existing statistics, including table row counts and column min/max values, directly from the Hive Metastore into the cache. If the external source has no statistics either, the stats collector roughly estimates a row count for the optimizer from the size of the table's data files and the table schema. In this case column statistics are missing, which may lead the optimizer to generate a less efficient execution plan.
+
+The stats collector runs automatically when the statistics table has no data. It is transparent to users, who do not need to run any command or change any setting.
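
As a rough illustration of that last-resort estimate, the sketch below derives a row count from a table's total data file size and its schema. This is illustrative pseudologic under assumptions, not Doris source code: the `TYPE_WIDTH` byte widths and the `estimate_row_count` helper are hypothetical, and the 6.6 GB figure is made up.

```python
# Illustrative sketch (assumed helpers, not Doris source code): estimate a
# table's row count from its total data file size and schema, as a stats
# collector might when neither Doris nor the external source has statistics.

# Assumed average on-disk bytes per value for a few common types.
TYPE_WIDTH = {"int": 4, "bigint": 8, "date": 4, "decimal": 8, "string": 20}

def estimate_row_count(total_file_bytes, schema):
    """schema: list of (column_name, type_name) pairs."""
    # Approximate bytes per row as the sum of the assumed column widths.
    row_width = sum(TYPE_WIDTH.get(type_name, 8) for _, type_name in schema)
    return max(1, total_file_bytes // row_width)

# Hypothetical slice of the orders schema with a made-up total file size.
orders_schema = [("o_orderkey", "bigint"), ("o_custkey", "int"),
                 ("o_orderstatus", "string"), ("o_totalprice", "decimal"),
                 ("o_orderdate", "date")]
print(estimate_row_count(6_600_000_000, orders_schema))  # 150000000
```

Because only file sizes and the schema go into this number, the estimate carries no per-column information, which is why the optimizer can still choose a poor plan in this case.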
+
+### Cache Loading
+
+The cache is loaded in the following order. The statistics table is consulted first: if it contains statistics for the table, the user has run analyze in Doris, and statistics collected that way are the most accurate, so they are preferred. If the statistics table has no data for the table, the stats collector fetches statistics from the external data source; if the external source has none either, the stats collector estimates a row count.
+Because the cache is loaded asynchronously, the first query against a table may not be able to use any statistics, since the cache load has only just been triggered. In general, though, by the second query against a table the optimizer can obtain its statistics from the cache.
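
The loading order above can be sketched as a simple fallback chain. The helper names (`read_statistics_table`, `fetch_from_external_source`, `estimate_row_count_from_files`) are assumptions for illustration, not the actual Doris internals.

```python
# Illustrative sketch of the cache-loading fallback order (assumed helper
# names, not Doris source code): statistics table first, then the stats
# collector's external source, then a file-size-based row count estimate.

def load_stats_into_cache(table, read_statistics_table,
                          fetch_from_external_source,
                          estimate_row_count_from_files):
    # 1. Prefer statistics collected in Doris via analyze (most accurate).
    stats = read_statistics_table(table)
    if stats is not None:
        return stats
    # 2. Otherwise fetch existing statistics from the external source
    #    (e.g. Hive Metastore or Iceberg).
    stats = fetch_from_external_source(table)
    if stats is not None:
        return stats
    # 3. Last resort: an estimated row count; column statistics stay missing.
    return {"row_count": estimate_row_count_from_files(table)}

# Example: nothing collected anywhere, so the estimate is used.
stats = load_stats_into_cache(
    "hive.tpch100.orders",
    read_statistics_table=lambda t: None,
    fetch_from_external_source=lambda t: None,
    estimate_row_count_from_files=lambda t: 150_000_000,
)
print(stats)  # {'row_count': 150000000}
```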
diff --git a/docs/zh-CN/docs/query-acceleration/statistics.md
b/docs/zh-CN/docs/query-acceleration/statistics.md
index 901ff14b21..f90e02b9b5 100644
--- a/docs/zh-CN/docs/query-acceleration/statistics.md
+++ b/docs/zh-CN/docs/query-acceleration/statistics.md
@@ -80,6 +80,7 @@ The Doris query optimizer uses statistics to determine the most efficient execution plan for a query
```SQL
ANALYZE TABLE | DATABASE table_name | db_name
+ [ PARTITIONS (partition_name [, ...]) ]
[ (column_name [, ...]) ]
    [ [ WITH SYNC ] [ WITH INCREMENTAL ] [ WITH SAMPLE PERCENT | ROWS ] [ WITH PERIOD ] [WITH HISTOGRAM]]
[ PROPERTIES ("key" = "value", ...) ];
@@ -88,6 +89,7 @@ ANALYZE TABLE | DATABASE table_name | db_name
Where:
- table_name: The target table. It can be in `db_name.table_name` form.
+- partition_name: The target partitions (currently only supported for Hive external tables). They must be partitions that exist in `table_name`; separate multiple partition names with commas. Example partition names: event_date=20230706, nation=CN/city=Beijing
- column_name: The target columns. They must be columns that exist in `table_name`; separate multiple column names with commas.
- sync: Collect statistics synchronously and return when collection finishes. If not specified, the job runs asynchronously and a task ID is returned.
- incremental: Collect statistics incrementally. Incremental collection of histogram statistics is not supported.