This is an automated email from the ASF dual-hosted git repository.
luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 9a4fea77de2 [blog](update) Update data partition blog (#996)
9a4fea77de2 is described below
commit 9a4fea77de2beaf9900e5569b53585ded62d33f4
Author: KassieZ <[email protected]>
AuthorDate: Thu Aug 15 10:29:14 2024 +0800
[blog](update) Update data partition blog (#996)
---
blog/apache-doris-vs-rockset.md | 2 +-
blog/auto-partition-in-apache-doris.md | 483 +++++++++++++++++++++
...olution-of-the-apache-doris-execution-engine.md | 2 -
blog/migrate-lakehouse-from-bigquery-to-doris.md | 2 +-
blog/release-note-2.0.14.md | 2 +-
src/components/recent-blogs/recent-blogs.data.ts | 10 +-
src/constant/newsletter.data.ts | 15 +-
static/images/auto-partition-in-apache-doris.jpg | Bin 0 -> 540079 bytes
.../images/auto-partition-ingestion-workflow-2.png | Bin 0 -> 85479 bytes
.../images/auto-partition-ingestion-workflow.png | Bin 0 -> 125212 bytes
static/images/auto-partition-performance-2.png | Bin 0 -> 73728 bytes
static/images/auto-partition-performance.png | Bin 0 -> 167793 bytes
...olution-of-partitioning-strategies-in-Doris.png | Bin 0 -> 80474 bytes
static/images/performance-comparison.png | Bin 0 -> 136496 bytes
14 files changed, 499 insertions(+), 17 deletions(-)
diff --git a/blog/apache-doris-vs-rockset.md b/blog/apache-doris-vs-rockset.md
index 79e7e82d98a..a7a443be944 100644
--- a/blog/apache-doris-vs-rockset.md
+++ b/blog/apache-doris-vs-rockset.md
@@ -7,7 +7,7 @@
'author': 'Zaki Lu',
'tags': ['Top News'],
'picked': "true",
- 'order': "3",
+ 'order': "4",
"image": '/images/doris-vs-rockset.jpeg'
}
diff --git a/blog/auto-partition-in-apache-doris.md
b/blog/auto-partition-in-apache-doris.md
new file mode 100644
index 00000000000..ec1be7fe55a
--- /dev/null
+++ b/blog/auto-partition-in-apache-doris.md
@@ -0,0 +1,483 @@
+---
+{
+ 'title': 'Automatic and flexible data sharding: Auto Partition in Apache
Doris',
+ 'summary': "Apache Doris 2.1.0 introduces Auto Partition. It supports
partitioning data by RANGE or by LIST and further enhances flexibility on top
of automatic partitioning.",
+ 'description': "Apache Doris 2.1.0 introduces Auto Partition. It supports
partitioning data by RANGE or by LIST and further enhances flexibility on top
of automatic partitioning.",
+ 'date': '2024-08-14',
+ 'author': 'Apache Doris',
+ 'tags': ['Tech Sharing'],
+ 'picked': "true",
+ 'order': "1",
+ "image": '/images/auto-partition-in-apache-doris.jpg'
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+To handle large datasets, distributed databases introduce strategies like
partitioning and bucketing. Data is divided into smaller units based on
specific rules and distributed across different nodes, so databases can perform
parallel processing for higher performance and data management flexibility.
+
+Like in many databases, [Apache Doris](https://doris.apache.org) shards data
into partitions, and then a partition is further divided into buckets.
**Partitions** are typically defined by time or other continuous values. This
allows query engines to quickly locate the target data during queries by
pruning irrelevant data ranges.
+
+**Bucketing**, on the other hand, distributes data based on the hash values of
one or more columns, which prevents data skew.
+
+Prior to version [2.1.0](https://doris.apache.org/blog/release-note-2.1.0),
there are two way you can create data partitions in Apache Doris:
+
+- **[Manual
Partition](https://doris.apache.org/docs/table-design/data-partition/#manual-partitioning)**:
Users specify the partitions in the table creation statement, or modify them
through DDL statements afterwards.
+
+- **[Dynamic
Partition](https://doris.apache.org/docs/table-design/data-partition/#dynamic-partition)**:
The system automatically maintains partitions within a pre-defined range based
on the data ingestion time.
+
+In Apache Doris 2.1.0, we have introduced **[Auto
Partition](https://doris.apache.org/docs/table-design/data-partition/#auto-partition)**.
It supports partitioning data by RANGE or by LIST and further enhances
flexibility on top of automatic partitioning.
+
+## Evolution of partitioning strategies in Doris
+
+In the design of data distribution, we focus more on partition planning,
because the choice of partition columns and partition intervals heavily depends
on the actual data distribution patterns, and a good partition design can
largely improve the query and storage efficiency of the table.
+
+In Doris, the data table is divided into partitions and then buckets in a
hierarchical manner. The data within the same bucket then forms a data
**tablet**, which is the minimum physical storage unit in Doris for data
replication, inter-cluster data scheduling, and load balancing.
+
+
+
+
+### Manual Partition
+
+Doris allows users to manually create data partitions by RANGE and by LIST.
+
+For time-stamped data like logs and transaction records, users typically
create partitions based on the time dimension. Here's an example of the CREATE
TABLE statement:
+
+```sql
+CREATE TABLE IF NOT EXISTS example_range_tbl
+(
+ `user_id` LARGEINT NOT NULL COMMENT "User ID",
+ `date` DATE NOT NULL COMMENT "Data import date",
+ `timestamp` DATETIME NOT NULL COMMENT "Data import timestamp",
+ `city` VARCHAR(20) COMMENT "Location of user",
+ `age` SMALLINT COMMENT "Age of user",
+ `sex` TINYINT COMMENT "Sex of user",
+ `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT
"Last visit date of user",
+ `cost` BIGINT SUM DEFAULT "0" COMMENT "User consumption",
+ `max_dwell_time` INT MAX DEFAULT "0" COMMENT "Maximum dwell time of user",
+ `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "Minimum dwell time of
user"
+)
+ENGINE=OLAP
+AGGREGATE KEY(`user_id`, `date`, `timestamp`, `city`, `age`, `sex`)
+PARTITION BY RANGE(`date`)
+(
+ PARTITION `p201701` VALUES LESS THAN ("2017-02-01"),
+ PARTITION `p201702` VALUES LESS THAN ("2017-03-01"),
+ PARTITION `p201703` VALUES LESS THAN ("2017-04-01"),
+ PARTITION `p2018` VALUES [("2018-01-01"), ("2019-01-01"))
+)
+DISTRIBUTED BY HASH(`user_id`) BUCKETS 16
+PROPERTIES
+(
+ "replication_num" = "1"
+);
+```
+
+The table is partitioned by the data import date `date`, and 4 partitions have
been pre-created. Within each partition, the data is further divided into 16
buckets based on the hash value of the `user_id`.
+
+With this partitioning and bucketing design, when querying data from 2018
onwards, the system only need to scan the `p2018` partition. This is what the
query SQL looks like:
+
+```sql
+mysql> desc select count() from example_range_tbl where date >= '20180101';
++--------------------------------------------------------------------------------------+
+| Explain String(Nereids Planner)
|
++--------------------------------------------------------------------------------------+
+| PLAN FRAGMENT 0
|
+| OUTPUT EXPRS:
|
+| count(*)[#11]
|
+| PARTITION: UNPARTITIONED
|
+|
|
+| ......
|
+|
|
+| 0:VOlapScanNode(193)
|
+| TABLE: test.example_range_tbl(example_range_tbl), PREAGGREGATION: OFF.
|
+| PREDICATES: (date[#1] >= '2018-01-01')
|
+| partitions=1/4 (p2018), tablets=16/16, tabletList=561490,561492,561494
... |
+| cardinality=0, avgRowSize=0.0, numNodes=1
|
+| pushAggOp=NONE
|
+|
|
++--------------------------------------------------------------------------------------+
+```
+
+If the data is distributed unevenly across partitions, the hash-based
bucketing mechanism can further divide the data based on the `user_id`. This
helps to avoid load imbalance on some machines during querying and storage.
+
+However, in real-world business scenarios, one cluster may have tens of
thousands of tables, which means it is impossible to manage them manually.
+
+```sql
+CREATE TABLE `DAILY_TRADE_VALUE`
+(
+ `TRADE_DATE` datev2 NOT NULL COMMENT 'Trade date',
+ `TRADE_ID` varchar(40) NOT NULL COMMENT 'Trade ID',
+ ......
+)
+UNIQUE KEY(`TRADE_DATE`, `TRADE_ID`)
+PARTITION BY RANGE(`TRADE_DATE`)
+(
+ PARTITION p_200001 VALUES [('2000-01-01'), ('2000-02-01')),
+ PARTITION p_200002 VALUES [('2000-02-01'), ('2000-03-01')),
+ PARTITION p_200003 VALUES [('2000-03-01'), ('2000-04-01')),
+ PARTITION p_200004 VALUES [('2000-04-01'), ('2000-05-01')),
+ PARTITION p_200005 VALUES [('2000-05-01'), ('2000-06-01')),
+ PARTITION p_200006 VALUES [('2000-06-01'), ('2000-07-01')),
+ PARTITION p_200007 VALUES [('2000-07-01'), ('2000-08-01')),
+ PARTITION p_200008 VALUES [('2000-08-01'), ('2000-09-01')),
+ PARTITION p_200009 VALUES [('2000-09-01'), ('2000-10-01')),
+ PARTITION p_200010 VALUES [('2000-10-01'), ('2000-11-01')),
+ PARTITION p_200011 VALUES [('2000-11-01'), ('2000-12-01')),
+ PARTITION p_200012 VALUES [('2000-12-01'), ('2001-01-01')),
+ PARTITION p_200101 VALUES [('2001-01-01'), ('2001-02-01')),
+ ......
+)
+DISTRIBUTED BY HASH(`TRADE_DATE`) BUCKETS 10
+PROPERTIES (
+ ......
+);
+```
+
+In the above example, data is partitioned on a monthly basis. This requires
the database administrator (DBA) to manually add a new partition each month and
maintain table schema regularly. Imagine the case of real-time data processing,
where you might need to create partitions daily or even hourly, manually doing
this is no long a choice. That's why we introduced Dynamic Partition.
+
+### Dynamic Partition
+
+By Dynamic Partition, Doris automatically creates and reclaims data partitions
as long as the user specifies the partition unit, the number of historical
partitions, and the number of future partitions. This functionality relies on a
fixed thread on the Doris Frontend. It continuously polls and checks for new
partitions to be created or old partitions to be reclaimed, and updates the
partition schema of the table.
+
+This is an example CREATE TABLE statement for a table which is partitioned by
day. The `start` and `end` parameters are set to `-7` and `3`, respectively,
meaning that data partitions for the next 3 days will be pre-created and the
historical partitions that are older than 7 days will be reclaimed.
+
+```sql
+CREATE TABLE `DAILY_TRADE_VALUE`
+(
+ `TRADE_DATE` datev2 NOT NULL COMMENT 'Trade date',
+ `TRADE_ID` varchar(40) NOT NULL COMMENT 'Trade ID',
+ ......
+)
+UNIQUE KEY(`TRADE_DATE`, `TRADE_ID`)
+PARTITION BY RANGE(`TRADE_DATE`) ()
+DISTRIBUTED BY HASH(`TRADE_DATE`) BUCKETS 10
+PROPERTIES (
+ "dynamic_partition.enable" = "true",
+ "dynamic_partition.time_unit" = "DAY",
+ "dynamic_partition.start" = "-7",
+ "dynamic_partition.end" = "3",
+ "dynamic_partition.prefix" = "p",
+ "dynamic_partition.buckets" = "10"
+);
+```
+
+Over time, the table will always maintain partitions within the range of
`[current date - 7, current date + 3]`. Dynamic Partition is particularly
useful for real-time data ingestion scenarios, such as when the ODS
(Operational Data Store) layer directly receives data from external sources
like Kafka.
+
+The `start` and `end` parameters define a fixed range for the partitions,
allowing the user to manage the partitions only within this range. However, if
the user needs to include more historical data, they would have to dial up the
`start` value, and that could lead to unnecessary metadata overhead in the
cluster.
+
+Therefore, when applying Dynamic Partition, there is a trade-off between the
convenience and efficiency of metadata management.
+
+## Developers' words
+
+As the complexity of business adds up, Dynamic Partition becomes inadequate
because:
+
+- It only supports partitioning by RANGE but not by LIST.
+
+- It can only be applied to the current real-world timestamps.
+
+- It only supports a single continuous partition range, and cannot accommodate
partitions outside of that range.
+
+Given these functional limitations, we started to plan a new partitioning
mechanism that can both automate partition management and simplify data table
maintenance.
+
+We figured out that the ideal partitioning implementation should:
+
+- Save the need for manually creating partitions after table creation;
+
+- Be able to accommodate all ingested data in corresponding partitions.
+
+**The former stands for automation and the latter for flexibility. The essence
of realizing them both is associating partition creation with the actual data.**
+
+Then we started to think about: What if we hold off the creation of partitions
until the data is ingested, rather than doing it during table creation or
through regular polling. Instead of pre-constructing the partition
distribution, we can define the "data-to-partition" mapping rules, so the
partitions are created after data arrives.
+
+Compared to Manual Partition, this whole process would be fully automated,
eliminating the need for human maintenance. Compared to Dynamic Partition, it
avoids having partitions that are not used, or partitions that are needed but
not present.
+
+## Auto Partition
+
+With [Apache Doris 2.1.0](https://doris.apache.org/blog/release-note-2.1.0),
we bring the above plan into fruition. During data ingestion, Doris creates
data partitions based on the configured rules. The Doris Backend nodes that are
responsible for data processing and distribution will attempt to find the
appropriate partition for each row of data in the DataSink operator of the
execution plan. It no longer filters out data that does not fit into any
existing partition or reports an erro [...]
+
+### Auto Partition by RANGE
+
+Auto Partition by RANGE provides an optimized partitioning solution based on
the time dimension. It is more flexible than Dynamic Partition in terms of
parameter configuration. The syntax for it is as follows:
+
+```sql
+AUTO PARTITION BY RANGE (FUNC_CALL_EXPR)
+()
+FUNC_CALL_EXPR ::= DATE_TRUNC ( <partition_column>, '<interval>' )
+```
+
+The `<partition_column>` above is the partition column (i.e., the column that
the partitioning is based on). `<interval>`specifies the partition unit, which
is the desired width of each partition.
+
+For example, if the partition column is `k0` and you want to partition by
month, the partition statement would be `AUTO PARTITION BY RANGE
(DATE_TRUNC(k0, 'month'))`. For all the imported data, the system will call
`DATE_TRUNC(k0, 'month')` to calculate the left endpoint of the partition, and
then the right endpoint by adding one `interval`.
+
+Now, we can apply Auto Partition to the `DAILY_TRADE_VALUE` table introduced
in the previous section on Dynamic Partition.
+
+```sql
+CREATE TABLE DAILY_TRADE_VALUE
+(
+ `TRADE_DATE` DATEV2 NOT NULL COMMENT 'Trade Date',
+ `TRADE_ID` VARCHAR(40) NOT NULL COMMENT 'Trade ID',
+ ......
+)
+AUTO PARTITION BY RANGE (DATE_TRUNC(`TRADE_DATE`, 'month'))
+()
+DISTRIBUTED BY HASH(`TRADE_DATE`) BUCKETS 10
+PROPERTIES
+(
+ ......
+);
+```
+
+After importing some data, these are the partitions we get:
+
+```sql
+mysql> show partitions from DAILY_TRADE_VALUE;
+Empty set (0.10 sec)
+
+mysql> insert into DAILY_TRADE_VALUE values ('2015-01-01', 1), ('2020-01-01',
2), ('2024-03-05', 10000), ('2024-03-06', 10001);
+Query OK, 4 rows affected (0.24 sec)
+{'label':'label_2a7353a3f991400e_ae731988fa2bc568', 'status':'VISIBLE',
'txnId':'85097'}
+
+mysql> show partitions from DAILY_TRADE_VALUE;
++-------------+-----------------+----------------+---------------------+--------+--------------+--------------------------------------------------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+--------------------+--------------+
+| PartitionId | PartitionName | VisibleVersion | VisibleVersionTime | State
| PartitionKey | Range
| DistributionKey | Buckets | ReplicationNum | StorageMedium |
CooldownTime | RemoteStoragePolicy | LastConsistencyCheckTime | DataSize
| IsInMemory | ReplicaAllocation | IsMutable | SyncWithBaseTables |
UnsyncTables |
++-------------+-----------------+----------------+---------------------+--------+--------------+--------------------------------------------------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+--------------------+--------------+
+| 588395 | p20150101000000 | 2 | 2024-06-01 19:02:40 |
NORMAL | TRADE_DATE | [types: [DATEV2]; keys: [2015-01-01]; ..types:
[DATEV2]; keys: [2015-02-01]; ) | TRADE_DATE | 10 | 1 |
HDD | 9999-12-31 23:59:59 | | NULL
| 0.000 | false | tag.location.default: 1 | true | true
| NULL |
+| 588437 | p20200101000000 | 2 | 2024-06-01 19:02:40 |
NORMAL | TRADE_DATE | [types: [DATEV2]; keys: [2020-01-01]; ..types:
[DATEV2]; keys: [2020-02-01]; ) | TRADE_DATE | 10 | 1 |
HDD | 9999-12-31 23:59:59 | | NULL
| 0.000 | false | tag.location.default: 1 | true | true
| NULL |
+| 588416 | p20240301000000 | 2 | 2024-06-01 19:02:40 |
NORMAL | TRADE_DATE | [types: [DATEV2]; keys: [2024-03-01]; ..types:
[DATEV2]; keys: [2024-04-01]; ) | TRADE_DATE | 10 | 1 |
HDD | 9999-12-31 23:59:59 | | NULL
| 0.000 | false | tag.location.default: 1 | true | true
| NULL |
++-------------+-----------------+----------------+---------------------+--------+--------------+--------------------------------------------------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+--------------------+--------------+
+3 rows in set (0.09 sec)
+```
+
+As is shown, partitions are automatically created for the imported data, and
it doesn't create partitions that are beyond the range of the existing data.
+
+### Auto Partition by LIST
+
+Auto Partition by LIST is to shard data based on non-time-based dimensions,
such as `region` and `department`. It fills that gap for Dynamic Partition,
which does not support data partitioning by LIST.
+
+Auto Partition by RANGE provides an optimized partitioning solution based on
the time dimension. It is more flexible than Dynamic Partition in terms of
parameter configuration. The syntax for it is as follows:
+
+```sql
+AUTO PARTITION BY LIST (`partition_col`)
+()
+```
+
+This is an example of Auto Partition by LIST using `city` as the partition
column:
+
+```SQL
+mysql> CREATE TABLE `str_table` (
+ -> `city` VARCHAR NOT NULL,
+ -> ......
+ -> )
+ -> DUPLICATE KEY(`city`)
+ -> AUTO PARTITION BY LIST (`city`)
+ -> ()
+ -> DISTRIBUTED BY HASH(`city`) BUCKETS 10
+ -> PROPERTIES (
+ -> ......
+ -> );
+Query OK, 0 rows affected (0.09 sec)
+
+mysql> insert into str_table values ("Denver"), ("Boston"), ("Los_Angeles");
+Query OK, 3 rows affected (0.25 sec)
+
+mysql> show partitions from str_table;
++-------------+-----------------+----------------+---------------------+--------+--------------+-------------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+--------------------+--------------+
+| PartitionId | PartitionName | VisibleVersion | VisibleVersionTime | State
| PartitionKey | Range | DistributionKey |
Buckets | ReplicationNum | StorageMedium | CooldownTime |
RemoteStoragePolicy | LastConsistencyCheckTime | DataSize | IsInMemory |
ReplicaAllocation | IsMutable | SyncWithBaseTables | UnsyncTables |
++-------------+-----------------+----------------+---------------------+--------+--------------+-------------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+--------------------+--------------+
+| 589685 | pDenver7 | 2 | 2024-06-01 20:12:37 |
NORMAL | city | [types: [VARCHAR]; keys: [Denver]; ] | city
| 10 | 1 | HDD | 9999-12-31 23:59:59 |
| NULL | 0.000 | false |
tag.location.default: 1 | true | true | NULL |
+| 589643 | pLos5fAngeles11 | 2 | 2024-06-01 20:12:37 |
NORMAL | city | [types: [VARCHAR]; keys: [Los_Angeles]; ] | city
| 10 | 1 | HDD | 9999-12-31 23:59:59 |
| NULL | 0.000 | false |
tag.location.default: 1 | true | true | NULL |
+| 589664 | pBoston8 | 2 | 2024-06-01 20:12:37 |
NORMAL | city | [types: [VARCHAR]; keys: [Boston]; ] | city
| 10 | 1 | HDD | 9999-12-31 23:59:59 |
| NULL | 0.000 | false |
tag.location.default: 1 | true | true | NULL |
++-------------+-----------------+----------------+---------------------+--------+--------------+-------------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+--------------------+--------------+
+3 rows in set (0.10 sec)
+```
+
+After inserting data for the cities of Denver, Boston, and Los Angeles, the
system automatically created corresponding partitions based on the city names.
Previously, this type of custom partitioning could only be achieved through
manual DDL statements. This is how Auto Partition by LIST simplifies database
maintenance.
+
+### Tips & notes
+
+**Manually adjust historical partitions**
+
+For tables that receive both real-time data and occasional historical updates,
since Auto Partition does not automatically reclaim historical partitions, we
recommend two options:
+
+- Use Auto Partition, which will automatically create partitions for the
occasional historical data updates.
+
+- Use Auto Partition and manually create a `LESS THAN` partition to
accommodate the historical updates. This allows for a clearer separation of
historical and real-time data, and makes data management easier.
+
+```sql
+mysql> CREATE TABLE DAILY_TRADE_VALUE
+ -> (
+ -> `TRADE_DATE` DATEV2 NOT NULL COMMENT 'Trade Date',
+ -> `TRADE_ID` VARCHAR(40) NOT NULL COMMENT 'Trade ID'
+ -> )
+ -> AUTO PARTITION BY RANGE (DATE_TRUNC(`TRADE_DATE`, 'DAY'))
+ -> (
+ -> PARTITION `pHistory` VALUES LESS THAN ("2024-01-01")
+ -> )
+ -> DISTRIBUTED BY HASH(`TRADE_DATE`) BUCKETS 10
+ -> PROPERTIES
+ -> (
+ -> "replication_num" = "1"
+ -> );
+Query OK, 0 rows affected (0.11 sec)
+
+mysql> insert into DAILY_TRADE_VALUE values ('2015-01-01', 1), ('2020-01-01',
2), ('2024-03-05', 10000), ('2024-03-06', 10001);
+Query OK, 4 rows affected (0.25 sec)
+{'label':'label_96dc3d20c6974f4a_946bc1a674d24733', 'status':'VISIBLE',
'txnId':'85092'}
+
+mysql> show partitions from DAILY_TRADE_VALUE;
++-------------+-----------------+----------------+---------------------+--------+--------------+--------------------------------------------------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+--------------------+--------------+
+| PartitionId | PartitionName | VisibleVersion | VisibleVersionTime | State
| PartitionKey | Range
| DistributionKey | Buckets | ReplicationNum | StorageMedium |
CooldownTime | RemoteStoragePolicy | LastConsistencyCheckTime | DataSize
| IsInMemory | ReplicaAllocation | IsMutable | SyncWithBaseTables |
UnsyncTables |
++-------------+-----------------+----------------+---------------------+--------+--------------+--------------------------------------------------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+--------------------+--------------+
+| 577871 | pHistory | 2 | 2024-06-01 08:53:49 |
NORMAL | TRADE_DATE | [types: [DATEV2]; keys: [0000-01-01]; ..types:
[DATEV2]; keys: [2024-01-01]; ) | TRADE_DATE | 10 | 1 |
HDD | 9999-12-31 23:59:59 | | NULL
| 0.000 | false | tag.location.default: 1 | true | true
| NULL |
+| 577940 | p20240305000000 | 2 | 2024-06-01 08:53:49 |
NORMAL | TRADE_DATE | [types: [DATEV2]; keys: [2024-03-05]; ..types:
[DATEV2]; keys: [2024-03-06]; ) | TRADE_DATE | 10 | 1 |
HDD | 9999-12-31 23:59:59 | | NULL
| 0.000 | false | tag.location.default: 1 | true | true
| NULL |
+| 577919 | p20240306000000 | 2 | 2024-06-01 08:53:49 |
NORMAL | TRADE_DATE | [types: [DATEV2]; keys: [2024-03-06]; ..types:
[DATEV2]; keys: [2024-03-07]; ) | TRADE_DATE | 10 | 1 |
HDD | 9999-12-31 23:59:59 | | NULL
| 0.000 | false | tag.location.default: 1 | true | true
| NULL |
++-------------+-----------------+----------------+---------------------+--------+--------------+--------------------------------------------------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+--------------------+--------------+
+3 rows in set (0.10 sec)
+```
+
+**NULL partition**
+
+With Auto Partition by LIST, Doris supports storing NULL values in NULL
partitions. For example:
+
+```sql
+mysql> CREATE TABLE list_nullable
+ -> (
+ -> `str` varchar NULL
+ -> )
+ -> AUTO PARTITION BY LIST (`str`)
+ -> ()
+ -> DISTRIBUTED BY HASH(`str`) BUCKETS auto
+ -> PROPERTIES
+ -> (
+ -> "replication_num" = "1"
+ -> );
+Query OK, 0 rows affected (0.10 sec)
+
+mysql> insert into list_nullable values ('123'), (''), (NULL);
+Query OK, 3 rows affected (0.24 sec)
+{'label':'label_f5489769c2f04f0d_bfb65510f9737fff', 'status':'VISIBLE',
'txnId':'85089'}
+
+mysql> show partitions from list_nullable;
++-------------+---------------+----------------+---------------------+--------+--------------+------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+--------------------+--------------+
+| PartitionId | PartitionName | VisibleVersion | VisibleVersionTime | State
| PartitionKey | Range | DistributionKey | Buckets
| ReplicationNum | StorageMedium | CooldownTime | RemoteStoragePolicy |
LastConsistencyCheckTime | DataSize | IsInMemory | ReplicaAllocation |
IsMutable | SyncWithBaseTables | UnsyncTables |
++-------------+---------------+----------------+---------------------+--------+--------------+------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+--------------------+--------------+
+| 577297 | pX | 2 | 2024-06-01 08:19:21 | NORMAL
| str | [types: [VARCHAR]; keys: [NULL]; ] | str | 10
| 1 | HDD | 9999-12-31 23:59:59 | |
NULL | 0.000 | false | tag.location.default: 1 |
true | true | NULL |
+| 577276 | p0 | 2 | 2024-06-01 08:19:21 | NORMAL
| str | [types: [VARCHAR]; keys: []; ] | str | 10
| 1 | HDD | 9999-12-31 23:59:59 | |
NULL | 0.000 | false | tag.location.default: 1 |
true | true | NULL |
+| 577255 | p1233 | 2 | 2024-06-01 08:19:21 | NORMAL
| str | [types: [VARCHAR]; keys: [123]; ] | str | 10
| 1 | HDD | 9999-12-31 23:59:59 | |
NULL | 0.000 | false | tag.location.default: 1 |
true | true | NULL |
++-------------+---------------+----------------+---------------------+--------+--------------+------------------------------------+-----------------+---------+----------------+---------------+---------------------+---------------------+--------------------------+----------+------------+-------------------------+-----------+--------------------+--------------+
+3 rows in set (0.11 sec)
+```
+
+However, Auto Partition by RANGE does not support NULL partitions, because the
NULL values will be stored in the smallest `LESS THAN` partition, and it is
impossible to reliably determine the appropriate range for it. If Auto
Partition were to create a NULL partition with a range of (-INFINITY,
MIN_VALUE), there would be a risk of this partition being inadvertently deleted
in production, as the MIN_VALUE boundary may not accurately represent the
intended business logic.
+
+### Summary
+
+Auto Partition covers most of the use cases of Dynamic Partition, while
introducing the benefit of upfront partition rule definition. Once the rules
are defined, the bulk of partition creation work is automatically handled by
Doris instead of a DBA.
+
+Before utilizing Auto Partition, it's important to understand the relevant
limitations:
+
+1. Auto Partition by LIST supports partitioning based on **multiple columns**,
but each automatically created partition only contains one single value, and
the partition name cannot exceed 50 characters in length. Note that the
partition names follow specific naming conventions, which have particular
implications for metadata management. That means not all of the 50-character
space is at the user's disposal.
+
+2. Auto Partition by RANGE only supports a **single partition column**, which
must be of type **DATE** or **DATETIME**.
+
+3. Auto Partition by LIST supports **NULLABLE** partition column and inserting
NULL values. Auto Partition by RANGE does not support NULLABLE partition column.
+
+4. It is not recommended to use Auto Partition in conjunction with Dynamic
Partition after Apache Doris 2.1.3.
+
+## Performance comparison
+
+The main functional differences between Auto Partition and Dynamic Partition
lie in partition creation and deletion, supported partition types, and their
impact on import performance.
+
+Dynamic Partition uses fixed threads to periodically create and reclaim
partitions. It only supports partitioning by RANGE. In contrast, Auto Partition
supports both partitioning by RANGE and by LIST. It automatically creates
partitions on-demand based on specific rules during data ingestion, providing a
higher level of automation and flexibility.
+
+Dynamic Partition does not slow down data ingestion speed, while Auto
Partition causes certain time overheads because it firstly checks for existing
partitions and then creates new ones on demand. We will present the performance
test results.
+
+
+
+## Auto Partition: ingestion workflow
+
+This part is about how data ingestion is implemented with the Auto Partition
mechanism, and we use [Stream
Load](https://doris.apache.org/docs/data-operate/import/stream-load-manual) as
an example. When Doris initiates a data import, one of the Doris Backend nodes
takes on the role of the Coordinator. It is responsible for the initial data
processing work and then dispatching the data to the appropriate BE nodes,
known as the Executors, for execution.
+
+
+
+
+In the final Datasink Node of the Coordinator's execution pipeline, the data
needs to be routed to the correct partitions, buckets, and Doris Backend node
locations before it can be successfully transmitted and stored.
+
+To enable this data transfer, the Coordinator and Executor nodes establish a
communication channels:
+
+- The sending end is called the Node Channel.
+
+- The receiving end is called the Tablets Channel.
+
+This is how Auto Partition comes into play during the process of determining
the correct partitions for the data:
+
+
+
+
+Previously, without Auto Partition, when a table does not have the required
partition, the behavior in Doris is for the BE nodes to accumulate errors until
a `DATA_QUALITY_ERROR` is reported. Now, with Auto Partition enabled, a request
will be initiated to the Doris Frontend to create the necessary partition
on-the-fly. After the partition creation transaction is completed, the Doris
Frontend responds to the Coordinator, which then opens the corresponding
communication channels (Node Cha [...]
+
+In a real-world cluster environment, the time spent by the Coordinator waiting
for the Doris Frontend to complete partition creation can incur large
overheads. This is due to the inherent latency of Thrift RPC calls, as well as
lock contention on the Frontend under high load conditions.
+
+To improve the data ingestion efficiency in Auto Partition, Doris has
implemented batching to largely reduce the number of RPC calls made to the FE.
This brings a notable performance enhancement for data write operations.
+
+Note that when the FE Master completes the partition creation transaction, the
new partition becomes immediately visible. However, if the import process
ultimately fails or is canceled, the created partitions are not automatically
reclaimed.
+
+## Auto Partition performance
+
+We tested the performance and stability of Auto Partition in Doris, covering
different use cases:
+
+**Case 1**: 1 Frontend + 3 Backend; 6 randomly generated datasets, each having
100 million rows and 2,000 partitions; ingested the 6 datasets concurrently
into 6 tables
+
+- **Objective**: Evaluate the performance of Auto Partition under high
pressure and check for any performance degradation.
+
+- **Results**: Auto Partition brings an **average performance loss less than
5%**, with all import transactions running stably.
+
+
+
+**Case 2**: 1 Frontend + 3 Backend; ingesting 100 rows per second from Flink
by Routine Load; testing with 1, 10, and 20 concurrent transactions (tables),
respectively
+
+- **Objective**: Identify any potential or data backlog issues that could
arise with Auto Partition under different concurrency levels.
+
+- **Results**: With or without Auto Partition enabled, the data ingestion was
successful without any backpressure issues across all the concurrency levels
tested, even at 20 concurrent transactions when the CPU utilization reached
close to 100%.
+
+
+
+
+To conclude the results of these tests, the impact of enabling Auto Partition
on data ingestion performance is minimal.
+
+## Conclusion and future plans
+
+Auto Partition has simplified DDL and partition management since Apache Doris
2.1.0. It is useful in large-scale data processing and makes it easy for users
to migrate from other database systems to Apache Doris.
+
+Moreover, we are committed to expanding the capabilities of Auto Partition to
support more complex data types.
+
+Plans for Auto Partition by RANGE:
+
+- Support numeric values;
+
+- Allowing users to specify the left and right boundaries of the partition
range.
+
+Plans for Auto Partition by LIST:
+
+- Allow merging multiple values into the same partition based on specific
rules.
+
+Join [Apache Doris open-source
community](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-2gmq5o30h-455W226d79zP3L96ZhXIoQ)
for more information and further guidance.
\ No newline at end of file
diff --git a/blog/evolution-of-the-apache-doris-execution-engine.md
b/blog/evolution-of-the-apache-doris-execution-engine.md
index f9d3255e177..e2ae4d63e44 100644
--- a/blog/evolution-of-the-apache-doris-execution-engine.md
+++ b/blog/evolution-of-the-apache-doris-execution-engine.md
@@ -6,8 +6,6 @@
'date': '2024-06-18',
'author': 'Apache Doris',
'tags': ['Tech Sharing'],
- 'picked': "true",
- 'order': "4",
"image": '/images/evolution-of-the-apache-doris-execution-engine.jpg'
}
diff --git a/blog/migrate-lakehouse-from-bigquery-to-doris.md
b/blog/migrate-lakehouse-from-bigquery-to-doris.md
index dbbe238a521..536ae4727a6 100644
--- a/blog/migrate-lakehouse-from-bigquery-to-doris.md
+++ b/blog/migrate-lakehouse-from-bigquery-to-doris.md
@@ -7,7 +7,7 @@
'author': 'Dien, Tran Thanh',
'tags': ['Best Practice'],
'picked': "true",
- 'order': "2",
+ 'order': "3",
"image": '/images/migrate-lakehouse-from-bigquery-to-apache-doris.jpg'
}
diff --git a/blog/release-note-2.0.14.md b/blog/release-note-2.0.14.md
index 083809df92d..1e52d95e26e 100644
--- a/blog/release-note-2.0.14.md
+++ b/blog/release-note-2.0.14.md
@@ -7,7 +7,7 @@
'author': 'Apache Doris',
'tags': ['Release Notes'],
'picked': "true",
- 'order': "1",
+ 'order': "2",
"image": '/images/2.0.14.jpg'
}
---
diff --git a/src/components/recent-blogs/recent-blogs.data.ts
b/src/components/recent-blogs/recent-blogs.data.ts
index 3737e728480..36c0a344f39 100644
--- a/src/components/recent-blogs/recent-blogs.data.ts
+++ b/src/components/recent-blogs/recent-blogs.data.ts
@@ -1,4 +1,8 @@
export const RECENT_BLOGS_POSTS = [
+ {
+ label: 'Apache Doris version 2.0.14 has been released',
+ link: 'https://doris.apache.org/blog/release-note-2.0.14',
+ },
{
label: 'Migrate data lakehouse from BigQuery to Apache Doris, saving
$4,500 per month',
link:
'https://doris.apache.org/blog/migrate-lakehouse-from-bigquery-to-doris',
@@ -10,9 +14,5 @@ export const RECENT_BLOGS_POSTS = [
{
label: `Steps to industry-leading query speed: evolution of the Apache
Doris execution engine`,
link:
'https://doris.apache.org/blog/evolution-of-the-apache-doris-execution-engine',
- },
- {
- label: `Another lifesaver for data engineers: Apache Doris Job
Scheduler for task automation`,
- link:
'https://doris.apache.org/blog/job-scheduler-for-task-automation',
- },
+ }
];
diff --git a/src/constant/newsletter.data.ts b/src/constant/newsletter.data.ts
index 658e18e1359..a063afa1c6e 100644
--- a/src/constant/newsletter.data.ts
+++ b/src/constant/newsletter.data.ts
@@ -1,4 +1,11 @@
export const NEWSLETTER_DATA = [
+ {
+ tags: ['Tech Sharing'],
+ title: "Automatic and flexible data sharding: Auto Partition in Apache
Doris",
+ content: `Apache Doris 2.1.0 introduces Auto Partition. It supports
partitioning data by RANGE or by LIST and further enhances flexibility on top
of automatic partitioning.`,
+ to: '/blog/auto-partition-in-apache-doris',
+ image: 'auto-partition-in-apache-doris.jpg',
+ },
{
tags: ['Release Note'],
title: "Apache Doris version 2.0.14 has been released",
@@ -20,12 +27,6 @@ export const NEWSLETTER_DATA = [
to: '/blog/apache-doris-vs-rockset',
image: 'doris-vs-rockset.jpeg',
},
- {
- tags: ['Recent Events'],
- title: "Steps to industry-leading query speed: evolution of the Apache
Doris execution engine",
- content: `From the Volcano Model to the Pipeline Execution Engine, and
now PipelineX, Apache Doris brings its computation efficiency to a higher level
with each iteration.`,
- to: '/blog/evolution-of-the-apache-doris-execution-engine',
- image: 'evolution-of-the-apache-doris-execution-engine.jpg',
- },
+
];
diff --git a/static/images/auto-partition-in-apache-doris.jpg
b/static/images/auto-partition-in-apache-doris.jpg
new file mode 100644
index 00000000000..bc26f576ba0
Binary files /dev/null and b/static/images/auto-partition-in-apache-doris.jpg
differ
diff --git a/static/images/auto-partition-ingestion-workflow-2.png
b/static/images/auto-partition-ingestion-workflow-2.png
new file mode 100644
index 00000000000..bda96e58226
Binary files /dev/null and
b/static/images/auto-partition-ingestion-workflow-2.png differ
diff --git a/static/images/auto-partition-ingestion-workflow.png
b/static/images/auto-partition-ingestion-workflow.png
new file mode 100644
index 00000000000..bbcee300280
Binary files /dev/null and
b/static/images/auto-partition-ingestion-workflow.png differ
diff --git a/static/images/auto-partition-performance-2.png
b/static/images/auto-partition-performance-2.png
new file mode 100644
index 00000000000..e281550a647
Binary files /dev/null and b/static/images/auto-partition-performance-2.png
differ
diff --git a/static/images/auto-partition-performance.png
b/static/images/auto-partition-performance.png
new file mode 100644
index 00000000000..0c61cf2c6f7
Binary files /dev/null and b/static/images/auto-partition-performance.png differ
diff --git a/static/images/evolution-of-partitioning-strategies-in-Doris.png
b/static/images/evolution-of-partitioning-strategies-in-Doris.png
new file mode 100644
index 00000000000..d556aa55a55
Binary files /dev/null and
b/static/images/evolution-of-partitioning-strategies-in-Doris.png differ
diff --git a/static/images/performance-comparison.png
b/static/images/performance-comparison.png
new file mode 100644
index 00000000000..926702b39cd
Binary files /dev/null and b/static/images/performance-comparison.png differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]