This is an automated email from the ASF dual-hosted git repository.
luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 13bcd109472d [docs](doc)Update en doc (#461)
13bcd109472d is described below
commit 13bcd109472d8967882fd2bd0f1d8f01230a8841
Author: Hu Yanjun <[email protected]>
AuthorDate: Sat Mar 23 14:48:57 2024 +0800
[docs](doc)Update en doc (#461)
---
static/images/aggregate-key-model-example.png | Bin 0 -> 355467 bytes
static/images/create-table-example.png | Bin 0 -> 390782 bytes
static/images/duplicate-key-model-example.png | Bin 0 -> 372902 bytes
.../version-2.0/admin-manual/data-admin/ccr.md | 4 +-
.../admin-manual/query-admin/kill-query.md | 61 ++-
.../data-operate/delete/atomicity-replace.md | 43 +-
.../data-operate/delete/truncate-manual.md | 32 +-
.../data-operate/update/aggregate-load-update.md | 60 ++-
.../version-2.0/db-connect/database-connect.md | 6 +-
.../cluster-deployment/standard-deployment.md | 48 +-
.../install/source-install/compilation-arm.md | 4 +-
.../install/source-install/compilation-mac.md | 12 +-
.../compilation-with-ldb-toolchain.md | 12 +-
.../version-2.0/lakehouse/datalake/hive.md | 525 ++++++++++++---------
.../version-2.0/table-design/best-practice.md | 483 +++++++++++++++----
15 files changed, 937 insertions(+), 353 deletions(-)
diff --git a/static/images/aggregate-key-model-example.png
b/static/images/aggregate-key-model-example.png
new file mode 100644
index 000000000000..2474a1bcc938
Binary files /dev/null and b/static/images/aggregate-key-model-example.png
differ
diff --git a/static/images/create-table-example.png
b/static/images/create-table-example.png
new file mode 100644
index 000000000000..469c79842ec1
Binary files /dev/null and b/static/images/create-table-example.png differ
diff --git a/static/images/duplicate-key-model-example.png
b/static/images/duplicate-key-model-example.png
new file mode 100644
index 000000000000..3935adb30410
Binary files /dev/null and b/static/images/duplicate-key-model-example.png
differ
diff --git a/versioned_docs/version-2.0/admin-manual/data-admin/ccr.md
b/versioned_docs/version-2.0/admin-manual/data-admin/ccr.md
index 7f16c6822bb2..3bd8e54a836f 100644
--- a/versioned_docs/version-2.0/admin-manual/data-admin/ccr.md
+++ b/versioned_docs/version-2.0/admin-manual/data-admin/ccr.md
@@ -562,7 +562,9 @@ download_binlog_rate_limit_kbs=1024 # This configuration
limits the size to 1MB.
## IS_BEING_SYNCED
-:::tip Doris v2.0 "is_being_synced" = "true" :::
+:::tip
+Doris v2.0 "is_being_synced" = "true"
+:::
During data synchronization using CCR, replica tables (referred to as target
tables) are created in the target cluster for the tables within the
synchronization scope of the source cluster (referred to as source tables).
However, certain functionalities and attributes need to be disabled or cleared
when creating replica tables to ensure the correctness of the synchronization
process. For example:
diff --git a/versioned_docs/version-2.0/admin-manual/query-admin/kill-query.md
b/versioned_docs/version-2.0/admin-manual/query-admin/kill-query.md
index 532ac3e62629..0e0ec8ecddd4 100644
--- a/versioned_docs/version-2.0/admin-manual/query-admin/kill-query.md
+++ b/versioned_docs/version-2.0/admin-manual/query-admin/kill-query.md
@@ -24,4 +24,63 @@ specific language governing permissions and limitations
under the License.
-->
-Coming Soon
\ No newline at end of file
+# Kill Query
+## Kill connection
+
+In Doris, each connection runs in a separate thread. You can terminate a
thread using the `KILL processlist_id` statement.
+
+The `processlist_id` for the thread can be found in the Id column of the
SHOW PROCESSLIST output. Alternatively, you can use the `SELECT CONNECTION_ID()`
command to query the current connection id.
+
+Syntax:
+
+```SQL
+KILL [CONNECTION] processlist_id
+```
+
+## Kill query
+
+You can also terminate a query that is currently executing based on the
processlist_id or the query_id.
+
+Syntax:
+
+```SQL
+KILL QUERY processlist_id | query_id
+```
+
+## Example
+
+1. Check the current connection id.
+
+```SQL
+mysql> select connection_id();
++-----------------+
+| connection_id() |
++-----------------+
+| 48 |
++-----------------+
+1 row in set (0.00 sec)
+```
+
+2. Check all connection ids.
+
+```SQL
+mysql> SHOW PROCESSLIST;
++------------------+------+------+--------------------+---------------------+----------+---------+---------+------+-------+-----------------------------------+---------------------------------------------------------------------------------------+
+| CurrentConnected | Id | User | Host | LoginTime |
Catalog | Db | Command | Time | State | QueryId
| Info
|
++------------------+------+------+--------------------+---------------------+----------+---------+---------+------+-------+-----------------------------------+---------------------------------------------------------------------------------------+
+| Yes | 48 | root | 10.16.xx.xx:44834 | 2023-12-29 16:49:47 |
internal | test | Query | 0 | OK | e6e4ce9567b04859-8eeab8d6b5513e38 |
SHOW PROCESSLIST
|
+| | 50 | root | 192.168.xx.xx:52837 | 2023-12-29 16:51:34 |
internal | | Sleep | 1837 | EOF | deaf13c52b3b4a3b-b25e8254b50ff8cb |
SELECT @@session.transaction_isolation
|
+| | 51 | root | 192.168.xx.xx:52843 | 2023-12-29 16:51:35 |
internal | | Sleep | 907 | EOF | 437f219addc0404f-9befe7f6acf9a700 |
/* ApplicationName=DBeaver Ultimate 23.1.3 - Metadata */ SHOW STATUS
|
+| | 55 | root | 192.168.xx.xx:55533 | 2023-12-29 17:09:32 |
internal | test | Sleep | 271 | EOF | f02603dc163a4da3-beebbb5d1ced760c |
/* ApplicationName=DBeaver Ultimate 23.1.3 - SQLEditor <Console> */ SELECT
DATABASE() |
+| | 47 | root | 10.16.xx.xx:35678 | 2023-12-29 16:21:56 |
internal | test | Sleep | 3528 | EOF | f4944c543dc34a99-b0d0f3986c8f1c98 |
select * from test
|
++------------------+------+------+--------------------+---------------------+----------+---------+---------+------+-------+-----------------------------------+---------------------------------------------------------------------------------------+
+5 rows in set (0.00 sec)
+```
+
+3. Kill the currently running query, which will then be displayed as canceled.
+
+```SQL
+mysql> kill query 55;
+Query OK, 0 rows affected (0.01 sec)
+```
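+
+4. Kill a connection by its `processlist_id`. The id below is illustrative; use a value returned by `SHOW PROCESSLIST` or `SELECT CONNECTION_ID()`.
+
+```SQL
+KILL CONNECTION 50;
+```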
+
diff --git
a/versioned_docs/version-2.0/data-operate/delete/atomicity-replace.md
b/versioned_docs/version-2.0/data-operate/delete/atomicity-replace.md
index a3da6b10cb0c..d3a2b9a867fc 100644
--- a/versioned_docs/version-2.0/data-operate/delete/atomicity-replace.md
+++ b/versioned_docs/version-2.0/data-operate/delete/atomicity-replace.md
@@ -24,4 +24,45 @@ specific language governing permissions and limitations
under the License.
-->
-Coming Soon
\ No newline at end of file
+# Atomicity Replace
+
+Doris supports atomic table replacement operations for two tables. This is
only applicable to OLAP tables.
+
+## Applicable scenarios
+
+- Atomic overwrite operations
+- In certain cases, users may want to rewrite data in a table. However, the
"delete and load" approach causes a data invisibility window. To solve that,
Doris allows users to create a new table of the same schema using the CREATE
TABLE LIKE statement, import the new data into this new table, and then
atomically replace the old table with the new table. For atomic replacement at
the partition level, please refer to the [temporary
partition](https://doris.apache.org/docs/2.0/data-operate/d [...]
+
+## Syntax
+
+```Plain
+ALTER TABLE [db.]tbl1 REPLACE WITH TABLE tbl2
+[PROPERTIES('swap' = 'true')];
+```
+
+Replace table tbl1 with table tbl2.
+
+If `swap` is `true`, after the replacement, data in `tbl1` will be replaced by
that in `tbl2`, while data in `tbl2` will be replaced by that in `tbl1`. In
other words, the two tables will swap data.
+
+If `swap` is `false`, after the replacement, data in `tbl1` will be replaced
by that in `tbl2` and `tbl2` will be deleted.
+
+## Implementation
+
+Essentially, table replacement combines the following operations into one
atomic operation.
+
+Assume that table A is to be replaced with table B and `swap` is set to
`true`. The operations are as follows:
+
+1. Rename table B to table A.
+2. Rename table A to table B.
+
+If `swap` is set to `false`, the operations are as follows:
+
+1. Delete table A.
+2. Rename table B to table A.
+
+## Note
+
+- `swap` defaults to `true`, meaning to swap the data between two tables.
+- If `swap` is set to `false`, the table being replaced (table A) will be
deleted and cannot be recovered.
+- The replacement operation can only be implemented between two OLAP tables
and it does not check for table schema consistency.
+- The replacement operation does not change the existing privilege settings
because privilege checks are based on table names.
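+
+As an illustration of the atomic overwrite workflow described above, the following sketch rewrites a table without a data invisibility window (the table names are hypothetical):
+
+```SQL
+-- Create a new table with the same schema as the old one
+CREATE TABLE db1.sales_new LIKE db1.sales;
+
+-- Load the rewritten data into the new table (any load method works)
+INSERT INTO db1.sales_new SELECT * FROM db1.sales_staging;
+
+-- Atomically replace the old table; with 'swap' = 'false', the old data is dropped
+ALTER TABLE db1.sales REPLACE WITH TABLE sales_new PROPERTIES('swap' = 'false');
+```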
diff --git a/versioned_docs/version-2.0/data-operate/delete/truncate-manual.md
b/versioned_docs/version-2.0/data-operate/delete/truncate-manual.md
index 9debf3729078..a02bd205f244 100644
--- a/versioned_docs/version-2.0/data-operate/delete/truncate-manual.md
+++ b/versioned_docs/version-2.0/data-operate/delete/truncate-manual.md
@@ -24,4 +24,34 @@ specific language governing permissions and limitations
under the License.
-->
-Coming Soon
\ No newline at end of file
+# Truncate
+
+This statement is used to clear the data of a specified table or partition in
Doris.
+
+## Syntax
+
+```SQL
+TRUNCATE TABLE [db.]tbl[ PARTITION(p1, p2, ...)];
+```
+
+- This statement only clears the data within a table or partition but
preserves the table or partition itself.
+- Unlike DELETE, this statement can only clear the specified table or
partition as a whole and cannot be added with filter conditions.
+- Unlike DELETE, truncating data will not affect query performance.
+- The data deleted by this operation is not recoverable.
+- When using this command, the table status must be NORMAL, which means that
tables undergoing SCHEMA CHANGE cannot be truncated.
+- This command may cause ongoing imports to fail.
+
+## Example
+
+1. Truncate the table `tbl` under `example_db`.
+
+```SQL
+TRUNCATE TABLE example_db.tbl;
+```
+
+2. Truncate partitions `p1` and `p2` of table `tbl`.
+
+```SQL
+TRUNCATE TABLE tbl PARTITION(p1, p2);
+```
+
diff --git
a/versioned_docs/version-2.0/data-operate/update/aggregate-load-update.md
b/versioned_docs/version-2.0/data-operate/update/aggregate-load-update.md
index 1514201f4190..e66af6ae004d 100644
--- a/versioned_docs/version-2.0/data-operate/update/aggregate-load-update.md
+++ b/versioned_docs/version-2.0/data-operate/update/aggregate-load-update.md
@@ -24,4 +24,62 @@ specific language governing permissions and limitations
under the License.
-->
-Coming soon
\ No newline at end of file
+# Update for Aggregate Load
+
+This guide is about ingestion-based data updates for the Aggregate Key model
in Doris.
+
+## Update all columns
+
+When importing data into an Aggregate Key model in Doris by methods like
Stream Load, Broker Load, Routine Load, and Insert Into, the new values are
combined with the old values to produce new aggregated values based on the
column's aggregation function. These values might be generated during insertion
or produced asynchronously during compaction. However, when querying, users
will always receive the same returned values.
+
+## Partial column update for Aggregate Key model
+
+Tables in the Aggregate Key model are primarily used in cases with
pre-aggregation requirements rather than data updates, but Doris allows partial
column updates for them, too. Simply set the aggregation function to
`REPLACE_IF_NOT_NULL`.
+
+**Create table**
+
+For the columns that need to be updated, set the aggregation function to
`REPLACE_IF_NOT_NULL`.
+
+```Plain
+CREATE TABLE order_tbl (
+ order_id int(11) NULL,
+ order_amount int(11) REPLACE_IF_NOT_NULL NULL,
+ order_status varchar(100) REPLACE_IF_NOT_NULL NULL
+) ENGINE=OLAP
+AGGREGATE KEY(order_id)
+COMMENT 'OLAP'
+DISTRIBUTED BY HASH(order_id) BUCKETS 1
+PROPERTIES (
+"replication_allocation" = "tag.location.default: 1"
+);
+```
+
+The existing data in the table is as follows:
+
+```Plain
++----------+--------------+-----------------+
+| order_id | order_amount | order_status |
++----------+--------------+-----------------+
+| 1 | 100 | Pending Payment |
++----------+--------------+-----------------+
+1 row in set (0.01 sec)
+```
+
+**Ingest data**
+
+For Stream Load, Broker Load, Routine Load, or INSERT INTO, you can directly
write the updates to the fields.
+
+**Example**
+
+Using the same example as above, the corresponding Stream Load command would
be (no additional headers required):
+
+```Plain
+curl --location-trusted -u root: -H "column_separator:," -H
"columns:order_id,order_status" -T /tmp/update.csv
http://127.0.0.1:48037/api/db1/order_tbl/_stream_load
+```
+
+The corresponding `INSERT INTO` statement would be (no additional session
variables required):
+
+```Plain
+INSERT INTO order_tbl (order_id, order_status) values (1,'Delivery Pending');
+```
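+
+After this partial update, the untouched `order_amount` should keep its old value while `order_status` is replaced, because the omitted column is written as NULL and ignored by `REPLACE_IF_NOT_NULL`. A sketch of the verification query:
+
+```SQL
+SELECT order_id, order_amount, order_status FROM order_tbl WHERE order_id = 1;
+-- Expected: order_amount is still 100, order_status is now 'Delivery Pending'
+```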
+
+## Note
+
+The Aggregate Key model does not perform additional data processing during
data writing, so its writing performance is the same as that of other models.
However, aggregation during queries can result in performance loss.
Typical aggregation queries can be 5~10 times slower than queries on
Merge-on-Write tables in the Unique Key model.
+
+In this case, users cannot change a field from a non-NULL value to NULL,
because NULL values written in will be automatically ignored by the
REPLACE_IF_NOT_NULL aggregation function.
diff --git a/versioned_docs/version-2.0/db-connect/database-connect.md
b/versioned_docs/version-2.0/db-connect/database-connect.md
index 8aa246e2b0f6..1b5bd7cb4ede 100644
--- a/versioned_docs/version-2.0/db-connect/database-connect.md
+++ b/versioned_docs/version-2.0/db-connect/database-connect.md
@@ -101,6 +101,8 @@ ALTER SYSTEM ADD BACKEND
"be_host_ip:heartbeat_service_port";

-:::tip For successful execution of statements that are not related to specific
databases/tables in the Playground, it is necessary to randomly select a
database from the left-hand database panel. This limitation will be removed
later.
+:::tip
+For successful execution of statements that are not related to specific
databases/tables in the Playground, it is necessary to randomly select a
database from the left-hand database panel. This limitation will be removed
later.
-The current built-in web console cannot execute SET type SQL statements.
Therefore, the web console does not support statements like SET PASSWORD FOR
'user' = PASSWORD('user_password'). :::
+The current built-in web console cannot execute SET type SQL statements.
Therefore, the web console does not support statements like SET PASSWORD FOR
'user' = PASSWORD('user_password').
+:::
diff --git
a/versioned_docs/version-2.0/install/cluster-deployment/standard-deployment.md
b/versioned_docs/version-2.0/install/cluster-deployment/standard-deployment.md
index 6b6ea3fa0501..7536e8117180 100644
---
a/versioned_docs/version-2.0/install/cluster-deployment/standard-deployment.md
+++
b/versioned_docs/version-2.0/install/cluster-deployment/standard-deployment.md
@@ -83,13 +83,14 @@ Doris can run on servers with x86-64 architecture or ARM64
architecture.
| Frontend | 8+ Cores | 8+ GB | SSD or SATA, 10+ GB | 1 or 10 GbE | 1
|
| Backend | 8+ Cores | 16+ GB | SSD or SATA, 50+ GB | 1 or 10 GbE | 1
|
-:::tip Description:
-
+:::tip
+Description:
- In a validation testing environment, you can deploy the FE and BE on the
same server.
- It is generally recommended to deploy only **one BE instance on a single
machine**. Likewise, you can only deploy **one FE** on one machine.
- If you need three replicas of data, you will need at least three machines
and deploy one BE instance on each of them, instead of deploying three BE
instances on one single machine.
- **The clocks of multiple servers hosting FEs must be synchronized, allowing
a maximum clock deviation of 5 seconds.**
-- In a testing environment, you can also perform tests with just one BE. In
the actual production environment, the number of BE instances directly affects
the overall query latency. :::
+- In a testing environment, you can also perform tests with just one BE. In
the actual production environment, the number of BE instances directly affects
the overall query latency.
+:::
**Production environment**
@@ -98,11 +99,12 @@ Doris can run on servers with x86-64 architecture or ARM64
architecture.
| Frontend | 16+ Cores | 64+ GB | SSD or RAID, 100+ GB | 10 GbE | 1
|
| Backend | 16+ Cores | 64+ GB | SSD or SATA, 100+ GB | 10 GbE | 3
|
-:::tip Description:
-
+:::tip
+Description:
- In a production environment, if the FE and BE are co-located, be mindful of
resource contention issues. It is recommended to store metadata and data on
separate disks.
- BE nodes can be configured with multiple disks for storage. You can bind
multiple HDDs or SSDs to a single BE instance.
-- The performance of the cluster is dependent on the resources of the BE
nodes. The more BE nodes there are, the better the performance of Doris.
Typically, Doris can fully leverage its performance potential with 10 to 100
machines. :::
+- The performance of the cluster is dependent on the resources of the BE
nodes. The more BE nodes there are, the better the performance of Doris.
Typically, Doris can fully leverage its performance potential with 10 to 100
machines.
+:::
### Hard disk space calculation
@@ -137,7 +139,9 @@ tmpfs /tmp tmpfs nodev,nosuid
0 0
/dev/sda3 /home ext4 defaults,noatime 0 2
```
-:::caution It is not recommended to disable swap by setting vm.swappiness = 0
because this parameter has different semantics in different Linux kernel
versions. In many cases, this will not completely disable swap. :::
+:::caution
+It is not recommended to disable swap by setting vm.swappiness = 0 because
this parameter has different semantics in different Linux kernel versions. In
many cases, this will not completely disable swap.
+:::
### Check and disable system firewall
@@ -169,7 +173,9 @@ vi /etc/security/limits.conf
* hard nofile 1000000
```
-:::caution For the changes to take effect, the current user needs to log out
of the current session and log back in. :::
+:::caution
+For the changes to take effect, the current user needs to log out of the
current session and log back in.
+:::
### Modify the number of virtual memory areas
@@ -326,10 +332,10 @@ ALTER SYSTEM ADD OBSERVER
"<fe_ip_address>:<fe_edit_log_port>"
```
:::note
-
1. It is recommended to have an odd number of FE Follower nodes (including the
Master node), with a suggested number of three to achieve high availability.
2. When FE is deployed in a high availability mode (1 Master node, 2 Follower
nodes), it is advisable to increase the scalability of the FE read services by
adding an Observer FE.
-3. Typically, one FE node can handle around 10-20 BE nodes. It is recommended
to keep the total number of FE nodes below 10. :::
+3. Typically, one FE node can handle around 10-20 BE nodes. It is recommended
to keep the total number of FE nodes below 10.
+:::
**Start** **FE** **Follower node**
@@ -387,13 +393,13 @@
storage_root_path=/home/disk1/doris,medium:HDD;/home/disk2/doris,medium:SSD
```
:::note
-
1. When specifying the storage type for the storage path, at least one path
should have the storage type set as HDD.
2. If the storage type for a storage path is not explicitly declared, it
defaults to HDD.
3. Specifying the storage type as HDD or SSD is to differentiate the storage
types for the paths. It is independent of the physical storage medium. For
example, you can designate a directory on an HDD disk as SSD.
-4. The keywords for storage types, HDD and SSD, must be capitalized. :::
+4. The keywords for storage types, HDD and SSD, must be capitalized.
+:::
-5. Bind cluster IP
+6. Bind cluster IP
In a multi-network interface environment, it is necessary to explicitly
configure the "priority_networks". However, in an environment where FQDN is
configured, the "priority_networks" can be ignored.
@@ -534,17 +540,21 @@ Doris processes listen to network segments represented in
CIDR format for IP add
This parameter is mainly used to help the system select the correct network
card IP as its listening IP. For example, if the required listening IP is
192.168.0.1, you can set `priority_networks=192.168.0.0/24`. The system will
automatically scan all IPs on the machine, and only those matching the
192.168.0.0/24 network segment will be used as service listening addresses. You
can also configure multiple CIDR network segments in this parameter, such as
`priority_networks = 10.10.0.0/16; 19 [...]
-:::tip **Why use priority_networks to configure listening address segments
instead of directly setting the listening IP address in the configuration
file?**
+:::tip
+**Why use priority_networks to configure listening address segments instead of
directly setting the listening IP address in the configuration file?**
-The main reason is that Doris is a distributed cluster, and the same
configuration file will be deployed on multiple nodes. To facilitate
deployment, updates, and maintenance, it is desirable to keep all nodes'
configuration files consistent. By configuring the listening address segment
and then starting the system, the appropriate listening IP can be found based
on this network segment, thus each machine can use a single value in this
configuration. :::
+The main reason is that Doris is a distributed cluster, and the same
configuration file will be deployed on multiple nodes. To facilitate
deployment, updates, and maintenance, it is desirable to keep all nodes'
configuration files consistent. By configuring the listening address segment
and then starting the system, the appropriate listening IP can be found based
on this network segment, thus each machine can use a single value in this
configuration.
+:::
### Why do new BE nodes need to be manually added to the cluster?
After the BE node is started, you need to send a command to the FE through the
MySQL protocol or the built-in web console to join the BE node to the cluster.
-:::tip **How does the FE know which BE nodes constitute the cluster?**
+:::tip
+**How does the FE know which BE nodes constitute the cluster?**
-As a distributed database, Doris generally has many BE nodes. Doris adds BE
nodes to the cluster by sending commands to the FE. This is different from the
way how BE nodes know the addresses of FE nodes and then actively report
connections. The way of manual adding and having the FE actively connect to BE
nodes is beneficial to cluster management in many aspects. For example, it can
determine which nodes constitute the cluster and it can proactively shut down a
BE node that cannot be con [...]
+As a distributed database, Doris generally has many BE nodes. Doris adds BE
nodes to the cluster by sending commands to the FE. This is different from
having BE nodes know the addresses of FE nodes and then actively report
connections. Manually adding BE nodes and having the FE actively connect to BE
nodes is beneficial to cluster management in many aspects. For example, it can
determine which nodes constitute the cluster and it can proactively shut down a
BE node that cannot be con [...]
+:::
### How to quickly detect whether the FE has started successfully?
@@ -574,7 +584,9 @@ ALTER SYSTEM ADD BACKEND
"be_host_ip:heartbeat_service_port";

-:::tip For successful execution of statements that are not related to specific
databases/tables in the Playground, it is necessary to randomly select a
database from the left-hand database panel. This limitation will be removed
later. :::well-log-
+:::tip
+For successful execution of statements that are not related to specific
databases/tables in the Playground, it is necessary to randomly select a
database from the left-hand database panel. This limitation will be removed
later.
+:::
### Why can't I change the root password via the Web UI?
diff --git
a/versioned_docs/version-2.0/install/source-install/compilation-arm.md
b/versioned_docs/version-2.0/install/source-install/compilation-arm.md
index 0c45db114453..d60a711f546f 100644
--- a/versioned_docs/version-2.0/install/source-install/compilation-arm.md
+++ b/versioned_docs/version-2.0/install/source-install/compilation-arm.md
@@ -250,7 +250,9 @@ gcc --version
## Compile
-:::tip It is recommended to use LDB Toolchain for compilation with ARM
environment. :::
+:::tip
+It is recommended to use LDB Toolchain for compilation in an ARM environment.
+:::
When compiling Doris on an ARM platform, please disable the AVX2 and LIBUNWIND
third-party libraries.
diff --git
a/versioned_docs/version-2.0/install/source-install/compilation-mac.md
b/versioned_docs/version-2.0/install/source-install/compilation-mac.md
index e0493452476a..0a3c7a0b1bb7 100644
--- a/versioned_docs/version-2.0/install/source-install/compilation-mac.md
+++ b/versioned_docs/version-2.0/install/source-install/compilation-mac.md
@@ -42,7 +42,9 @@ brew install automake autoconf libtool pkg-config texinfo
coreutils gnu-getopt \
python@3 cmake ninja ccache bison byacc gettext wget pcre maven llvm@16
openjdk@11 npm
```
-:::tip On MacOS, since Homebrew does not provide an installation package for
JDK8, JDK11 is used here instead. Alternatively, you can manually download and
install JDK8. :::
+:::tip
+On MacOS, since Homebrew does not provide an installation package for JDK8,
JDK11 is used here instead. Alternatively, you can manually download and
install JDK8.
+:::
2. **Compile source code**
@@ -50,7 +52,9 @@ python@3 cmake ninja ccache bison byacc gettext wget pcre
maven llvm@16 openjdk@
bash build.sh
```
-:::tip The first step of compiling Doris is to download and compile
third-party libraries. You can download the pre-compiled versions of
third-party libraries provided by the Doris community. Please refer to the
instructions below for downloading **pre-compiled third-party libraries** to
speed up the build process. :::
+:::tip
+The first step of compiling Doris is to download and compile third-party
libraries. You can download the pre-compiled versions of third-party libraries
provided by the Doris community. Please refer to the instructions below for
downloading **pre-compiled third-party libraries** to speed up the build
process.
+:::
## Start
@@ -107,4 +111,6 @@ cd installed/bin
./thrift --version
```
-:::tip When running protoc and thrift, you may encounter an issue where the
binary cannot be opened due to developer verification. To resolve this, you can
go to "Security & Privacy" settings. In the "General" tab, click on the "Open
Anyway" button to confirm your intent to open the binary. Refer to:
https://support.apple.com/en-us/102445 :::
+:::tip
+When running protoc and thrift, you may encounter an issue where the binary
cannot be opened due to developer verification. To resolve this, you can go to
"Security & Privacy" settings. In the "General" tab, click on the "Open Anyway"
button to confirm your intent to open the binary. Refer to:
https://support.apple.com/en-us/102445
+:::
diff --git
a/versioned_docs/version-2.0/install/source-install/compilation-with-ldb-toolchain.md
b/versioned_docs/version-2.0/install/source-install/compilation-with-ldb-toolchain.md
index aa3dd0b87612..7526fe78514c 100644
---
a/versioned_docs/version-2.0/install/source-install/compilation-with-ldb-toolchain.md
+++
b/versioned_docs/version-2.0/install/source-install/compilation-with-ldb-toolchain.md
@@ -28,9 +28,11 @@ under the License.
This guide is about how to compile Doris using the LDB Toolchain. This method
serves as a supplement to the Docker compilation approach to help the
developers and users without a Docker environment. The recommended LDB
Toolchain version is 0.17, which includes clang-16 and gcc-11.
-:::tip LDB Toolchain is fully known as Linux Distribution Based Toolchain
Generator. It helps compile modern C++ projects on almost all Linux
distributions. :::
+:::tip
+LDB Toolchain is short for Linux Distribution Based Toolchain Generator.
It helps compile modern C++ projects on almost all Linux distributions.
Special thanks to [Amos Bird](https://github.com/amosbird) for the
contribution.
+:::
## Prepare the compilation environment:
@@ -40,7 +42,9 @@ This method applies to most Linux distributions (CentOS,
Ubuntu, etc.).
Download the latest `ldb_toolchain_gen.sh` from
[here](https://github.com/amosbird/ldb_toolchain_gen/releases). This script is
used to generate ldb toolchain.
-:::tip For more information, please visit
https://github.com/amosbird/ldb_toolchain_gen :::
+:::tip
+For more information, please visit
https://github.com/amosbird/ldb_toolchain_gen
+:::
2. **Execute the following command to generate ldb toolchain.**
@@ -105,7 +109,9 @@ export PATH=/path/to/ldb_toolchain/bin:$PATH
## Compile Doris
-:::tip The first step of compiling Doris source code is to first download
third-party libraries and compile them. You can refer to the following
instructions to download precompiled versions of the third-party libraries. :::
+:::tip
+The first step of compiling Doris source code is to first download third-party
libraries and compile them. You can refer to the following instructions to
download precompiled versions of the third-party libraries.
+:::
1. **Enter the Doris source code directory and execute the following command
to check if the compilation machine supports the AVX2 instruction set.**
diff --git a/versioned_docs/version-2.0/lakehouse/datalake/hive.md
b/versioned_docs/version-2.0/lakehouse/datalake/hive.md
index 4724d0e903b2..81450b2b7ff3 100644
--- a/versioned_docs/version-2.0/lakehouse/datalake/hive.md
+++ b/versioned_docs/version-2.0/lakehouse/datalake/hive.md
@@ -26,22 +26,22 @@ under the License.
# Hive
-By connecting to Hive Metastore, or a metadata service compatible with Hive
Metatore, Doris can automatically obtain Hive database table information and
perform data queries.
+By connecting to Hive Metastore, Doris can automatically obtain database and
table information from Hive and perform data queries.
-In addition to Hive, many other systems also use the Hive Metastore to store
metadata. So through Hive Catalog, we can not only access Hive, but also access
systems that use Hive Metastore as metadata storage. Such as Iceberg, Hudi, etc.
+In addition to Hive, systems like Iceberg and Hudi also use Hive Metastore for
metadata storage. Hive Catalog in Doris allows users to easily integrate with
not only Hive but also systems that use Hive Metastore as metadata storage.
-## Terms and Conditions
+## Note
-1. Need to put core-site.xml, hdfs-site.xml and hive-site.xml in the conf
directory of FE and BE. First read the hadoop configuration file in the conf
directory, and then read the related to the environment variable
`HADOOP_CONF_DIR` configuration file.
-2. hive supports version 1/2/3.
-3. Support Managed Table and External Table and part of Hive View.
-4. Can identify hive, iceberg, hudi metadata stored in Hive Metastore.
+- Hive 1, Hive 2, and Hive 3 are supported.
+- Managed Tables, External Tables, and part of the Hive Views are supported.
+- It can recognize Hive, Iceberg, and Hudi metadata stored in the Hive
Metastore.
+- You should place core-site.xml, hdfs-site.xml, and hive-site.xml in the conf
directory of both FE and BE. During reading, the hadoop configuration files in
the conf directory are read first, followed by the configuration files related
to the HADOOP_CONF_DIR environment variable.
## Create Catalog
-### Hive On HDFS
+### Hive on HDFS
-```sql
+```SQL
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
@@ -54,11 +54,11 @@ CREATE CATALOG hive PROPERTIES (
);
```
-In addition to the two required parameters of `type` and
`hive.metastore.uris`, more parameters can be passed to pass the information
required for the connection.
+In addition to the two required parameters of `type` and
`hive.metastore.uris`, you can specify more parameters to pass the required
information for the connection.
-If HDFS HA information is provided, the example is as follows:
+Example to provide HDFS HA information:
-```sql
+```SQL
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
@@ -71,33 +71,9 @@ CREATE CATALOG hive PROPERTIES (
);
```
-Provide HDFS HA information and Kerberos authentication information at the
same time, examples are as follows:
+### Hive on VIEWFS
-```sql
-CREATE CATALOG hive PROPERTIES (
- 'type'='hms',
- 'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
- 'hive.metastore.sasl.enabled' = 'true',
- 'hive.metastore.kerberos.principal' = 'your-hms-principal',
- 'dfs.nameservices'='your-nameservice',
- 'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
- 'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:8088',
- 'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:8088',
-
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
- 'hadoop.security.authentication' = 'kerberos',
- 'hadoop.kerberos.keytab' = '/your-keytab-filepath/your.keytab',
- 'hadoop.kerberos.principal' = '[email protected]',
- 'yarn.resourcemanager.principal' = 'your-rm-principal'
-);
-```
-
-Please place the `krb5.conf` file and `keytab` authentication file under all
`BE` and `FE` nodes. The path of the `keytab` authentication file is consistent
with the configuration. The `krb5.conf` file is placed in `/etc by default
/krb5.conf` path.
-
-The value of `hive.metastore.kerberos.principal` needs to be consistent with
the property of the same name of the connected hive metastore, which can be
obtained from `hive-site.xml`.
-
-### Hive On VIEWFS
-
-```sql
+```SQL
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
@@ -113,17 +89,17 @@ CREATE CATALOG hive PROPERTIES (
);
```
-viewfs related parameters can be added to the catalog configuration as above,
or added to `conf/core-site.xml`.
+You can add viewfs-related parameters to the catalog configuration as above,
or add them to `conf/core-site.xml`.
-How viewfs works and parameter configuration, please refer to relevant hadoop
documents, for example,
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ViewFs.html
+For details about how viewfs works and the parameter configurations, please
refer to relevant hadoop documents, such as
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/ViewFs.html
-### Hive On JuiceFS
+### Hive on JuiceFS
-Data is stored in JuiceFS, examples are as follows:
+Example for data stored on JuiceFS:
-(Need to put `juicefs-hadoop-x.x.x.jar` under `fe/lib/` and
`apache_hdfs_broker/lib/`)
+(You should put `juicefs-hadoop-x.x.x.jar` under `fe/lib/` and
`apache_hdfs_broker/lib/`.)
-```sql
+```SQL
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
@@ -134,9 +110,9 @@ CREATE CATALOG hive PROPERTIES (
);
```
-### Hive On S3
+### Hive on S3
-```sql
+```SQL
CREATE CATALOG hive PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://172.0.0.1:9083",
@@ -148,15 +124,15 @@ CREATE CATALOG hive PROPERTIES (
);
```
-Options:
+Optional properties:
-* s3.connection.maximum: s3 maximum connection number, default 50
-* s3.connection.request.timeout: s3 request timeout, default 3000ms
-* s3.connection.timeout: s3 connection timeout, default 1000ms
+- s3.connection.maximum: defaults to 50
+- s3.connection.request.timeout: defaults to 3000ms
+- s3.connection.timeout: defaults to 1000ms
-### Hive On OSS
+### Hive on OSS
-```sql
+```SQL
CREATE CATALOG hive PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://172.0.0.1:9083",
@@ -166,9 +142,9 @@ CREATE CATALOG hive PROPERTIES (
);
```
-### Hive On OBS
+### Hive on OBS
-```sql
+```SQL
CREATE CATALOG hive PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://172.0.0.1:9083",
@@ -178,9 +154,9 @@ CREATE CATALOG hive PROPERTIES (
);
```
-### Hive On COS
+### Hive on COS
-```sql
+```SQL
CREATE CATALOG hive PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://172.0.0.1:9083",
@@ -190,9 +166,24 @@ CREATE CATALOG hive PROPERTIES (
);
```
-### Hive With Glue
+### Hive with AWS Glue
+
+:::tip
+When connecting to Glue, if you are in a non-EC2 environment, you need to copy
the `~/.aws` directory from the EC2 environment to your current environment.
Alternatively, you can download the AWS
[CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
tool for configuration, which will also create the `.aws` directory in the
current user's home directory.
+:::
+
+Doris supports two authentication methods for accessing AWS services.
-```sql
+**Authentication using Catalog properties**
+
+You can fill in the basic properties of credentials, such as:
+
+- When accessing S3, you can use `s3.endpoint`, `s3.access_key`,
`s3.secret_key`.
+- When accessing Glue, you can use `glue.endpoint`, `glue.access_key`,
`glue.secret_key`.
+
+For example, if you are accessing Glue using an Iceberg Catalog, you can fill in
the following properties to access tables hosted on Glue:
+
+```SQL
CREATE CATALOG hive PROPERTIES (
"type"="hms",
"hive.metastore.type" = "glue",
@@ -202,22 +193,33 @@ CREATE CATALOG hive PROPERTIES (
);
```
+**Authentication using system properties**
+
+This authentication method is used for applications running on AWS resources
such as EC2 instances. It avoids hard-coding credentials and enhances data
security.
+
+If you create a Catalog without filling in the Credentials properties, the
DefaultAWSCredentialsProviderChain will be used by default. It can read
properties configured in system environment variables or instance profiles.
+
+For configurations of environment variables and system properties, refer to
[AWS
CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html).
+
+- Optional environment variables include: `AWS_ACCESS_KEY_ID`,
`AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, `AWS_ROLE_ARN`,
`AWS_WEB_IDENTITY_TOKEN_FILE`, etc.
+- You can follow [AWS
Configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html)
to directly configure the Credentials information, which will generate a
credentials file in the `~/.aws` directory.
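+
+For example, a minimal sketch of a Glue catalog that relies on system credentials instead of explicit keys (the catalog name and endpoint are illustrative; credentials are resolved through the DefaultAWSCredentialsProviderChain):
+
+```SQL
+CREATE CATALOG glue_hms PROPERTIES (
+    "type" = "hms",
+    "hive.metastore.type" = "glue",
+    "glue.endpoint" = "https://glue.us-east-1.amazonaws.com"
+    -- no glue.access_key / glue.secret_key: environment variables,
+    -- the ~/.aws configuration, or the instance profile are used instead
+);
+```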
+
## Metadata Cache & Refresh
-For Hive Catalog, 4 types of metadata are cached in Doris:
+For Hive Catalog, Doris caches 4 types of metadata:
-1. Table structure: cache table column information, etc.
-2. Partition value: Cache the partition value information of all partitions of
a table.
-3. Partition information: Cache the information of each partition, such as
partition data format, partition storage location, partition value, etc.
-4. File information: Cache the file information corresponding to each
partition, such as file path location, etc.
+- Table schema: table column information, etc.
+- Partition value: partition value information of all partitions in a table.
+- Partition information: information of each partition, such as partition data
format, partition storage location, partition value, etc.
+- File information: file information corresponding to each partition, such as
file path location, etc.
-The above cache information will not be persisted to Doris, so operations such
as restarting Doris's FE node, switching masters, etc. may cause the cache to
become invalid. After the cache expires, Doris will directly access the Hive
MetaStore to obtain information and refill the cache.
+The above cached information will not be persisted in Doris, so operations
such as restarting the Doris FE node or switching Masters may invalidate the
cache. Once the cache expires, Doris will directly access the Hive MetaStore to
obtain information and refill the cache.
-Metadata cache can be updated automatically, manually, or configured with TTL
(Time-to-Live) according to user needs.
+Metadata cache can be updated automatically, manually, or configured with TTL
(Time-to-Live).
### Default behavior and TTL
-By default, the metadata cache expires 10 minutes after it is first accessed.
This time is determined by the configuration parameter
`external_cache_expire_time_minutes_after_access` in fe.conf. (Note that in
versions 2.0.1 and earlier, the default value for this parameter was 1 day).
+By default, the metadata cache expires 10 minutes after it is first accessed.
The TTL is determined by the configuration parameter
`external_cache_expire_time_minutes_after_access` in fe.conf. (Note that in
versions 2.0.1 and earlier, the default value for this parameter was 1 day).
For example, if the user accesses the metadata of table A for the first time
at 10:00, then the metadata will be cached and will automatically expire after
10:10. If the user accesses the same metadata again at 10:11, Doris will
directly access the Hive MetaStore to obtain information and refill the cache.
@@ -225,7 +227,7 @@ For example, if the user accesses the metadata of table A
for the first time at
For the `INSERT INTO OVERWRITE PARTITION` operation commonly used in Hive, you
can also timely update the `File Information Cache` by configuring the TTL of
the `File Information Cache`:
-```
+```Plain
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
@@ -241,39 +243,37 @@ You can also set this value to 0 to disable file caching,
which will fetch file
Users need to manually refresh the metadata through the
[REFRESH](../../sql-manual/sql-reference/Utility-Statements/REFRESH.md) command.
-1. REFRESH CATALOG: Refresh the specified Catalog.
-
- ```
- REFRESH CATALOG ctl1 PROPERTIES("invalid_cache" = "true");
- ```
+**REFRESH CATALOG: Refresh the specified Catalog.**
- This command will refresh the database list, table list, and all cache
information of the specified Catalog.
-
- `invalid_cache` indicates whether to flush the cache. Defaults to true.
If it is false, only the database and table list of the catalog will be
refreshed, but the cache information will not be refreshed. This parameter is
applicable when the user only wants to synchronize newly added or deleted
database/table information.
+```Shell
+REFRESH CATALOG ctl1 PROPERTIES("invalid_cache" = "true");
+```
-2. REFRESH DATABASE: Refresh the specified Database.
+- This command refreshes the database list, table list, and all cached
information of the specified Catalog.
+- `invalid_cache` indicates whether to refresh the cache. It defaults to true.
If it is false, it will only refresh the database and table list of the
catalog, but not the cached information. This parameter is applicable when the
user only wants to synchronize newly added or deleted database/table
information.
- ```
- REFRESH DATABASE [ctl.]db1 PROPERTIES("invalid_cache" = "true");
- ```
+**REFRESH DATABASE: Refresh the specified Database.**
- This command will refresh the table list of the specified Database and
all cached information under the Database.
+```Shell
+REFRESH DATABASE [ctl.]db1 PROPERTIES("invalid_cache" = "true");
+```
- The meaning of the `invalid_cache` attribute is the same as above.
Defaults to true. If false, only the Database's table list will be refreshed,
not cached information. This parameter is suitable for users who only want to
synchronize newly added or deleted table information.
+- This command refreshes the table list of the specified database and all
cached information under the database.
+- `invalid_cache` indicates the same as above. It defaults to true. If false,
it will only refresh the table list of the database, but not the cached
information. This parameter is applicable when the user only wants to
synchronize newly added or deleted table information.
-3. REFRESH TABLE: Refresh the specified Table.
+**REFRESH TABLE: Refresh the specified Table.**
- ```
- REFRESH TABLE [ctl.][db.]tbl1;
- ```
+```Shell
+REFRESH TABLE [ctl.][db.]tbl1;
+```
- This command will refresh all cache information under the specified Table.
+- This command refreshes all cached information under the specified table.
### Regular refresh
-Users can set the scheduled refresh of the Catalog when creating the Catalog.
+Users can schedule the refresh for a Catalog when creating the Catalog.
-```
+```Plain
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
@@ -281,41 +281,43 @@ CREATE CATALOG hive PROPERTIES (
);
```
-In the above example, `metadata_refresh_interval_sec` means refreshing the
Catalog every 600 seconds. Equivalent to automatically executing every 600
seconds:
+In the above example, `metadata_refresh_interval_sec` means refreshing the
Catalog every 600 seconds, which means the following command will be executed
automatically every 600 seconds:
-`REFRESH CATALOG ctl1 PROPERTIES("invalid_cache" = "true");`
+```
+REFRESH CATALOG ctl1 PROPERTIES("invalid_cache" = "true");
+```
-The scheduled refresh interval must not be less than 5 seconds.
+Note that the scheduled refresh interval must not be less than 5 seconds.
-### Auto Refresh
+### Auto refresh
Currently, Doris only supports automatic update of metadata in Hive Metastore
(HMS). It perceives changes in metadata by the FE node which regularly reads
the notification events from HMS. The supported events are as follows:
| Event | Corresponding Update Operation
|
-| :-------------- |
:----------------------------------------------------------- |
+| --------------- |
------------------------------------------------------------ |
| CREATE DATABASE | Create a database in the corresponding catalog.
|
| DROP DATABASE | Delete a database in the corresponding catalog.
|
-| ALTER DATABASE | Such alterations mainly include changes in properties,
comments, or storage location of databases. They do not affect Doris' queries
in External Catalogs so they will not be synchronized. |
+| ALTER DATABASE | Such alterations mainly include changes in properties,
comments, or storage location of databases. They do not affect Doris' queries
in External Catalogs. |
| CREATE TABLE | Create a table in the corresponding database.
|
| DROP TABLE | Delete a table in the corresponding database, and
invalidate the cache of that table. |
-| ALTER TABLE | If it is a renaming, delete the table of the old name, and
then create a new table with the new name; otherwise, invalidate the cache of
that table. |
+| ALTER TABLE | If it is renaming, delete the table of the old name, and
then create a new table with the new name; otherwise, invalidate the cache of
that table. |
| ADD PARTITION | Add a partition to the cached partition list of the
corresponding table. |
| DROP PARTITION | Delete a partition from the cached partition list of the
corresponding table, and invalidate the cache of that partition. |
| ALTER PARTITION | If it is a renaming, delete the partition of the old name,
and then create a new partition with the new name; otherwise, invalidate the
cache of that partition. |
-> After data ingestion, changes in partition tables will follow the `ALTER
PARTITION` logic, while those in non-partition tables will follow the `ALTER
TABLE` logic.
->
-> If changes are conducted on the file system directly instead of through the
HMS, the HMS will not generate an event. As a result, such changes will not be
perceived by Doris.
+After data ingestion, changes in partition tables will follow the `ALTER
PARTITION` logic, while those in non-partition tables will follow the `ALTER
TABLE` logic.
+
+If changes are conducted on the file system directly instead of through the
HMS, the HMS will not generate an event. As a result, such changes will not be
perceived by Doris.
The automatic update feature involves the following parameters in fe.conf:
-1. `enable_hms_events_incremental_sync`: This specifies whether to enable
automatic incremental synchronization for metadata, which is disabled by
default.
-2. `hms_events_polling_interval_ms`: This specifies the interval between two
readings, which is set to 10000 by default. (Unit: millisecond)
-3. `hms_events_batch_size_per_rpc`: This specifies the maximum number of
events that are read at a time, which is set to 500 by default.
+- `enable_hms_events_incremental_sync`: This specifies whether to enable
automatic incremental synchronization for metadata, which is disabled by
default.
+- `hms_events_polling_interval_ms`: This specifies the interval between two
readings, which is set to 10000 by default. (Unit: millisecond)
+- `hms_events_batch_size_per_rpc`: This specifies the maximum number of events
that are read at a time, which is set to 500 by default.
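+
+As a sketch, enabling this in fe.conf could look like the following (the last two values are simply the documented defaults, shown for completeness):
+
+```Plain
+enable_hms_events_incremental_sync = true
+hms_events_polling_interval_ms = 10000
+hms_events_batch_size_per_rpc = 500
+```
+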
To enable automatic update(Excluding Huawei MRS), you need to modify the
hive-site.xml of HMS and then restart HMS and HiveServer2:
-```
+```Plain
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
@@ -328,43 +330,24 @@ To enable automatic update(Excluding Huawei MRS), you
need to modify the hive-si
<name>hive.metastore.transactional.event.listeners</name>
<value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
</property>
-
```
-Huawei's MRS needs to change hivemetastore-site.xml and restart HMS and
HiveServer2:
+For Huawei MRS, you need to change hivemetastore-site.xml and restart HMS and
HiveServer2:
-```
+```Plain
<property>
<name>metastore.transactional.event.listeners</name>
<value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
</property>
```
-Note: Value is appended with commas separated from the original value, not
overwritten.For example, the default configuration for MRS 3.1.0 is
-
-```
-<property>
- <name>metastore.transactional.event.listeners</name>
-
<value>com.huawei.bigdata.hive.listener.TableKeyFileManagerListener,org.apache.hadoop.hive.metastore.listener.FileAclListener</value>
-</property>
-```
-
-We need to change to
-
-```
-<property>
- <name>metastore.transactional.event.listeners</name>
-
<value>com.huawei.bigdata.hive.listener.TableKeyFileManagerListener,org.apache.hadoop.hive.metastore.listener.FileAclListener,org.apache.hive.hcatalog.listener.DbNotificationListener</value>
-</property>
-```
-
-> Note: To enable automatic update, whether for existing Catalogs or newly
created Catalogs, all you need is to set `enable_hms_events_incremental_sync`
to `true`, and then restart the FE node. You don't need to manually update the
metadata before or after the restart.
-
## Hive Version
-Doris can correctly access the Hive Metastore in different Hive versions. By
default, Doris will access the Hive Metastore with a Hive 2.3 compatible
interface. You can also specify the hive version when creating the Catalog. If
accessing Hive 1.1.0 version:
+Doris can correctly access the Hive Metastore in different Hive versions. By
default, Doris will access the Hive Metastore with a Hive 2.3 compatible
interface. If you encounter the `Invalid method name: 'get_table_req'` error
during queries, the root cause is Hive version incompatibility.
-```sql
+You can specify the Hive version when creating the Catalog. For example, to
access Hive 1.1.0 version:
+
+```Plain
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
@@ -376,126 +359,214 @@ CREATE CATALOG hive PROPERTIES (
For Hive/Iceberg/Hudi
-| HMS Type | Doris Type | Comment |
-|---|---|---|
-| boolean| boolean | |
-| tinyint|tinyint | |
-| smallint| smallint| |
-| int| int | |
-| bigint| bigint | |
-| date| date| |
-| timestamp| datetime| |
-| float| float| |
-| double| double| |
-| char| char | |
-| varchar| varchar| |
-| decimal| decimal | |
-| `array<type>` | `array<type>`| support nested type, for example
`array<array<int>>` |
-| `map<KeyType, ValueType>` | `map<KeyType, ValueType>` | support nested type,
for example `map<string, array<int>>` |
-| `struct<col1: Type1, col2: Type2, ...>` | `struct<col1: Type1, col2: Type2,
...>` | support nested type, for example `struct<col1: array<int>, col2:
map<int, date>>` |
-| other | unsupported | |
-
-## Whether to truncate char or varchar columns according to the schema of the
hive table
-
-If the variable `truncate_char_or_varchar_columns` is enabled, when the
maximum length of the char or varchar column in the schema of the hive table is
inconsistent with the schema in the underlying parquet or orc file, it will be
truncated according to the maximum length of the hive table column.
-
-The variable default is false.
+| HMS Type | Doris Type | Comment |
+| --- | --- | --- |
+| boolean | boolean | |
+| tinyint | tinyint | |
+| smallint | smallint | |
+| int | int | |
+| bigint | bigint | |
+| date | date | |
+| timestamp | datetime | |
+| float | float | |
+| double | double | |
+| char | char | |
+| varchar | varchar | |
+| decimal | decimal | |
+| `array<type>` | `array<type>` | Supports nested structure, such as `array<map<string, int>>` |
+| `map<KeyType, ValueType>` | `map<KeyType, ValueType>` | Supports nested structure, such as `map<string, array<int>>` |
+| `struct<col1: Type1, col2: Type2, ...>` | `struct<col1: Type1, col2: Type2, ...>` | Supports nested structure, such as `struct<col1: array<int>, col2: map<int, date>>` |
+| other | unsupported | |
+
+:::tip
+**Truncate char or varchar columns**
+
+If `truncate_char_or_varchar_columns` is enabled, and the maximum length of
the char or varchar column in the schema of the hive table is inconsistent with
the schema in the underlying Parquet or ORC file, the columns will be truncated
according to the maximum length of the hive table column.
+
+The variable defaults to false.
+:::
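+
+For example, a sketch of enabling this behavior for the current session (assuming the variable is set like other session variables):
+
+```SQL
+SET truncate_char_or_varchar_columns = true;
+```
+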
## Access HMS with broker
-Add following setting when creating an HMS catalog, file splitting and
scanning for Hive external table will be completed by broker named `test_broker`
+Add the following setting when creating an HMS catalog; file splitting and
scanning for Hive external tables will then be completed by the broker named
`test_broker`.
-```sql
+```SQL
"broker.name" = "test_broker"
```
+Doris supports querying Iceberg tables in an HMS Catalog through the Broker by
implementing the Iceberg FileIO interface. If needed, you can add the following
configuration when creating an HMS Catalog.
+
+```SQL
+"io-impl" = "org.apache.doris.datasource.iceberg.broker.IcebergBrokerIO"
+```
+
## Integrate with Apache Ranger
Apache Ranger is a security framework for monitoring, enabling services, and
comprehensive data security access management on the Hadoop platform.
-Currently doris supports ranger library, table, and column permissions, but
does not support encryption, row permissions, etc.
+Doris supports authentication on a specified External Hive Catalog using Apache
Ranger.
+
+Currently, Doris supports authentication of Ranger databases, tables, and
columns, but does not support encryption, row-level permissions, or Data Mask.
-### Settings
+If you need to use Apache Ranger for authentication of the entire Doris
cluster, refer to [Integration with Apache
Ranger](https://doris.apache.org/docs/dev/admin-manual/privilege-ldap/ranger/).
-To connect to the Hive Metastore with Ranger permission verification enabled,
you need to add configuration & configuration environment:
+### Environment settings
+
+To connect to the Hive Metastore with Ranger authentication enabled, you need
to add configurations and environment settings as follows:
1. When creating a Catalog, add:
-```sql
+```SQL
"access_controller.properties.ranger.service.name" = "hive",
"access_controller.class" =
"org.apache.doris.catalog.authorizer.RangerHiveAccessControllerFactory",
```
+:::note
+ `access_controller.properties.ranger.service.name` refers to the type of
service, such as `hive`, `hdfs`, etc. It does not correspond to the value of
`ranger.plugin.hive.service.name` in the configuration file.
+:::
+
2. Configure all FE environments:
- 1. Copy the configuration files ranger-hive-audit.xml,
ranger-hive-security.xml, and ranger-policymgr-ssl.xml under the HMS conf
directory to the FE conf directory.
-
- 2. Modify the properties of ranger-hive-security.xml, the reference
configuration is as follows:
-
- ```sql
- <?xml version="1.0" encoding="UTF-8"?>
- <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
- <configuration>
- #The directory for caching permission data, needs to be writable
- <property>
- <name>ranger.plugin.hive.policy.cache.dir</name>
- <value>/mnt/datadisk0/zhangdong/rangerdata</value>
- </property>
- #The time interval for periodically pulling permission data
- <property>
- <name>ranger.plugin.hive.policy.pollIntervalMs</name>
- <value>30000</value>
- </property>
-
- <property>
-
<name>ranger.plugin.hive.policy.rest.client.connection.timeoutMs</name>
- <value>60000</value>
- </property>
-
- <property>
-
<name>ranger.plugin.hive.policy.rest.client.read.timeoutMs</name>
- <value>60000</value>
- </property>
-
- <property>
- <name>ranger.plugin.hive.policy.rest.ssl.config.file</name>
- <value></value>
- </property>
-
- <property>
- <name>ranger.plugin.hive.policy.rest.url</name>
- <value>http://172.21.0.32:6080</value>
- </property>
-
- <property>
- <name>ranger.plugin.hive.policy.source.impl</name>
-
<value>org.apache.ranger.admin.client.RangerAdminRESTClient</value>
- </property>
-
- <property>
- <name>ranger.plugin.hive.service.name</name>
- <value>hive</value>
- </property>
-
- <property>
- <name>xasecure.hive.update.xapolicies.on.grant.revoke</name>
- <value>true</value>
- </property>
-
- </configuration>
- ```
-
- 3. In order to obtain the log of Ranger authentication itself, add the
configuration file log4j.properties in the `<doris_home>/conf` directory.
-
- 4. Restart FE.
-
-### Best Practices
-
-1. Create user user1 on the ranger side and authorize the query permission of
db1.table1.col1
-
-2. Create role role1 on the ranger side and authorize the query permission of
db1.table1.col2
-
-3. Create a user user1 with the same name in doris, user1 will directly have
the query authority of db1.table1.col1
-
-4. Create role1 with the same name in doris, and assign role1 to user1, user1
will have the query authority of db1.table1.col1 and col2 at the same time
+a. Copy the configuration files ranger-hive-audit.xml, ranger-hive-security.xml, and ranger-policymgr-ssl.xml from the HMS conf directory to the FE conf directory.
+
+b. Modify the properties in ranger-hive-security.xml. A reference configuration is as follows:
+
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
+<configuration>
+  <!-- The directory for caching permission data; it must be writable -->
+ <property>
+ <name>ranger.plugin.hive.policy.cache.dir</name>
+ <value>/mnt/datadisk0/zhangdong/rangerdata</value>
+ </property>
+  <!-- The interval (in milliseconds) for periodically pulling permission data -->
+ <property>
+ <name>ranger.plugin.hive.policy.pollIntervalMs</name>
+ <value>30000</value>
+ </property>
+
+ <property>
+ <name>ranger.plugin.hive.policy.rest.client.connection.timeoutMs</name>
+ <value>60000</value>
+ </property>
+
+ <property>
+ <name>ranger.plugin.hive.policy.rest.client.read.timeoutMs</name>
+ <value>60000</value>
+ </property>
+
+ <property>
+ <name>ranger.plugin.hive.policy.rest.ssl.config.file</name>
+ <value></value>
+ </property>
+
+ <property>
+ <name>ranger.plugin.hive.policy.rest.url</name>
+ <value>http://172.21.0.32:6080</value>
+ </property>
+
+ <property>
+ <name>ranger.plugin.hive.policy.source.impl</name>
+ <value>org.apache.ranger.admin.client.RangerAdminRESTClient</value>
+ </property>
+
+ <property>
+ <name>ranger.plugin.hive.service.name</name>
+ <value>hive</value>
+ </property>
+
+ <property>
+ <name>xasecure.hive.update.xapolicies.on.grant.revoke</name>
+ <value>true</value>
+ </property>
+
+</configuration>
+```
+
+c. To obtain the logs of Ranger authentication itself, add a log4j.properties configuration file to the `<doris_home>/conf` directory.
+
+d. Restart FE.
+
+### Best Practice
+
+1. Create `user1` on the Ranger side and authorize it to query db1.table1.col1.
+2. Create `role1` on the Ranger side and authorize it to query db1.table1.col2.
+3. Create a `user1` with the same name in Doris; `user1` is then directly granted the permission to query db1.table1.col1.
+4. Create a `role1` with the same name in Doris and assign `role1` to `user1`; `user1` is then granted the permission to query both db1.table1.col1 and col2. The Doris-side statements for steps 3 and 4 are sketched after this list.
+5. The Admin and Root privileges of Doris are not affected by Apache Ranger.
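+
+A minimal sketch of the Doris-side statements for steps 3 and 4 above, assuming standard Doris user and role management syntax; the password is a placeholder:
+
+```SQL
+-- Create the user and role whose names match the Ranger side
+CREATE USER 'user1'@'%' IDENTIFIED BY 'your_password';
+CREATE ROLE role1;
+
+-- Assign the role so that user1 also inherits the permissions granted to role1
+GRANT 'role1' TO 'user1'@'%';
+```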
+
+## Authentication with Kerberos
+
+Kerberos is an authentication protocol designed to provide strong identity
verification for applications through the use of encryption with private keys.
+
+### Environment settings
+
+1. When the services in the cluster are configured with Kerberos
authentication, the relevant authentication information is needed in Hive
Catalog configuration.
+
+- `hadoop.kerberos.keytab`: the keytab file required for authentication. It should be the same keytab file used across the Doris cluster.
+- `hadoop.kerberos.principal`: Find the corresponding principal for the
hostname in the Doris cluster, such as `doris/``[email protected]`, and
verify with `klist -kt` using the keytab.
+- `yarn.resourcemanager.principal`: Go to the Yarn Resource Manager node and
retrieve this from `yarn-site.xml`. Verify the keytab of Yarn with `klist -kt`.
+- `hive.metastore.kerberos.principal`: Go to the Hive metadata service node
and retrieve this from `hive-site.xml`. Verify the keytab of Hive with `klist
-kt`.
+- `hadoop.security.authentication`: Enable Hadoop Kerberos authentication.
+
+Place the `krb5.conf` file and the `keytab` authentication file on all BE and FE nodes, and make sure the path of the `keytab` file is consistent across nodes. By default, the `krb5.conf` file is placed at `/etc/krb5.conf`. Also, confirm that the JVM parameter `-Djava.security.krb5.conf` and the environment variable `KRB5_CONFIG` point to the correct path of the `krb5.conf` file.
+
+2. After the configuration is completed, if you cannot locate the issue in the
FE and BE logs, you can enable Kerberos debugging.
+
+- On all FE and BE nodes, locate the `conf/fe.conf` and `conf/be.conf` files
in the deployment path.
+- After finding the configuration files, set the JVM parameter
`-Dsun.security.krb5.debug=true` in the `JAVA_OPTS` variable to enable Kerberos
debugging.
+- The FE Kerberos authentication debug information can be found in the FE log
path `log/fe.out`, and the BE Kerberos authentication debug information can be
found in the BE log path `log/be.out`.
+- For solutions to more common issues, refer to
[FAQ](https://doris.apache.org/docs/dev/lakehouse/faq/).
+
+### Best practice
+
+Example:
+
+```SQL
+CREATE CATALOG hive_krb PROPERTIES (
+ 'type'='hms',
+ 'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
+ 'hive.metastore.sasl.enabled' = 'true',
+ 'hive.metastore.kerberos.principal' = 'your-hms-principal',
+ 'hadoop.security.authentication' = 'kerberos',
+ 'hadoop.kerberos.keytab' = '/your-keytab-filepath/your.keytab',
+ 'hadoop.kerberos.principal' = '[email protected]',
+ 'yarn.resourcemanager.principal' = 'your-rm-principal'
+);
+```
+
+Example of HDFS HA information and Kerberos authentication information:
+
+```SQL
+CREATE CATALOG hive_krb_ha PROPERTIES (
+ 'type'='hms',
+ 'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
+ 'hive.metastore.sasl.enabled' = 'true',
+ 'hive.metastore.kerberos.principal' = 'your-hms-principal',
+ 'hadoop.security.authentication' = 'kerberos',
+ 'hadoop.kerberos.keytab' = '/your-keytab-filepath/your.keytab',
+ 'hadoop.kerberos.principal' = '[email protected]',
+ 'yarn.resourcemanager.principal' = 'your-rm-principal',
+ 'dfs.nameservices'='your-nameservice',
+ 'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
+ 'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:8088',
+ 'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:8088',
+
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
+);
+```
+
+## Hive transactional table
+
+Hive transactional tables are tables that support ACID semantics in Hive. For
more details, see:
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
+
+### Support for Hive transactional tables
+
+| Table Type | Supported Operations in Hive | Hive Table Properties | Supported Hive Versions |
+| --- | --- | --- | --- |
+| Full-ACID Transactional Table | Supports Insert, Update, Delete operations | 'transactional'='true' | 3.x, 2.x. In 2.x, loading can only be performed after major compaction is executed in Hive. |
+| Insert-Only Transactional Table | Only supports Insert operations | 'transactional'='true', 'transactional_properties'='insert_only' | 3.x, 2.x |
+
+### Current restrictions
+Currently, scenarios involving Original Files are not supported. After a table
is converted into a transactional table, subsequent newly written data files
will follow the schema of Hive Transactional table. However, existing data
files will not be converted to the schema of the Transactional table. These
files are referred to as Original Files.
diff --git a/versioned_docs/version-2.0/table-design/best-practice.md
b/versioned_docs/version-2.0/table-design/best-practice.md
index 5c4333bd7b8c..216ffcdcffd7 100644
--- a/versioned_docs/version-2.0/table-design/best-practice.md
+++ b/versioned_docs/version-2.0/table-design/best-practice.md
@@ -27,156 +27,451 @@ under the License.
# Best Practices
-## 1 tabulation
+## Data models
-### 1.1 Data Model Selection
+> Doris arranges data in three models: DUPLICATE KEY model, UNIQUE KEY model,
and AGGREGATE KEY model.
-Doris data model is currently divided into three categories: AGGREGATE KEY,
UNIQUE KEY, DUPLICATE KEY. Data in all three models are sorted by KEY.
+:::tip
+**Recommendations**
-1.1.1. AGGREGATE KEY
+Because the data model is determined and **immutable** during table creation,
it is important to select the most suitable data model.
-When AGGREGATE KEY is the same, old and new records are aggregated. The
aggregation functions currently supported are SUM, MIN, MAX, REPLACE.
+1. The Duplicate Key model is a good fit for ad-hoc queries across any
dimensions. It cannot leverage the benefits of pre-aggregation, but it is also
not constrained by the limitations of an aggregation model, so it can take
advantage of columnar storage (only reading the relevant columns without the
need to read all key columns).
+2. The Aggregate Key model, through pre-aggregation, can largely reduce the amount of data scanned and the computational workload for aggregate queries. It is particularly suitable for reporting queries with fixed patterns. However, this model is not friendly to count(*) queries. Additionally, since the aggregation method for each value column is fixed, users should pay extra attention to semantic correctness when performing other types of aggregate queries.
+3. The Unique Key model is designed for scenarios that require unique primary
key constraints. It can ensure the uniqueness of primary keys. The downside is
that it cannot reap the benefits brought by materialization and
pre-aggregation. For users with high-performance requirements for aggregate
queries, it is recommended to use the Merge-on-Write feature of the Unique Key
model since Doris 1.2.
+4. Users with partial column update requirements can select from the following data models (a sketch of the second option follows this note):
+    1. Unique Key model (Merge-on-Write mode)
+    2. Aggregate Key model (aggregated by REPLACE_IF_NOT_NULL)
+:::
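+
+For the second option above, a minimal sketch of a table whose non-key columns use REPLACE_IF_NOT_NULL, so that a load carrying NULL for a column leaves the previously stored value untouched; the table and column names are hypothetical:
+
+```SQL
+-- Partial column updates via the Aggregate Key model
+CREATE TABLE user_profile_agg
+(
+    user_id BIGINT,
+    email   VARCHAR(100) REPLACE_IF_NOT_NULL,
+    phone   VARCHAR(20)  REPLACE_IF_NOT_NULL
+)
+AGGREGATE KEY(user_id)
+DISTRIBUTED BY HASH(user_id) BUCKETS 10;
+```
+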
-AGGREGATE KEY model can aggregate data in advance and is suitable for
reporting and multi-dimensional analysis business.
+### DUPLICATE KEY model
-```
-CREATE TABLE site_visit
+
+
+When only the sorting columns are specified, rows with the same key will not
be merged.
+
+This is applicable to analytical business scenarios where data does not
require pre-aggregation:
+
+- Analysis of raw data
+- Analysis of log or time series data where only new data is appended.
+
+**Best practice**
+
+```SQL
+-- For example, log analysis where new data is only appended and duplicate keys are allowed.
+CREATE TABLE session_data
(
-siteid INT,
-City: SMALLINT,
-username VARCHAR (32),
-pv BIGINT SUM DEFAULT '0'
+ visitorid SMALLINT,
+ sessionid BIGINT,
+ visittime DATETIME,
+ city CHAR(20),
+ province CHAR(20),
+ ip varchar(32),
+    browser CHAR(20),
+ url VARCHAR(1024)
)
-AGGREGATE KEY(siteid, city, username)
-DISTRIBUTED BY HASH(siteid) BUCKETS 10;
+DUPLICATE KEY(visitorid, sessionid) -- Used solely to specify sorting columns; rows with the same KEY will not be merged.
+DISTRIBUTED BY HASH(sessionid, visitorid) BUCKETS 10;
```
-1.1.2. UNIQUE KEY
+### AGGREGATE KEY model
-When UNIQUE KEY is the same, the new record covers the old record. Before
version 1.2, UNIQUE KEY implements the same REPLACE aggregation method as
AGGREGATE KEY, and they are essentially the same. We introduced a new
merge-on-write implementation for UNIQUE KEY since version 1.2, which have
better performance on many scenarios. Suitable for analytical business with
updated requirements.
+
-```
-CREATE TABLE sales_order
+Old and new records with the same AGGREGATE KEY will be aggregated. The
currently supported aggregation methods are as follows:
+
+1. SUM: calculates the sum by accumulating the values of multiple rows;
+2. REPLACE: replaces the value in the previously imported rows with the value
from the next batch of data;
+3. MAX: retains the maximum value;
+4. MIN: retains the minimum value;
+5. REPLACE_IF_NOT_NULL: replaces non-null values. Unlike REPLACE, it does not
replace null values;
+6. HLL_UNION: aggregates columns of HLL type using the HyperLogLog algorithm;
+7. BITMAP_UNION: aggregates columns of BITMAP type using bitmap union
aggregation;
+
+This is suitable for reporting and multi-dimensional analysis scenarios such
as:
+
+- Website traffic analysis
+- Multi-dimensional analysis of data reports.
+
+**Best practice**
+
+```SQL
+-- Example of website traffic analysis
+CREATE TABLE site_visit
(
-orderid BIGINT,
-status TINYINT,
-username VARCHAR (32),
-amount BIGINT DEFAULT '0'
+ siteid INT,
+ city SMALLINT,
+ username VARCHAR(32),
+    pv BIGINT SUM DEFAULT '0' -- PV calculation
)
-KEY (orderid) UNIT
-DISTRIBUTED BY HASH(orderid) BUCKETS 10;
+AGGREGATE KEY(siteid, city, username) -- Rows with the same KEY will be
merged, and non-key columns will be aggregated based on the specified
aggregation function.
+DISTRIBUTED BY HASH(siteid) BUCKETS 10;
```
-1.1.3. DUPLICATE KEY
+### UNIQUE KEY model
-Only sort columns are specified, and the same rows are not merged. It is
suitable for the analysis business where data need not be aggregated in advance.
+A new record replaces the old record that has the same UNIQUE KEY. Before Doris 1.2, the UNIQUE KEY model was implemented in the same way as the REPLACE aggregation of the AGGREGATE KEY model. Since Doris 1.2, we have introduced a Merge-on-Write implementation for the UNIQUE KEY model, which provides better performance for aggregate queries.
-```
-CREATE TABLE session_data
+This is suitable for analytical business scenarios that require updates, such
as:
+
+- Deduplicated order analysis
+- Real-time synchronization of inserts, updates, and deletes.
+
+**Best practice**
+
+```SQL
+-- Example of deduplicated order analysis
+CREATE TABLE sales_order
(
-visitorid SMALLINT,
-sessionid BIGINT,
-visit time DATETIME,
-City CHAR (20),
-province CHAR(20),
-ip. varchar (32),
-brower CHAR(20),
-url: VARCHAR (1024)
+ orderid BIGINT,
+ status TINYINT,
+ username VARCHAR(32),
+ amount BIGINT DEFAULT '0'
)
-DUPLICATE KEY (visitor time, session time)
-DISTRIBUTED BY HASH(sessionid, visitorid) BUCKETS 10;
+UNIQUE KEY(orderid) -- Rows of the same KEY will be merged
+DISTRIBUTED BY HASH(orderid) BUCKETS 10;
```
-### 1.2 Wide Table vs. Star Schema
-
-When the business side builds tables, in order to adapt to the front-end
business, they often do not distinguish between dimension information and
indicator information, and define the Schema as a large wide table, this
operation is actually not so friendly to the database, we recommend users to
use the star model.
+## Index
-* There are many fields in Schema, and there may be more key columns in the
aggregation model. The number of columns that need to be sorted in the import
process will increase.
-* Dimensional information updates are reflected in the whole table, and the
frequency of updates directly affects the efficiency of queries.
+> Indexes can facilitate quick filtering and searching of data. Currently,
Doris supports two types of indexes:
+>
+> 1. Built-in smart indexes, including prefix index and ZoneMap index.
+> 2. User-created secondary indexes, including inverted index, BloomFilter
index, Ngram BloomFilter index, and Bitmap index.
-In the process of using Star Schema, users are advised to use Star Schema to
distinguish dimension tables from indicator tables as much as possible.
Frequently updated dimension tables can also be placed in MySQL external
tables. If there are only a few updates, they can be placed directly in Doris.
When storing dimension tables in Doris, more copies of dimension tables can be
set up to improve Join's performance.
+### **Prefix index**
-### 1.3 Partitioning and Bucketing
+Prefix indexes are built-in indexes in the Aggregate, Unique, and Duplicate
data models. The underlying data storage is sorted and stored based on the
columns specified as the AGGREGATE KEY, UNIQUE KEY, or DUPLICATE KEY in their
respective table creation statements. Prefix index, built on top of the sorted
data, allows for quick data querying based on given prefix columns.
-Doris supports two-level partitioned storage. The first level is partition,
which currently supports both RANGE and LIST partition types, and the second
layer is HASH bucket.
+Prefix indexes are sparse indexes and cannot locate the exact row where a key
is present. Instead, they can only roughly identify the range where the key may
exist, and then use binary search algorithms to accurately locate the position
of the key.
-1.3.1. Partitioning
+:::tip
+**Recommendations**
-Partition is used to divide data into different intervals, which can be
logically understood as dividing the original table into multiple sub-tables.
Data can be easily managed by partition, for example, to delete data more
quickly.
+1. When creating a table, **the correct column order can greatly improve query efficiency**.
+    1. Since the column order is specified during table creation, a table can only have one prefix index. However, this may not be efficient enough for queries on columns that are not covered by the prefix index. In such cases, users can adjust the column order by creating materialized views, as sketched after this list.
+2. The first field in a prefix index should be the field that appears most frequently in query conditions, and it should be a high-cardinality field.
+    1. Bucketing field: it should have a relatively even data distribution and be frequently used in queries, preferably a high-cardinality field.
+    2. Int(4) + Int(4) + varchar(50): the prefix index length is only 28 bytes.
+    3. Int(4) + varchar(50) + Int(4): the prefix index length is only 24 bytes.
+    4. varchar(10) + varchar(50): the prefix index length is only 30 bytes.
+    5. Prefix index (36 bytes): queries filtering on the first field deliver the best performance. If a varchar field is encountered, the prefix index is truncated after the first 20 bytes of that field.
+    6. If possible, include the most frequently used query fields in the prefix index. Otherwise, specify them as bucketing fields.
+3. The field lengths in the prefix index should be as explicit as possible
because Doris can only utilize the prefix index for the first 36 bytes.
+4. If it is difficult to design a partitioning, bucketing, and prefix index
strategy for your data range, consider introducing inverted indexes for
acceleration.
+:::
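+
+A minimal sketch of adjusting the column order for a different query pattern with a single-table materialized view, reusing the `session_data` table from the Duplicate Key example above; the view name is hypothetical:
+
+```SQL
+-- Serve queries that filter by city first, without changing the base table's column order
+CREATE MATERIALIZED VIEW mv_session_by_city AS
+SELECT city, visitorid, sessionid, visittime, province, ip, browser, url
+FROM session_data;
+```
+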
-1.3.1.1. Range Partitioning
+### **ZoneMap index**
-In business, most users will choose to partition on time, which has the
following advantages:
+ZoneMap index is index information automatically maintained on a per-column
basis in the columnar storage format. It includes information such as Min/Max
values and the number of Null values. During data querying, the ZoneMap index
is utilized to select the data range to scan based on the filtered fields using
range conditions.
-* Differentiable heat and cold data
-* Availability of Doris Hierarchical Storage (SSD + SATA)
+For example, when filtering the "age" field with the following query statement:
-1.3.1.2. List Partitioning
+```SQL
+SELECT * FROM table WHERE age > 0 and age < 51;
+```
-In business,, users can select cities or other enumeration values for
partition.
+If the Short Key Index is not hit, the ZoneMap index will be used to narrow down the approximate range of data that needs to be scanned based on the query condition on the "age" field. This reduces the number of pages that need to be scanned.
-1.3.2. Hash Bucketing
+### **Inverted index**
-The data is divided into different buckets according to the hash value.
+Doris supports inverted indexes since version 2.0.0. Inverted index can be
used for full-text searches on text data and range queries on regular numeric
and date types. It enables fast filtering of rows that meet the conditions from
massive amounts of data.
-* It is suggested that columns with large differentiation should be used as
buckets to avoid data skew.
-* In order to facilitate data recovery, it is suggested that the size of a
single bucket should not be too large and should be kept within 10GB.
Therefore, the number of buckets should be considered reasonably when building
tables or increasing partitions, among which different partitions can specify
different buckets.
+**Best practice**
-### 1.4 Sparse Index and Bloom Filter
+```SQL
+-- Inverted index can be specified during table creation or added later. This
is an example of specifying it during table creation:
+CREATE TABLE table_name
+(
+    columns_definition,
+ INDEX idx_name1(column_name1) USING INVERTED [PROPERTIES("parser" =
"english|unicode|chinese")] [COMMENT 'your comment']
+ INDEX idx_name2(column_name2) USING INVERTED [PROPERTIES("parser" =
"english|unicode|chinese")] [COMMENT 'your comment']
+ INDEX idx_name3(column_name3) USING INVERTED [PROPERTIES("parser" =
"chinese", "parser_mode" = "fine_grained|coarse_grained")] [COMMENT 'your
comment']
+ INDEX idx_name4(column_name4) USING INVERTED [PROPERTIES("parser" =
"english|unicode|chinese", "support_phrase" = "true|false")] [COMMENT 'your
comment']
+    INDEX idx_name5(column_name4) USING INVERTED [PROPERTIES("char_filter_type" = "char_replace", "char_filter_pattern" = "._", "char_filter_replacement" = " ")] [COMMENT 'your comment']
+    INDEX idx_name6(column_name5) USING INVERTED [PROPERTIES("char_filter_type" = "char_replace", "char_filter_pattern" = "._")] [COMMENT 'your comment']
+)
+table_properties;
-Doris stores the data in an orderly manner, and builds a sparse index for
Doris on the basis of ordered data. The index granularity is block (1024 rows).
+-- Example: keyword matching in full-text searches, implemented by MATCH_ANY or MATCH_ALL
+SELECT * FROM table_name WHERE column_name MATCH_ANY | MATCH_ALL 'keyword1
...';
+```
-Sparse index chooses fixed length prefix in schema as index content, and Doris
currently chooses 36 bytes prefix as index.
+:::tip
+**Recommendations**
-* When building tables, it is suggested that the common filter fields in
queries should be placed in front of Schema. The more distinguishable the query
fields are, the more frequent the query fields are.
-* One particular feature of this is the varchar type field. The varchar type
field can only be used as the last field of the sparse index. The index is
truncated at varchar, so if varchar appears in front, the length of the index
may be less than 36 bytes. Specifically, you can refer to [data
model](./data-model.md), [ROLLUP and query](./hit-the-rollup.md).
-* In addition to sparse index, Doris also provides bloomfilter index.
Bloomfilter index has obvious filtering effect on columns with high
discrimination. If you consider that varchar cannot be placed in a sparse
index, you can create a bloomfilter index.
+1. If it is difficult to design a partitioning, bucketing, and prefix index
strategy for your data range, consider introducing inverted indexes for
acceleration.
+:::
-### 1.5 Rollup
-Rollup can essentially be understood as a physical index of the original
table. When creating Rollup, only some columns in Base Table can be selected as
Schema. The order of fields in Schema can also be different from that in Base
Table.
+:::caution
+**Restrictions**
-Rollup can be considered in the following cases:
+1. Different data models have different restrictions on inverted index.
+ 1. Aggregate KEY model: only allows inverted index for Key columns
+ 2. Unique KEY model: allows inverted index for any column after enabling
Merge-on-Write
+ 3. Duplicate KEY model: allows inverted index for any column
+:::
-1.5.1. Low ratio of data aggregation in the Base Table
+### **BloomFilter index**
-This is usually due to the fact that Base Table has more differentiated
fields. At this point, you can consider selecting some columns and establishing
Rollup.
+Doris supports adding BloomFilter indexes to fields with high value
distinctiveness, making it suitable for scenarios that involve equivalence
queries on columns with high cardinality.
-For the `site_visit'table:
+**Best practice**
+```SQL
+-- Example: add "bloom_filter_columns"="k1,k2,k3" in the PROPERTIES of the
table creation statement.
+-- To create BloomFilter index for saler_id and category_id in the table.
+CREATE TABLE IF NOT EXISTS sale_detail_bloom (
+    sale_date date NOT NULL COMMENT "Sale date",
+ customer_id int NOT NULL COMMENT "Customer ID",
+ saler_id int NOT NULL COMMENT "Saler ID",
+ sku_id int NOT NULL COMMENT "SKU ID",
+ category_id int NOT NULL COMMENT "Category ID",
+ sale_count int NOT NULL COMMENT "Sale count",
+ sale_price DECIMAL(12,2) NOT NULL COMMENT "Sale price",
+ sale_amt DECIMAL(20,2) COMMENT "Sale amount"
+)
+Duplicate KEY(sale_date, customer_id,saler_id,sku_id,category_id)
+DISTRIBUTED BY HASH(saler_id) BUCKETS 10
+PROPERTIES (
+"bloom_filter_columns"="saler_id,category_id"
+);
```
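+
+If the table already exists, the BloomFilter columns can also be adjusted afterwards through the table properties; a minimal sketch, assuming the `sale_detail_bloom` table above:
+
+```SQL
+-- Add or change BloomFilter index columns on an existing table
+ALTER TABLE sale_detail_bloom SET ("bloom_filter_columns" = "saler_id,category_id");
+```
+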
-site -u visit (siteid, city, username, pv)
-```
-
-Siteid may lead to a low degree of data aggregation. If business parties often
base their PV needs on city statistics, they can build a city-only, PV-based
rollup:
-```
-ALTER TABLE site_visit ADD ROLLUP rollup_city(city, pv);
+:::caution
+**Restrictions**
+
+1. BloomFilter indexes are not supported for columns of type Tinyint, Float,
and Double.
+2. BloomFilter indexes can only accelerate filtering using "in" and "="
operators.
+3. BloomFilter indexes must be built on high-cardinality columns (above 5000)
in query conditions that involve "in" or "=" operators.
+ 1. BloomFilter indexes are suitable for non-prefix filtering.
+ 2. Queries will filter based on the high-frequency values in the column,
and the filtering conditions are mostly "in" and "=".
+ 3. Unlike Bitmap indexes, BloomFilter indexes are suitable for
high-cardinality columns, such as UserID. If created on low-cardinality columns
like "gender", each block will contain almost all values, rendering the
BloomFilter index meaningless.
+ 4. It is suitable for cases with data cardinality around half of the total
range.
+    5. For equality (=) queries on high-cardinality columns, such as ID numbers, BloomFilter indexes can greatly accelerate performance.
+:::
+
+### **Ngram BloomFilter index**
+
+Since 2.0.0, Doris has introduced the NGram BloomFilter index to improve the
performance of "**LIKE**" queries.
+
+**Best practice**
+
+```SQL
+-- Example of creating NGram BloomFilter index in table creation statement
+CREATE TABLE `nb_table` (
+ `siteid` int(11) NULL DEFAULT "10" COMMENT "",
+ `citycode` smallint(6) NULL COMMENT "",
+ `username` varchar(32) NULL DEFAULT "" COMMENT "",
+ INDEX idx_ngrambf (`username`) USING NGRAM_BF PROPERTIES("gram_size"="3",
"bf_size"="256") COMMENT 'username ngram_bf index'
+) ENGINE=OLAP
+AGGREGATE KEY(`siteid`, `citycode`, `username`) COMMENT "OLAP"
+DISTRIBUTED BY HASH(`siteid`) BUCKETS 10;
+
+-- PROPERTIES("gram_size"="3", "bf_size"="256"), representing the number of
grams and the byte size of the BloomFilter
+-- The number of grams is determined according to the query cases and is
typically set to the length of the majority of query strings. The number of
bytes in the BloomFilter can be determined after testing. Generally, a larger
number of bytes leads to better filtering results, and it is recommended to
start with a value of 256 for testing and evaluating the effectiveness.
However, it's important to note that a larger number of bytes also increases
the storage cost of the index.
+-- With high data cardinality, there is no need to set a large BloomFilter
size. Conversely, with low data cardinality, increase the BloomFilter size to
enhance filtering efficiency.
```
-1.5.2. The prefix index in Base Table cannot be hit
+:::caution
+**Restrictions**
-Generally, the way Base Table is constructed cannot cover all query modes. At
this point, you can consider adjusting the column order and establishing Rollup.
+1. NGram BloomFilter index only supports string columns.
+2. NGram BloomFilter indexes and BloomFilter indexes are mutually exclusive,
meaning that only one of them can be set for the same column.
+3. The sizes of the NGram and the BloomFilter can both be optimized based on
the actual situation. If the NGram size is relatively small, you may increase
the BloomFilter size.
+4. For data at the scale of billions or above, if there is a need for fuzzy
matching, it is recommended to use inverted indexes or NGram BloomFilter.
+:::
-Database Session
+### **Bitmap index**
-```
-session -u data (visitorid, sessionid, visittime, city, province, ip, browser,
url)
-```
+To accelerate data queries, Doris supports users in adding Bitmap indexes to
certain fields. This is suitable for scenarios involving equivalence or range
queries on columns with lower cardinality.
-In addition to visitorid analysis, there are browser and province analysis
cases, Rollup can be established separately.
+**Best practice**
+```SQL
+-- Example: create Bitmap index for siteid on bitmap_table
+CREATE INDEX [IF NOT EXISTS] bitmap_index_name ON
+bitmap_table (siteid)
+USING BITMAP COMMENT 'bitmap_siteid';
```
-ALTER TABLE session_data ADD ROLLUP rollup_browser(browser,province,ip,url)
DUPLICATE KEY(browser,province);
-```
-
-## Schema Change
-
-Users can modify the Schema of an existing table through the Schema Change
operation, currently Doris supports the following modifications:
-- Adding and deleting columns
-- Modify column types
-- Reorder columns
-- Adding or removing index
+:::caution
+**Restrictions**
+
+1. Bitmap indexes can only be created on a single column.
+2. Bitmap indexes can be applied to all columns in the `Duplicate` and
`Unique` Key models, as well as the key columns in the `Aggregate` Key model.
+3. Bitmap indexes support the following data types:
+ 1. `TINYINT`
+ 2. `SMALLINT`
+ 3. `INT`
+ 4. `BIGINT`
+ 5. `CHAR`
+ 6. `VARCHAR`
+ 7. `DATE`
+ 8. `DATETIME`
+ 9. `LARGEINT`
+ 10. `DECIMAL`
+ 11. `BOOL`
+4. Bitmap indexes only take effect under Segment V2. The storage format of
tables with Bitmap indexes will be automatically converted to V2 format by
default.
+5. Bitmap indexes should be constructed within a certain cardinality range. It
is not suitable for extremely high or low cardinality cases.
+    1. It is recommended for columns with a cardinality between 100 and 100,000, such as an occupation or city field. If the duplication rate is too high, Bitmap indexes offer no significant advantage over other types of indexes. If the duplication rate is too low, Bitmap indexes consume considerable space and can hurt performance. Specific types of queries, such as count, OR, and AND operations, only require bitwise operations.
+ 2. Bitmap indexes are more suitable for orthogonal queries.
+:::
+
+## Field type
+
+Doris supports various field types, including precise deduplication with
BITMAP, fuzzy deduplication with HLL, semi-structured data types such as
ARRAY/MAP/JSON, as well as common numeric, string, and time types.
+
+:::tip
+**Recommendations**
+
+1. VARCHAR
+ 1. Variable-length string with a length range of 1-65533 bytes. It is
stored in UTF-8 encoding, where English characters typically occupy 1 byte.
+ 2. There is often a misunderstanding about the performance difference
between varchar(255) and varchar(65533). If the data stored in both cases is
the same, the performance will be the same as well. When creating a table, if
you are unsure about the maximum length of the field, it is recommended to use
varchar(65533) to prevent import errors caused by excessively long strings.
+2. STRING
+ 1. Variable-length string with a default size of 1048576 bytes (1MB), which
can be increased to 2147483643 bytes (2GB). It is stored in UTF-8 encoding,
where English characters typically occupy 1 byte.
+ 2. It can only be used in value columns but not key columns or partitioning
columns.
+ 3. It is suitable for storing large text content. However, if such a
requirement does not exist, it is recommended to use VARCHAR. STRING columns
have limitations as they cannot be used in key columns or partitioning columns.
+3. Numeric fields: choose the appropriate data type based on the required precision. There are no special restrictions on this.
+4. Time fields: note that if high precision is required (timestamps accurate to milliseconds or finer), you need to declare the column as datetime(6). Otherwise, sub-second precision is not supported by default.
+5. It is recommended to use the JSON data type instead of a string type for storing JSON data. A sketch covering items 4 and 5 follows this note.
+:::
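+
+A minimal sketch illustrating items 4 and 5 above; the table and column names are hypothetical:
+
+```SQL
+-- High-precision timestamps and native JSON storage
+CREATE TABLE event_log
+(
+    event_id   BIGINT,
+    event_time DATETIME(6) COMMENT "Microsecond-precision timestamp",
+    payload    JSON COMMENT "Event payload stored as JSON instead of a string"
+)
+DUPLICATE KEY(event_id)
+DISTRIBUTED BY HASH(event_id) BUCKETS 10;
+```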
+
+## Create table
+
+
+
+Considerations in creating a table include the setting of data partitions and
buckets in addition to data model, index, and field types.
+
+**Best practice**
+
+```SQL
+-- Take Merge-on-Write tables in the Unique Key model as an example:
+-- Merge-on-Write in the Unique Key model is implemented differently from the Aggregate Key model, and its query performance is close to that of the Duplicate Key model.
+-- In use cases requiring primary key constraints, it delivers much better query performance than the Aggregate-Key-based implementation of the Unique Key model, especially in aggregate queries and queries that involve filtering a large amount of data using indexes.
+
+-- For non-partitioned tables
+CREATE TABLE IF NOT EXISTS tbl_unique_merge_on_write
+(
+    `user_id` LARGEINT NOT NULL COMMENT "User ID",
+ `username` VARCHAR(50) NOT NULL COMMENT "Username",
+ `register_time` DATE COMMENT "User registration time",
+ `city` VARCHAR(20) COMMENT "User city",
+ `age` SMALLINT COMMENT "User age",
+ `sex` TINYINT COMMENT "User gender",
+ `phone` LARGEINT COMMENT "User phone number",
+ `address` VARCHAR(500) COMMENT "User address"
+)
+UNIQUE KEY(`user_id`, `username`)
+-- Data volume of 3~5G
+DISTRIBUTED BY HASH(`user_id`) BUCKETS 10
+PROPERTIES (
+-- In Doris 1.2.0, as a new feature, Merge-on-Write is disabled by default.
Users can enable it by adding the following property.
+"enable_unique_key_merge_on_write" = "true"
+);
+
+-- For partitioned tables
+CREATE TABLE IF NOT EXISTS tbl_unique_merge_on_write_p
+(
+    `user_id` LARGEINT NOT NULL COMMENT "User ID",
+ `username` VARCHAR(50) NOT NULL COMMENT "Username",
+ `register_time` DATE COMMENT "User registration time",
+ `city` VARCHAR(20) COMMENT "User city",
+ `age` SMALLINT COMMENT "User age",
+ `sex` TINYINT COMMENT "User gender",
+ `phone` LARGEINT COMMENT "User phone number",
+ `address` VARCHAR(500) COMMENT "User address"
+)
+UNIQUE KEY(`user_id`, `username`, `register_time`)
+PARTITION BY RANGE(`register_time`) (
+ PARTITION p00010101_1899 VALUES [('0001-01-01'), ('1900-01-01')),
+ PARTITION p19000101 VALUES [('1900-01-01'), ('1900-01-02')),
+ PARTITION p19000102 VALUES [('1900-01-02'), ('1900-01-03')),
+ PARTITION p19000103 VALUES [('1900-01-03'), ('1900-01-04')),
+ PARTITION p19000104_1999 VALUES [('1900-01-04'), ('2000-01-01')),
+ FROM ("2000-01-01") TO ("2022-01-01") INTERVAL 1 YEAR,
+ PARTITION p30001231 VALUES [('3000-12-31'), ('3001-01-01')),
+ PARTITION p99991231 VALUES [('9999-12-31'), (MAXVALUE))
+)
+-- Data volume of 3~5G
+DISTRIBUTED BY HASH(`user_id`) BUCKETS 10
+PROPERTIES (
+-- In Doris 1.2.0, as a new feature, Merge-on-Write is disabled by default.
Users can enable it by adding the following property.
+"enable_unique_key_merge_on_write" = "true",
+-- The unit for dynamic partition scheduling can be specified as HOUR, DAY,
WEEK, MONTH, or YEAR.
+"dynamic_partition.time_unit" = "MONTH",
+-- The starting offset for dynamic partitioning is specified as a negative
number. Depending on the value of the time_unit, it uses the current day
(week/month) as the reference point, partitions prior to this offset will be
deleted (TTL). If not specified, the default value is -2147483648, indicating
that historical partitions will not be deleted.
+"dynamic_partition.start" = "-3000",
+-- The ending offset for dynamic partitioning is specified as a positive
number. Depending on the value of the time_unit, it uses the current day
(week/month) as the reference point. Create the corresponding partitions of the
specified range in advance.
+"dynamic_partition.end" = "10",
+-- The prefix for names of the dynamically created partitions (required).
+"dynamic_partition.prefix" = "p",
+-- The number of buckets corresponding to the dynamically created partitions.
+"dynamic_partition.buckets" = "10",
+"dynamic_partition.enable" = "true",
+-- The following is the number of replicas corresponding to dynamically
created partitions. If not specified, the default value will be the replication
factor specified when creating the table, which is typically 3.
+"dynamic_partition.replication_num" = "3",
+"replication_num" = "3"
+);
+
+-- View existing partitions
+-- The actual number of created partitions is determined by a combination of
dynamic_partition.start, dynamic_partition.end, and the settings of PARTITION
BY RANGE.
+show partitions from tbl_unique_merge_on_write_p;
+```
-For details, please refer to [Schema
Change](../advanced/alter-table/schema-change.md)
\ No newline at end of file
+:::tip
+**Recommendations**
+
+1. The database name should be in lowercase, separated by underscores (_) with
a maximum length of 62 bytes.
+2. Table names are case-sensitive and should be in lowercase, separated by
underscores (_) with a maximum length of 64 bytes.
+3. Manual bucketing is preferred over auto bucketing. It is recommended to
partition and bucket your data based on your specific data volume, as this can
improve data writing and query performance. Auto bucketing can result in a
large number of tablets and an abundance of small files.
+4. For data volumes ranging from 10 million to 200 million rows, you can
directly set a bucketing strategy and skip the partition configurations. (Doris
internally has default partitions for the table).
+5. In **time-series scenarios**, it is recommended to include `"compaction_policy" = "time_series"` in the table properties when creating the table (see the sketch after this note). This effectively reduces the write amplification of compaction, especially in continuous data ingestion scenarios. Note that it should be used in conjunction with inverted indexes.
+:::
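+
+A minimal sketch of a time-series table with this property; the table definition is hypothetical and abbreviated:
+
+```SQL
+-- Append-only metrics table using the time_series compaction policy
+CREATE TABLE metrics_ts
+(
+    ts           DATETIME,
+    device_id    BIGINT,
+    metric       VARCHAR(64),
+    metric_value DOUBLE
+)
+DUPLICATE KEY(ts, device_id)
+DISTRIBUTED BY HASH(device_id) BUCKETS 10
+PROPERTIES (
+    "compaction_policy" = "time_series"
+);
+```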
+
+
+:::caution
+**Restrictions**
+
+1. The database character set should be specified as UTF-8 since only UTF-8 is
supported.
+2. The replication factor for tables must be 3 (if not specified, it defaults
to 3).
+3. The data volume of an individual tablet (**Tablet Count = Partition Count
\* Bucket Count \* Replication Factor**) theoretically has no upper or lower
bounds, except for small tables (in the range of hundreds of megabytes to one
gigabyte), where it should be ensured to be within the range of 1 GB to 10 GB:
+ 1. If the data volume of an individual tablet is too small, it can result
in poor data aggregation performance and increased metadata management overhead.
+ 2. If the data volume is too large, it hinders replica migration,
synchronization, and increases the cost of schema changes or materialization
operations (these operations are retried at the granularity of a tablet).
+4. For data exceeding 500 million records, **partitioning and bucketing**
strategies must be implemented:
+ 1. **Recommendations for bucketing:**
+ 1. For large tables, each tablet should be in the range of 1 GB to 10 GB
to prevent the generation of too many small files.
+ 2. For dimension tables of approximately 100 megabytes, the number of
tablets should be controlled within the range of 3 to 5. This ensures a certain
level of concurrency without generating excessive small files.
+ 2. If partitioning is not feasible and the data grows rapidly without the
possibility of dynamic time-based partitioning, it is advisable to increase the
number of buckets to accommodate the data volume based on the data retention
period (180 days). It is still recommended to keep the size of each bucket
between 1 GB and 10 GB.
+ 3. Apply salting to the bucketing field and use the same salting strategy
for queries in order to leverage bucket pruning capabilities.
+ 4. Random bucketing:
+ 1. If an OLAP table does not have fields that need to be updated,
setting the data bucketing mode to RANDOM can avoid severe data skew. During
data ingestion, each batch of data is randomly assigned to a tablet for writing.
+ 2. When the bucketing mode for a table is set to RANDOM, since there is
no bucketing column, querying the table will scan all buckets in the hit
partitions instead of querying specific buckets based on the values of the
bucketing column. This setting is suitable for overall aggregation and analysis
queries rather than high-concurrency point queries.
+ 3. If an OLAP table has a random distribution of data, setting the
`load_to_single_tablet` parameter to true during data ingestion allows each
task to write to a single tablet. This improves concurrency and throughput
during large-scale data ingestion. It can also reduce the write amplification
caused by data ingestion and compaction and ensure cluster stability.
+ 5. Dimension tables, which grow slowly, can use a single partition and
apply bucketing based on commonly used query conditions (where the data
distribution of the bucketing field is relatively even).
+ 6. Fact tables.
+5. If the bucketing field exhibits more than 30% data skew, it is recommended
to avoid using the Hash bucketing strategy and instead use the RANDOM bucketing
strategy.
+6. Dynamic partitioning should not be used for tables with fewer than 20 million rows, because it automatically creates partitions, and for small tables it might unnecessarily create a large number of partitions and buckets.
+7. For scenarios where there is a large amount of historical partitioned data
but the historical data is relatively small, unbalanced, or queried
infrequently, you can use the following approach to place the data in special
partitions. You can create historical partitions for historical data of small
sizes (e.g., yearly partitions, monthly partitions). For example, you can
create historical partitions for data `FROM ("2000-01-01") TO ("2022-01-01")
INTERVAL 1 YEAR`:
+ 1. (
+
+ 2. PARTITION p00010101_1899 VALUES [('0001-01-01'), ('1900-01-01')),
+
+ 3. PARTITION p19000101 VALUES [('1900-01-01'), ('1900-01-02')),
+
+ 4. ...
+
+ 5. PARTITION p19000104_1999 VALUES [('1900-01-04'), ('2000-01-01')),
+
+ 6. FROM ("2000-01-01") TO ("2022-01-01") INTERVAL 1 YEAR,
+
+ 7. PARTITION p30001231 VALUES [('3000-12-31'), ('3001-01-01')),
+
+ 8. PARTITION p99991231 VALUES [('9999-12-31'), (MAXVALUE))
+
+ 9. )
+8. The number of single-table materialized views should not exceed 6:
+ 1. Single-table materialized views are built in real time.
+ 2. On the Unique Key model, materialized views can only be used to reorder
keys and cannot be used for data aggregation, as the aggregation model for the
Unique Key model is Replace.
+:::
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]