This is an automated email from the ASF dual-hosted git repository.
luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 74602a782dc Add three blogs (#279)
74602a782dc is described below
commit 74602a782dcfa85bf74859a215100b8b62608200
Author: Hu Yanjun <[email protected]>
AuthorDate: Wed Aug 2 18:00:33 2023 +0800
Add three blogs (#279)
---
blog/Data_Update.md | 226 +++++++++++++++++++++++++++++++++++++++++
blog/Poly.md | 105 +++++++++++++++++++
blog/Zhihu.md | 140 +++++++++++++++++++++++++
static/images/Dataupdate_1.png | Bin 0 -> 146321 bytes
static/images/Dataupdate_2.png | Bin 0 -> 40204 bytes
static/images/Dataupdate_3.png | Bin 0 -> 30428 bytes
static/images/Poly_1.png | Bin 0 -> 286095 bytes
static/images/Poly_2.png | Bin 0 -> 413181 bytes
static/images/Poly_3.png | Bin 0 -> 316024 bytes
static/images/Zhihu_1.png | Bin 0 -> 176681 bytes
static/images/Zhihu_2.png | Bin 0 -> 139038 bytes
static/images/Zhihu_3.png | Bin 0 -> 192461 bytes
static/images/Zhihu_4.png | Bin 0 -> 158507 bytes
static/images/Zhihu_5.png | Bin 0 -> 65366 bytes
static/images/Zhihu_6.png | Bin 0 -> 49858 bytes
static/images/Zhihu_7.png | Bin 0 -> 96180 bytes
static/images/Zhihu_8.png | Bin 0 -> 80053 bytes
17 files changed, 471 insertions(+)
diff --git a/blog/Data_Update.md b/blog/Data_Update.md
new file mode 100644
index 00000000000..d8bbd04511f
--- /dev/null
+++ b/blog/Data_Update.md
@@ -0,0 +1,226 @@
+---
+{
+ 'title': 'Is Your Latest Data Really the Latest? Check the Data Update Mechanism of Your Database',
+ 'summary': "This is about how to support both row update and partial column update in a database in a way that is simple to execute and reliable in guaranteeing data quality.",
+ 'date': '2023-07-24',
+ 'author': 'Apache Doris',
+ 'tags': ['Tech Sharing'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+In databases, a data update means adding, deleting, or modifying data. Timely data updates are an important part of high-quality data services.
+
+Technically speaking, there are two types of data updates: you either update a whole row (**Row Update**) or just update part of the columns (**Partial Column Update**). Many databases support both of them, but in different ways. This post is about an approach that is simple to execute and reliable in guaranteeing data quality.
+
+As an open source analytic database, Apache Doris supports both Row Update and
Partial Column Update with one data model: the **Unique Key Model**. It is
where you put data that doesn't need to be aggregated. In the Unique Key Model,
you can specify one column or the combination of several columns as the Unique
Key (a.k.a. Primary Key). For one Unique Key, there will always be one row of
data: the newly ingested data record replaces the old. That's how data updates
work.
+
+The idea is straightforward, but in real-life implementation, it happens that the latest data does not arrive last or doesn't even get written at all, so I'm going to show you how Apache Doris implements data updates and avoids mix-ups with its Unique Key Model.
+
+
+
+## Row Update
+
+For data written to the Unique Key Model, Apache Doris adopts the **Upsert** semantics, meaning **Update or Insert**: if the new data record includes a Unique Key that already exists in the table, the new record replaces the old one; if it includes a brand-new Unique Key, the new record is inserted into the table as a whole. The Upsert operation provides high throughput while guaranteeing data reliability.
+
+**Example**:
+
+In the following table, the Unique Key is the combination of three columns:
`user_id, date, group_id`.
+
+```SQL
+mysql> desc test_table;
++-------------+--------------+------+-------+---------+-------+
+| Field | Type | Null | Key | Default | Extra |
++-------------+--------------+------+-------+---------+-------+
+| user_id | BIGINT | Yes | true | NULL | |
+| date | DATE | Yes | true | NULL | |
+| group_id | BIGINT | Yes | true | NULL | |
+| modify_date | DATE | Yes | false | NULL | NONE |
+| keyword | VARCHAR(128) | Yes | false | NULL | NONE |
++-------------+--------------+------+-------+---------+-------+
+```
+
+Execute `insert into` to write a data record. Since the table was empty, by the Upsert semantics, this adds a new row to the table.
+
+```SQL
+mysql> insert into test_table values (1, "2023-04-28", 2, "2023-04-28", "foo");
+Query OK, 1 row affected (0.05 sec)
+{'label':'insert_2fb45d1833db4348_b612b8791c97b467', 'status':'VISIBLE',
'txnId':'343'}
+
+mysql> select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+| 1 | 2023-04-28 | 2 | 2023-04-28 | foo |
++---------+------------+----------+-------------+---------+
+```
+
+Then insert two more data records, one of which has the same Unique Key as the previously inserted row. By the Upsert semantics, the new row replaces the old one of the same Unique Key, while the record with the new Unique Key is inserted.
+
+```SQL
+mysql> insert into test_table values (1, "2023-04-28", 2, "2023-04-29",
"foo"), (2, "2023-04-29", 2, "2023-04-29", "bar");
+Query OK, 2 rows affected (0.04 sec)
+{'label':'insert_7dd3954468aa4ac1_a63a3852e3573b4c', 'status':'VISIBLE',
'txnId':'344'}
+
+mysql> select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+| 2 | 2023-04-29 | 2 | 2023-04-29 | bar |
+| 1 | 2023-04-28 | 2 | 2023-04-29 | foo |
++---------+------------+----------+-------------+---------+
+```
+
+## Partial Column Update
+
+Besides row update, in many circumstances data analysts require the convenience of partial column updates. For example, in user portraits, they may want to update certain dimensions of their users in real time. Or, if they maintain a flat table built from various source tables, they may prefer partial column updates over complicated Join operations as a way of updating data.
+
+Apache Doris supports partial column update with the UPDATE statement. It filters out the rows that need to be modified, reads them, changes a few values, and writes the rows back to the table.
+
+**Example**:
+
+Suppose that there is an order table, in which the Order ID is the Unique Key.
+
+```SQL
++----------+--------------+-----------------+
+| order_id | order_amount | order_status |
++----------+--------------+-----------------+
+| 1 | 100 | Payment Pending |
++----------+--------------+-----------------+
+1 row in set (0.01 sec)
+```
+
+When the buyer completes the payment, Apache Doris should change the order
status of Order ID 1 from "Payment Pending" to "Delivery Pending". This is when
the Update command comes into play.
+
+```SQL
+mysql> UPDATE test_order SET order_status = 'Delivery Pending' WHERE order_id
= 1;
+Query OK, 1 row affected (0.11 sec)
+{'label':'update_20ae22daf0354fe0-b5aceeaaddc666c5', 'status':'VISIBLE',
'txnId':'33', 'queryId':'20ae22daf0354fe0-b5aceeaaddc666c5'}
+```
+
+This is the table after updating.
+
+```SQL
++----------+--------------+------------------+
+| order_id | order_amount | order_status |
++----------+--------------+------------------+
+| 1 | 100 | Delivery Pending |
++----------+--------------+------------------+
+1 row in set (0.01 sec)
+```
+
+The execution of the Update command consists of three steps in the system:
+
+- Step One: Read the row where Order ID = 1 (1, 100, 'Payment Pending')
+- Step Two: Modify the order status from "Payment Pending" to "Delivery
Pending" (1, 100, 'Delivery Pending')
+- Step Three: Insert the new row into the table
+
+
+
+The table is in the Unique Key Model, which means that for rows with the same Unique Key, only the last inserted one is retained, so this is what the table will finally look like:
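+
+```SQL
++----------+--------------+------------------+
+| order_id | order_amount | order_status     |
++----------+--------------+------------------+
+| 1        | 100          | Delivery Pending |
++----------+--------------+------------------+
+```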
+
+
+
+## Order of Data Updates
+
+So far this sounds simple, but in the real world, data updates might fail due to reasons such as data format errors, which can mess up the data writing order. The order of data updates matters more than you might imagine. For example, in financial transactions, a messed-up writing order can lead to transaction data losses, errors, or duplication, which in turn lead to bigger problems.
+
+Apache Doris provides two options for users to guarantee that their data is
updated in the correct order:
+
+**1. Update by the order of transaction commit**
+
+In Apache Doris, each data ingestion task is a transaction. Each successful ingestion task is given a data version, and data versions are strictly increasing. If the ingestion fails, the transaction is rolled back, and no new data version is generated.
+
+By default, the Upsert semantics follow the order of the transaction commits. If there are two data ingestion tasks involving the same Unique Key, with the first task generating data version 2 and the second data version 3, then according to the transaction commit order, data version 3 will replace data version 2.
+
+**2. Update by the user-defined order**
+
+In real-time data analytics, data updates often happen in high concurrency. It
is possible that there are multiple data ingestion tasks updating the same row,
but these tasks are committed in unknown order, so the last saved update
remains unknown, too.
+
+For example, here are two data updates, with "2023-04-30" and "2023-05-01" as the `modify_date`, respectively. If they are written into the system concurrently, but the "2023-05-01" one is successfully committed first and the other later, then the "2023-04-30" record will be saved due to its higher data version number, even though we know it is not the latest one.
+
+```Plain
+mysql> insert into test_table values (2, "2023-04-29", 2, "2023-05-01", "bbb");
+Query OK, 1 row affected (0.04 sec)
+{'label':'insert_e2daf8cea5524ee1_94e5c87e7bb74d67', 'status':'VISIBLE',
'txnId':'345'}
+
+mysql> insert into test_table values (2, "2023-04-29", 2, "2023-04-30", "aaa");
+Query OK, 1 row affected (0.03 sec)
+{'label':'insert_ef906f685a7049d0_b135b6cfee49fb98', 'status':'VISIBLE',
'txnId':'346'}
+
+mysql> select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+| 2 | 2023-04-29 | 2 | 2023-04-30 | aaa |
+| 1 | 2023-04-28 | 2 | 2023-04-29 | foo |
++---------+------------+----------+-------------+---------+
+```
+
+That's why in high-concurrency scenarios, Apache Doris allows data updates in user-defined order. Users can designate a column as the Sequence Column. In this way, the system identifies and saves the latest data version based on the value in the Sequence Column.
+
+**Example:**
+
+You can designate a Sequence Column by specifying the
`function_column.sequence_col` property upon table creation.
+
+```SQL
+CREATE TABLE test.test_table
+(
+ user_id bigint,
+ date date,
+ group_id bigint,
+ modify_date date,
+ keyword VARCHAR(128)
+)
+UNIQUE KEY(user_id, date, group_id)
+DISTRIBUTED BY HASH (user_id) BUCKETS 32
+PROPERTIES(
+ "function_column.sequence_col" = 'modify_date',
+ "replication_num" = "1",
+ "in_memory" = "false"
+);
+```
+
+Then check and see: the data record with the highest value in the Sequence Column is the one saved:
+
+```SQL
+mysql> insert into test_table values (2, "2023-04-29", 2, "2023-05-01", "bbb");
+Query OK, 1 row affected (0.03 sec)
+{'label':'insert_3aac37ae95bc4b5d_b3839b49a4d1ad6f', 'status':'VISIBLE',
'txnId':'349'}
+
+mysql> insert into test_table values (2, "2023-04-29", 2, "2023-04-30", "aaa");
+Query OK, 1 row affected (0.03 sec)
+{'label':'insert_419d4008768d45f3_a6912e584cf1b500', 'status':'VISIBLE',
'txnId':'350'}
+
+mysql> select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+| 2 | 2023-04-29 | 2 | 2023-05-01 | bbb |
+| 1 | 2023-04-28 | 2 | 2023-04-29 | foo |
++---------+------------+----------+-------------+---------+
+```
+
+## Conclusion
+
+Congratulations. Now you've gained an overview of how data updates are implemented in Apache Doris. With this knowledge, you can basically guarantee the efficiency and accuracy of data updating. But wait, there is so much more to it. As Apache Doris 2.0 is going to provide more powerful Partial Column Update capabilities, with improved execution of the Update statement and support for more complicated multi-table Join queries, I will show you how to take advantage of them in detail [...]
+
diff --git a/blog/Poly.md b/blog/Poly.md
new file mode 100644
index 00000000000..20916cbf13b
--- /dev/null
+++ b/blog/Poly.md
@@ -0,0 +1,105 @@
+---
+{
+ 'title': 'For Entry-Level Data Engineers: How to Build a Simple but Solid
Data Architecture',
+ 'summary': "This article aims to provide a reference for non-tech companies who are seeking to empower their business with data analytics.",
+ 'date': '2023-07-31',
+ 'author': 'Zhenwei Liu',
+ 'tags': ['Apache Doris'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+For Entry-Level Data Engineers: How to Build a Simple but Solid Data
Architecture
+
+
+
+This article aims to provide a reference for non-tech companies who are seeking to empower their business with data analytics. You will learn the basics of how to build an efficient and easy-to-use data system, and I will walk you through every aspect of it with a use case of Apache Doris, an MPP-based analytic data warehouse.
+
+## What You Need
+
+This case is about a ticketing service provider who wants a data platform that boasts quick processing, low maintenance costs, and ease of use, and I think they speak for the majority of entry-level database users.
+
+A prominent feature of ticketing services is the periodic spikes in ticket orders, you know, before the shows go on. So from time to time, the company has a huge amount of new data rushing in and requires real-time processing of it, so they can make timely adjustments during the short sales window. But at other times, they don't want to spend too much energy and money on maintaining the data system. Furthermore, for a beginner in digital operation who only requires basic analytic functions [...]
+
+## Simple Architecture
+
+The building blocks of this architecture are simple. You only need Apache
Flink and Apache Kafka for data ingestion, and Apache Doris as an analytic data
warehouse.
+
+
+
+Connecting data sources to the data warehouse is simple, too. The key component, Apache Doris, supports various data loading methods to fit different data sources. You can perform column mapping, transformation, and filtering during data loading to avoid duplicate collection of data. To ingest a table, users only need to add the table name to the configurations instead of writing a script themselves.
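+
+For illustration, here is a hedged sketch of a Routine Load job that subscribes to a Kafka topic and performs column mapping and filtering on the way in (the database, table, column, broker, and topic names are all made up, not the user's actual configuration):
+
+```SQL
+-- A minimal sketch, assuming hypothetical names.
+CREATE ROUTINE LOAD ticket_db.load_orders ON orders
+COLUMNS TERMINATED BY ",",
+COLUMNS (order_id, order_time, amount, channel),  -- column mapping
+WHERE amount > 0                                  -- filtering during loading
+PROPERTIES (
+    "desired_concurrent_number" = "3"
+)
+FROM KAFKA (
+    "kafka_broker_list" = "broker1:9092",
+    "kafka_topic" = "ticket_orders"
+);
+```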
+
+## Data Update
+
+Flink CDC was found to be the optimal choice if you are looking for higher stability in data ingestion. It also allows you to update dynamically changing tables in real time. The process includes the following steps (see the sketch after this list):
+
+- Configure Flink CDC for the source MySQL database, so that it allows dynamic
updating of the table management configurations (which you can think of as the
"metadata").
+- Create two CDC jobs in Flink, one to capture the changed data (the Forward
stream), the other to update the table management configurations (the Broadcast
stream).
+- Configure all tables of the source database at the Sink end (the output end of Flink CDC). When there is a newly added table in the source database, the Broadcast stream will be triggered to update the table management configurations. (You just need to configure the tables, instead of "creating" the tables.)
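+
+As a hedged illustration of the Forward stream only, a Flink SQL job that captures changes from one MySQL table and writes them into Doris could look roughly like this (hostnames, credentials, and table names are placeholders):
+
+```SQL
+-- A minimal sketch of a Flink CDC-to-Doris pipeline; all names are placeholders.
+CREATE TABLE orders_src (
+    order_id BIGINT,
+    order_amount INT,
+    order_status STRING,
+    PRIMARY KEY (order_id) NOT ENFORCED
+) WITH (
+    'connector' = 'mysql-cdc',
+    'hostname' = 'mysql-host',
+    'port' = '3306',
+    'username' = 'user',
+    'password' = 'pass',
+    'database-name' = 'ticket_db',
+    'table-name' = 'orders'
+);
+
+CREATE TABLE orders_sink (
+    order_id BIGINT,
+    order_amount INT,
+    order_status STRING
+) WITH (
+    'connector' = 'doris',
+    'fenodes' = 'doris-fe:8030',
+    'table.identifier' = 'ticket_db.orders',
+    'username' = 'user',
+    'password' = 'pass',
+    'sink.label-prefix' = 'orders_sync'
+);
+
+INSERT INTO orders_sink SELECT * FROM orders_src;
+```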
+
+
+
+## Layering of Data Warehouse
+
+Data flows from various sources into the data warehouse, where it is cleaned
and organized before it is ready for queries and analysis. The data processing
here is divided into five typical layers. Such layering simplifies the data
cleaning process because it provides a clear division of labor and makes things
easier to locate and comprehend.
+
+- **ODS**: This is the prep zone of the data warehouse. The unprocessed
original data is put in the [Unique Key
Model](https://doris.apache.org/docs/dev/data-table/data-model/#unique-model)
of Apache Doris, which can avoid duplication of data.
+- **DWD**: This layer cleans, formats, and de-identifies data to produce fact
tables. Every detailed data record is preserved. Data in this layer is also put
into the Unique Key Model.
+- **DWS**: This layer produces flat tables of a certain theme (order, user,
etc.) based on data from the DWD layer.
+- **ADS**: This layer auto-aggregates data, which is implemented by the
[Aggregate Key
Model](https://doris.apache.org/docs/dev/data-table/data-model/#aggregate-model)
of Apache Doris.
+- **DIM**: The DIM layer accommodates dimension data (in this case, data about
the theaters, projects, and show sessions, etc.), which is used in combination
with the order details.
+
+After the original data goes through these layers, it is available for queries
via one data export interface.
+
+## Reporting
+
+Like many non-tech businesses, the ticketing service provider needs a data warehouse mainly for reporting. They derive trends and patterns from all kinds of data reports, and then figure out ways to improve management efficiency and increase sales. Specifically, this is the information they observe in their reports:
+
+- **Statistical Reporting**: These are the most frequently used reports,
including sales reports by theater, distribution channel, sales representative,
and show.
+- **Agile Reporting**: These are reports developed for specific purposes, such
as daily and weekly project data reports, sales summary reports, GMV reports,
and settlement reports.
+- **Data Analysis**: This involves data such as membership orders, attendance
rates, and user portraits.
+- **Dashboarding**: This is to visually display sales data.
+
+
+
+These are all entry-level tasks in data analytics. One of the biggest burdens for the data engineers was to quickly develop new reports as the internal analysts required them. The [Aggregate Key Model](https://doris.apache.org/docs/dev/data-table/data-model#aggregate-model) of Apache Doris is designed for this.
+
+### Quick aggregation to produce reports on demand
+
+For example, suppose that analysts want a sales report by sales representative. Data engineers can produce that through simple configuration:
+
+1. Put the original data in the Aggregate Key Model
+2. Specify the sales representative ID column and the payment date column as
the Key columns, and the order amount column as the Value column
+
+Then, order amounts of the same sales representative within the specified period of time will be auto-aggregated. Bam! That's the report you need!
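+
+As a hedged sketch of what that configuration boils down to (the table and column names below are made up, not the user's actual schema):
+
+```SQL
+-- A minimal sketch of the Aggregate Key Model for this report; names are hypothetical.
+CREATE TABLE sales_report
+(
+    rep_id BIGINT,                        -- sales representative ID (Key)
+    pay_date DATE,                        -- payment date (Key)
+    order_amount BIGINT SUM DEFAULT "0"   -- auto-summed upon ingestion (Value)
+)
+AGGREGATE KEY(rep_id, pay_date)
+DISTRIBUTED BY HASH(rep_id) BUCKETS 10;
+```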
+
+According to the user, this whole process only takes them 10~30 minutes, depending on the complexity of the report required. So the Aggregate Key Model largely relieves data engineers of the pressure of report development.
+
+### Quick response to data queries
+
+Most data analysts just want their target data to be returned the second they need it. In this case, the user often leverages two capabilities of Apache Doris to achieve quick query response.
+
+Firstly, Apache Doris is famously fast in Join queries. So if you need to extract information across multiple tables, you are in good hands. Secondly, in data analysis, it often happens that analysts issue the same request repeatedly. For example, they frequently want to check the sales data of different theaters. In this scenario, Apache Doris allows you to create a [Materialized View](https://doris.apache.org/docs/dev/query-acceleration/materialized-view/), which means you pre-aggregat [...]
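+
+To make that concrete, here is a hedged example assuming a hypothetical `sales` table: the materialized view below pre-aggregates sales by theater, and Doris can transparently serve matching aggregation queries from it.
+
+```SQL
+-- A minimal sketch; the sales table and its columns are hypothetical.
+CREATE MATERIALIZED VIEW theater_sales AS
+SELECT theater_id, SUM(order_amount)
+FROM sales
+GROUP BY theater_id;
+```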
+
+## Conclusion
+
+This is an overview of a simple data architecture and how it can provide the data services you need. It ensures data ingestion stability and quality with Flink CDC, and quick data analysis with Apache Doris. The deployment of this architecture is simple, too. If you plan a data analytics upgrade for your business, you might refer to this case. If you need advice and help, you may join our [community here](https://t.co/ZxJuNJHXb2).
diff --git a/blog/Zhihu.md b/blog/Zhihu.md
new file mode 100644
index 00000000000..9bef297d53a
--- /dev/null
+++ b/blog/Zhihu.md
@@ -0,0 +1,140 @@
+---
+{
+ 'title': 'Database Dissection: How Fast Data Queries Are Implemented',
+ 'summary': "What's more important than quick performance itself is the
architectural design and mechanism that enable it.",
+ 'date': '2023-07-16',
+ 'author': 'Rong Hou',
+ 'tags': ['Best Practice'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+
+
+In data analytics, fast query performance is more of a result than a guarantee. What's more important than the result itself is the architectural design and mechanism that enable quick performance. This is exactly what this post is about. I will put you into context with a typical use case of Apache Doris, an open-source MPP-based analytic database.
+
+The user in this case is an all-category Q&A website. As a billion-dollar listed company, they have their own data management platform. What Doris does is support the data filtering, packaging, analyzing, and monitoring workloads of that platform. Given their huge data size, the user demands quick data loading and quick response to queries.
+
+## How to Enable Quick Queries on a Huge Dataset
+
+- **Scenario**: user segmentation for the website
+- **Data size**: 100 billion data objects, 2.4 million tags
+- **Requirements**: query response time < 1 second; result packaging < 10
seconds
+
+For these goals, the engineers have made three critical changes in their data
processing pipeline.
+
+### 1. Distribute the data
+
+User segmentation is when analysts pick out a group of website users that
share certain characteristics (tags). In the database system, this process is
implemented by a bunch of set operations (union, intersection, and difference).
+
+**Narration from the engineers:**
+
+We realize that instead of executing set operations on one big dataset, we can
divide our dataset into smaller ones, execute set operations on each of them,
and then merge all the results. In this way, each small dataset is computed by
one thread/queue. Then we have a queue to do the final merging. It's simple
distributed computing thinking.
+
+
+
+Example (see the SQL sketch after this list):
+
+1. Every 1 million users are put into one group with a `group_id`.
+2. All user tags in that same group will relate to the corresponding
`group_id`.
+3. Calculate the union/intersection/difference within each group. (Enable
multi-thread mode to increase computation efficiency.)
+4. Merge the results from the groups.
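+
+In SQL terms, the per-group set operation might look like the hedged sketch below, assuming a hypothetical `user_tag_bitmap` table that stores one user bitmap per (tag, group):
+
+```SQL
+-- A minimal sketch: count users carrying both tag 1 and tag 2, per group, then merge.
+-- The user_tag_bitmap table and its columns are hypothetical.
+SELECT SUM(bitmap_and_count(a.users, b.users)) AS matched_users
+FROM user_tag_bitmap a
+JOIN user_tag_bitmap b ON a.group_id = b.group_id
+WHERE a.tag_id = 1 AND b.tag_id = 2;
+```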
+
+The problem here is that since user tags are randomly distributed across various machines, the computation entails multiple rounds of data shuffling, which brings huge network overhead. That leads to the second change.
+
+### 2. Pre-bind a data group to a machine
+
+This is enabled by the Colocate mechanism of Apache Doris. The idea of
Colocate is to place data chunks that are often accessed together onto the same
node, so as to reduce cross-node data transfer and thus, get lower latency.
+
+
+
+The implementation is simple: Bind one group key to one machine. Then
naturally, data corresponding to that group key will be pre-bound to that
machine.
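+
+In Apache Doris, this is done with the `colocate_with` table property: tables in the same colocation group keep buckets with the same distribution key on the same node. A hedged sketch (the table, column, and group names are made up):
+
+```SQL
+-- A minimal sketch; names are hypothetical.
+CREATE TABLE user_tag_bitmap
+(
+    tag_id BIGINT,
+    group_id BIGINT,
+    users BITMAP BITMAP_UNION
+)
+AGGREGATE KEY(tag_id, group_id)
+DISTRIBUTED BY HASH(group_id) BUCKETS 32
+PROPERTIES ("colocate_with" = "tag_group");
+```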
+
+The following is the query plan before we adopted Colocate: it is complicated, with a lot of data shuffling.
+
+
+
+This is the query plan after. It is much simpler, which is why queries are
much faster and less costly.
+
+
+
+### 3. Merge the operators
+
+In data queries, the engineers realized that they often use a couple of functions in combination, so they decided to develop compound functions to further improve execution efficiency. They came to the Doris [community](https://t.co/XD4uUSROft) and talked about their thoughts. The Doris developers provided support for them, and soon the compound functions were ready for use in Doris. These are a few examples:
+
+```
+bitmap_and_count == bitmap_count(bitmap_and(bitmap1, bitmap2))
+bitmap_and_not_count == bitmap_count(bitmap_not(bitmap1, bitmap_and(bitmap1, bitmap2)))
+orthogonal_bitmap_union_count == bitmap_and(bitmap1, bitmap_and(bitmap2, bitmap3))
+```
+
+Query execution with one compound function is much faster than that with a
chain of simple functions, as you can tell from the lengths of the flow charts:
+
+
+
+- **Multiple simple functions**: This involves three function executions and two intermediate storage steps. It's a long and slow process.
+- **One compound function**: Simple in and out.
+
+## How to Quickly Ingest Large Amounts of Data
+
+This is about putting the right workload on the right component. Apache Doris supports a variety of data loading methods. After trial and error, the user settled on Spark Load and thus decreased their data loading time by 90%.
+
+**Narration from the engineers:**
+
+In offline data ingestion, we used to perform most computation in Apache Hive, write the data files to HDFS, and pull data regularly from HDFS into Apache Doris. However, after Doris obtains Parquet files from HDFS, it performs a series of operations on them before it can turn them into segment files: decompressing, bucketing, sorting, aggregating, and compressing. These workloads are borne by the Doris backends, which have to undertake a few bitmap operations at the same time. So there is [...]
+
+
+
+So we decided on the Spark Load method. It allows us to split the ingestion
process into two parts: computation and storage, so we can move all the
bucketing, sorting, aggregating, and compressing to Spark clusters. Then Spark
writes the output to HDFS, from which Doris pulls data and flushes it to the
local disks.
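+
+For reference, a Spark Load job is submitted with a `LOAD LABEL` statement that points to a pre-created Spark resource; here is a hedged sketch with made-up label, path, table, and resource names:
+
+```SQL
+-- A minimal sketch; the label, HDFS path, table, and resource are hypothetical.
+LOAD LABEL example_db.user_tags_20230716
+(
+    DATA INFILE("hdfs://namenode:8020/user/hive/warehouse/user_tags/*")
+    INTO TABLE user_tag_bitmap
+    FORMAT AS "parquet"
+)
+WITH RESOURCE 'spark_cluster0'
+PROPERTIES ("timeout" = "3600");
+```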
+
+
+
+When ingesting 1.2 TB of data (that's 110 billion rows), the Spark Load method only took 55 minutes.
+
+## A Vectorized Execution Engine
+
+In addition to the above changes, a large part of the performance of a
database relies on its execution engine. In the case of Apache Doris, it has
fully vectorized its storage and computation layers since version 1.1. The
longtime user also witnessed this revolution, so we invited them to test how
the vectorized engine worked.
+
+They compared query response time before and after vectorization in seven of their frequent scenarios:
+
+- Scenario 1: Simple user segmentation (hundreds of filtering conditions),
data packaging of a multi-million user group.
+- Scenario 2: Complicated user segmentation (thousands of filtering
conditions), data packaging of a tens-of-million user group.
+- Scenario 3: Multi-dimensional filtering (6 dimensions), single-table query,
**single-date flat table**, data aggregation, 180 million rows per day.
+- Scenario 4: Multi-dimensional filtering (6 dimensions), single-table query,
**multi-date flat table**, data aggregation, 180 million rows per day.
+- Scenario 5: **Single-table query**, COUNT, 180 million rows per day.
+- Scenario 6: **Multi-table query**, (Table A: 180 million rows, SUM, COUNT;
Table B: 1.5 million rows, bitmap aggregation), aggregate Table A and Table B,
join them with Table C, and then join the sub-tables, six joins in total.
+- Scenario 7: Single-table query, 500 million rows of itemized data
+
+The results are as follows:
+
+
+
+## Conclusion
+
+In short, what contributed to the fast data loading and data queries in this
case?
+
+- The Colocate mechanism that's designed for distributed computing
+- Collaboration between database users and [developers](https://t.co/ZxJuNJHXb2) that enabled the operator merging
+- Support for a wide range of data loading methods to choose from
+- A vectorized engine that brings an overall performance increase
+
+It takes efforts from both the database developers and users to make fast
performance possible. The user's experience and knowledge of their own status
quo will allow them to figure out the quickest path, while a good database
design will help pave the way and make users' life easier.
\ No newline at end of file
diff --git a/static/images/Dataupdate_1.png b/static/images/Dataupdate_1.png
new file mode 100644
index 00000000000..065b36a48a1
Binary files /dev/null and b/static/images/Dataupdate_1.png differ
diff --git a/static/images/Dataupdate_2.png b/static/images/Dataupdate_2.png
new file mode 100644
index 00000000000..ec817a7327a
Binary files /dev/null and b/static/images/Dataupdate_2.png differ
diff --git a/static/images/Dataupdate_3.png b/static/images/Dataupdate_3.png
new file mode 100644
index 00000000000..62c0d37797d
Binary files /dev/null and b/static/images/Dataupdate_3.png differ
diff --git a/static/images/Poly_1.png b/static/images/Poly_1.png
new file mode 100644
index 00000000000..6906d77b767
Binary files /dev/null and b/static/images/Poly_1.png differ
diff --git a/static/images/Poly_2.png b/static/images/Poly_2.png
new file mode 100644
index 00000000000..7076b6122b9
Binary files /dev/null and b/static/images/Poly_2.png differ
diff --git a/static/images/Poly_3.png b/static/images/Poly_3.png
new file mode 100644
index 00000000000..d9c417b8a47
Binary files /dev/null and b/static/images/Poly_3.png differ
diff --git a/static/images/Zhihu_1.png b/static/images/Zhihu_1.png
new file mode 100644
index 00000000000..3f83fa7c2a3
Binary files /dev/null and b/static/images/Zhihu_1.png differ
diff --git a/static/images/Zhihu_2.png b/static/images/Zhihu_2.png
new file mode 100644
index 00000000000..50e9534e0f0
Binary files /dev/null and b/static/images/Zhihu_2.png differ
diff --git a/static/images/Zhihu_3.png b/static/images/Zhihu_3.png
new file mode 100644
index 00000000000..6a5f5e7c5f6
Binary files /dev/null and b/static/images/Zhihu_3.png differ
diff --git a/static/images/Zhihu_4.png b/static/images/Zhihu_4.png
new file mode 100644
index 00000000000..6f103e3cb76
Binary files /dev/null and b/static/images/Zhihu_4.png differ
diff --git a/static/images/Zhihu_5.png b/static/images/Zhihu_5.png
new file mode 100644
index 00000000000..c391ab19b8b
Binary files /dev/null and b/static/images/Zhihu_5.png differ
diff --git a/static/images/Zhihu_6.png b/static/images/Zhihu_6.png
new file mode 100644
index 00000000000..61e7e54dcd1
Binary files /dev/null and b/static/images/Zhihu_6.png differ
diff --git a/static/images/Zhihu_7.png b/static/images/Zhihu_7.png
new file mode 100644
index 00000000000..256d5ec177d
Binary files /dev/null and b/static/images/Zhihu_7.png differ
diff --git a/static/images/Zhihu_8.png b/static/images/Zhihu_8.png
new file mode 100644
index 00000000000..c5ad192c382
Binary files /dev/null and b/static/images/Zhihu_8.png differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]