This is an automated email from the ASF dual-hosted git repository.
luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 74602a782dc Add three blogs (#279)
74602a782dc is described below
commit 74602a782dcfa85bf74859a215100b8b62608200
Author: Hu Yanjun <[email protected]>
AuthorDate: Wed Aug 2 18:00:33 2023 +0800
Add three blogs (#279)
---
blog/Data_Update.md | 226 +++++++++++++++++++++++++++++++++++++++++
blog/Poly.md | 105 +++++++++++++++++++
blog/Zhihu.md | 140 +++++++++++++++++++++++++
static/images/Dataupdate_1.png | Bin 0 -> 146321 bytes
static/images/Dataupdate_2.png | Bin 0 -> 40204 bytes
static/images/Dataupdate_3.png | Bin 0 -> 30428 bytes
static/images/Poly_1.png | Bin 0 -> 286095 bytes
static/images/Poly_2.png | Bin 0 -> 413181 bytes
static/images/Poly_3.png | Bin 0 -> 316024 bytes
static/images/Zhihu_1.png | Bin 0 -> 176681 bytes
static/images/Zhihu_2.png | Bin 0 -> 139038 bytes
static/images/Zhihu_3.png | Bin 0 -> 192461 bytes
static/images/Zhihu_4.png | Bin 0 -> 158507 bytes
static/images/Zhihu_5.png | Bin 0 -> 65366 bytes
static/images/Zhihu_6.png | Bin 0 -> 49858 bytes
static/images/Zhihu_7.png | Bin 0 -> 96180 bytes
static/images/Zhihu_8.png | Bin 0 -> 80053 bytes
17 files changed, 471 insertions(+)
diff --git a/blog/Data_Update.md b/blog/Data_Update.md
new file mode 100644
index 00000000000..d8bbd04511f
--- /dev/null
+++ b/blog/Data_Update.md
@@ -0,0 +1,226 @@
+---
+{
+ 'title': 'Is Your Latest Data Really the Latest? Check the Data Update Mechanism of Your Database',
+ 'summary': "This is about how to support both row update and partial column update in a database in a way that is simple to execute and reliable in guaranteeing data quality.",
+ 'date': '2023-07-24',
+ 'author': 'Apache Doris',
+ 'tags': ['Tech Sharing'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+In databases, a data update means adding, deleting, or modifying data. Timely data updates are an important part of high-quality data services.
+
+Technically speaking, there are two types of data updates: you either update a whole row (**Row Update**) or just update part of the columns (**Partial Column Update**). Many databases support both of them, but in different ways. This post is about an approach that is simple to execute and reliable in guaranteeing data quality.
+
+As an open source analytic database, Apache Doris supports both Row Update and
Partial Column Update with one data model: the **Unique Key Model**. It is
where you put data that doesn't need to be aggregated. In the Unique Key Model,
you can specify one column or the combination of several columns as the Unique
Key (a.k.a. Primary Key). For one Unique Key, there will always be one row of
data: the newly ingested data record replaces the old. That's how data updates
work.
+
+The idea is straightforward, but in real-life implementation, it happens that the latest data does not arrive last or doesn't even get written at all, so I'm going to show you how Apache Doris implements data updates and avoids mix-ups with its Unique Key Model.
+
+
+
+## Row Update
+
+For data written to the Unique Key Model, Apache Doris adopts the **Upsert** semantics, meaning **Update or Insert**: if the new data record includes a Unique Key that already exists in the table, the new record replaces the old one; if it includes a brand-new Unique Key, the new record is inserted into the table as a whole. The Upsert operation provides high throughput while guaranteeing data reliability.
+
+**Example**:
+
+In the following table, the Unique Key is the combination of three columns:
`user_id, date, group_id`.
+
+```SQL
+mysql> desc test_table;
++-------------+--------------+------+-------+---------+-------+
+| Field | Type | Null | Key | Default | Extra |
++-------------+--------------+------+-------+---------+-------+
+| user_id | BIGINT | Yes | true | NULL | |
+| date | DATE | Yes | true | NULL | |
+| group_id | BIGINT | Yes | true | NULL | |
+| modify_date | DATE | Yes | false | NULL | NONE |
+| keyword | VARCHAR(128) | Yes | false | NULL | NONE |
++-------------+--------------+------+-------+---------+-------+
+```
+
+Execute `insert into` to write a data record. Since the table was empty, by the Upsert semantics, this adds a new row to the table.
+
+```SQL
+mysql> insert into test_table values (1, "2023-04-28", 2, "2023-04-28", "foo");
+Query OK, 1 row affected (0.05 sec)
+{'label':'insert_2fb45d1833db4348_b612b8791c97b467', 'status':'VISIBLE',
'txnId':'343'}
+
+mysql> select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+| 1 | 2023-04-28 | 2 | 2023-04-28 | foo |
++---------+------------+----------+-------------+---------+
+```
+
+Then insert two more data records, one of which has the same Unique Key as the previously inserted row. By the Upsert semantics, the new row replaces the old one of the same Unique Key, while the record with the new Unique Key is inserted.
+
+```SQL
+mysql> insert into test_table values (1, "2023-04-28", 2, "2023-04-29",
"foo"), (2, "2023-04-29", 2, "2023-04-29", "bar");
+Query OK, 2 rows affected (0.04 sec)
+{'label':'insert_7dd3954468aa4ac1_a63a3852e3573b4c', 'status':'VISIBLE',
'txnId':'344'}
+
+mysql> select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+| 2 | 2023-04-29 | 2 | 2023-04-29 | bar |
+| 1 | 2023-04-28 | 2 | 2023-04-29 | foo |
++---------+------------+----------+-------------+---------+
+```
+
+## Partial Column Update
+
+Besides row update, in many circumstances data analysts require the convenience of partial column updates. For example, in user portraits, they may want to update certain dimensions of their users in real time. Or, if they maintain a flat table built from various source tables, they may prefer partial column updates over complicated Join operations as a way of updating data.
+
+Apache Doris supports partial column update with the UPDATE statement. It filters out the rows that need to be modified, reads them, changes a few values, and writes the rows back to the table.
+
+**Example**:
+
+Suppose that there is an order table, in which the Order ID is the Unique Key.
+
+```SQL
++----------+--------------+-----------------+
+| order_id | order_amount | order_status |
++----------+--------------+-----------------+
+| 1 | 100 | Payment Pending |
++----------+--------------+-----------------+
+1 row in set (0.01 sec)
+```
+
+When the buyer completes the payment, Apache Doris should change the order
status of Order ID 1 from "Payment Pending" to "Delivery Pending". This is when
the Update command comes into play.
+
+```SQL
+mysql> UPDATE test_order SET order_status = 'Delivery Pending' WHERE order_id
= 1;
+Query OK, 1 row affected (0.11 sec)
+{'label':'update_20ae22daf0354fe0-b5aceeaaddc666c5', 'status':'VISIBLE',
'txnId':'33', 'queryId':'20ae22daf0354fe0-b5aceeaaddc666c5'}
+```
+
+This is the table after updating.
+
+```SQL
++----------+--------------+------------------+
+| order_id | order_amount | order_status |
++----------+--------------+------------------+
+| 1 | 100 | Delivery Pending |
++----------+--------------+------------------+
+1 row in set (0.01 sec)
+```
+
+The execution of the Update command consists of three steps in the system:
+
+- Step One: Read the row where Order ID = 1 (1, 100, 'Payment Pending')
+- Step Two: Modify the order status from "Payment Pending" to "Delivery
Pending" (1, 100, 'Delivery Pending')
+- Step Three: Insert the new row into the table
+
+
+
+The table is in the Unique Key Model, which means that for rows with the same Unique Key, only the last inserted one is retained, so this is what the table will finally look like:
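+
+```SQL
++----------+--------------+------------------+
+| order_id | order_amount | order_status     |
++----------+--------------+------------------+
+| 1        | 100          | Delivery Pending |
++----------+--------------+------------------+
+```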
+
+
+
+## Order of Data Updates
+
+So far this sounds simple, but in the real world, data updates might fail due to reasons such as data format errors, which can mess up the data writing order. The order of data updates matters more than you might imagine. For example, in financial transactions, a messed-up writing order can lead to transaction data losses, errors, or duplication, which in turn lead to bigger problems.
+
+Apache Doris provides two options for users to guarantee that their data is
updated in the correct order:
+
+**1. Update by the order of transaction commit**
+
+In Apache Doris, each data ingestion task is a transaction. Each successful ingestion task is given a data version, and data versions are strictly increasing. If the ingestion fails, the transaction is rolled back, and no new data version is generated.
+
+By default, the Upsert semantics follow the order of the transaction commits. If there are two data ingestion tasks involving the same Unique Key, with the first task generating data version 2 and the second data version 3, then according to the transaction commit order, data version 3 will replace data version 2.
+
+**2. Update by the user-defined order**
+
+In real-time data analytics, data updates often happen in high concurrency. It
is possible that there are multiple data ingestion tasks updating the same row,
but these tasks are committed in unknown order, so the last saved update
remains unknown, too.
+
+For example, here are two data updates, with "2023-04-30" and "2023-05-01" as the `modify_date`, respectively. If they are written into the system concurrently, but the "2023-05-01" one is successfully committed first and the other later, then the "2023-04-30" record will be saved due to its higher data version number, even though we know it is not the latest one.
+
+```Plain
+mysql> insert into test_table values (2, "2023-04-29", 2, "2023-05-01", "bbb");
+Query OK, 1 row affected (0.04 sec)
+{'label':'insert_e2daf8cea5524ee1_94e5c87e7bb74d67', 'status':'VISIBLE',
'txnId':'345'}
+
+mysql> insert into test_table values (2, "2023-04-29", 2, "2023-04-30", "aaa");
+Query OK, 1 row affected (0.03 sec)
+{'label':'insert_ef906f685a7049d0_b135b6cfee49fb98', 'status':'VISIBLE',
'txnId':'346'}
+
+mysql> select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+| 2 | 2023-04-29 | 2 | 2023-04-30 | aaa |
+| 1 | 2023-04-28 | 2 | 2023-04-29 | foo |
++---------+------------+----------+-------------+---------+
+```
+
+That's why in high-concurrency scenarios, Apache Doris allows data updates in user-defined order. Users can designate a column as the Sequence Column. In this way, the system identifies and saves the latest data version based on the value in the Sequence Column.
+
+**Example:**
+
+You can designate a Sequence Column by specifying the
`function_column.sequence_col` property upon table creation.
+
+```SQL
+CREATE TABLE test.test_table
+(
+ user_id bigint,
+ date date,
+ group_id bigint,
+ modify_date date,
+ keyword VARCHAR(128)
+)
+UNIQUE KEY(user_id, date, group_id)
+DISTRIBUTED BY HASH (user_id) BUCKETS 32
+PROPERTIES(
+ "function_column.sequence_col" = 'modify_date',
+ "replication_num" = "1",
+ "in_memory" = "false"
+);
+```
+
+Then check and see: the data record with the highest value in the Sequence Column is the one saved:
+
+```SQL
+mysql> insert into test_table values (2, "2023-04-29", 2, "2023-05-01", "bbb");
+Query OK, 1 row affected (0.03 sec)
+{'label':'insert_3aac37ae95bc4b5d_b3839b49a4d1ad6f', 'status':'VISIBLE',
'txnId':'349'}
+
+mysql> insert into test_table values (2, "2023-04-29", 2, "2023-04-30", "aaa");
+Query OK, 1 row affected (0.03 sec)
+{'label':'insert_419d4008768d45f3_a6912e584cf1b500', 'status':'VISIBLE',
'txnId':'350'}
+
+mysql> select * from test_table;
++---------+------------+----------+-------------+---------+
+| user_id | date | group_id | modify_date | keyword |
++---------+------------+----------+-------------+---------+
+| 2 | 2023-04-29 | 2 | 2023-05-01 | bbb |
+| 1 | 2023-04-28 | 2 | 2023-04-29 | foo |
++---------+------------+----------+-------------+---------+
+```
+
+## Conclusion
+
+Congratulations. Now you've gained an overview of how data updates are implemented in Apache Doris. With this knowledge, you can basically guarantee the efficiency and accuracy of data updating. But wait, there is so much more to it. As Apache Doris 2.0 is going to provide more powerful Partial Column Update capabilities, with improved execution of the Update statement and support for more complicated multi-table Join queries, I will show you how to take advantage of them in detail [...]
+
diff --git a/blog/Poly.md b/blog/Poly.md
new file mode 100644
index 00000000000..20916cbf13b
--- /dev/null
+++ b/blog/Poly.md
@@ -0,0 +1,105 @@
+---
+{
+ 'title': 'For Entry-Level Data Engineers: How to Build a Simple but Solid
Data Architecture',
+ 'summary': "This article aims to provide a reference for non-tech companies who are seeking to empower their business with data analytics.",
+ 'date': '2023-07-31',
+ 'author': 'Zhenwei Liu',
+ 'tags': ['Apache Doris'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+For Entry-Level Data Engineers: How to Build a Simple but Solid Data
Architecture
+
+
+
+This article aims to provide a reference for non-tech companies who are seeking to empower their business with data analytics. You will learn the basics of how to build an efficient and easy-to-use data system, and I will walk you through every aspect of it with a use case of Apache Doris, an MPP-based analytic data warehouse.
+
+## What You Need
+
+This case is about a ticketing service provider who wants a data platform that boasts quick processing, low maintenance costs, and ease of use, and I think they speak for the majority of entry-level database users.
+
+A prominent feature of ticketing services is the periodic spikes in ticket orders, you know, before the shows go on. So from time to time, the company has a huge amount of new data rushing in and requires real-time processing of it, so they can make timely adjustments during the short sales window. But at other times, they don't want to spend too much energy and money on maintaining the data system. Furthermore, for a beginner in digital operation who only requires basic analytic functions [...]
+
+## Simple Architecture
+
+The building blocks of this architecture are simple. You only need Apache
Flink and Apache Kafka for data ingestion, and Apache Doris as an analytic data
warehouse.
+
+
+
+Connecting data sources to the data warehouse is simple, too. The key component, Apache Doris, supports various data loading methods to fit different data sources. You can perform column mapping, transformation, and filtering during data loading to avoid duplicate collection of data. To ingest a table, users only need to add the table name to the configurations instead of writing a script themselves.
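+
+For illustration, here is a hedged sketch of a Routine Load job that subscribes to a Kafka topic and performs column mapping and filtering on the way in (the database, table, column, broker, and topic names are all made up, not the user's actual configuration):
+
+```SQL
+-- A minimal sketch, assuming hypothetical names.
+CREATE ROUTINE LOAD ticket_db.load_orders ON orders
+COLUMNS TERMINATED BY ",",
+COLUMNS (order_id, order_time, amount, channel),  -- column mapping
+WHERE amount > 0                                  -- filtering during loading
+PROPERTIES (
+    "desired_concurrent_number" = "3"
+)
+FROM KAFKA (
+    "kafka_broker_list" = "broker1:9092",
+    "kafka_topic" = "ticket_orders"
+);
+```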
+
+## Data Update
+
+Flink CDC was found to be the optimal choice if you are looking for higher stability in data ingestion. It also allows you to update dynamically changing tables in real time. The process includes the following steps (see the sketch after this list):
+
+- Configure Flink CDC for the source MySQL database, so that it allows dynamic
updating of the table management configurations (which you can think of as the
"metadata").
+- Create two CDC jobs in Flink, one to capture the changed data (the Forward
stream), the other to update the table management configurations (the Broadcast
stream).
+- Configure all tables of the source database at the Sink end (the output end of Flink CDC). When there is a newly added table in the source database, the Broadcast stream will be triggered to update the table management configurations. (You just need to configure the tables, instead of "creating" the tables.)
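+
+As a hedged illustration of the Forward stream only, a Flink SQL job that captures changes from one MySQL table and writes them into Doris could look roughly like this (hostnames, credentials, and table names are placeholders):
+
+```SQL
+-- A minimal sketch of a Flink CDC-to-Doris pipeline; all names are placeholders.
+CREATE TABLE orders_src (
+    order_id BIGINT,
+    order_amount INT,
+    order_status STRING,
+    PRIMARY KEY (order_id) NOT ENFORCED
+) WITH (
+    'connector' = 'mysql-cdc',
+    'hostname' = 'mysql-host',
+    'port' = '3306',
+    'username' = 'user',
+    'password' = 'pass',
+    'database-name' = 'ticket_db',
+    'table-name' = 'orders'
+);
+
+CREATE TABLE orders_sink (
+    order_id BIGINT,
+    order_amount INT,
+    order_status STRING
+) WITH (
+    'connector' = 'doris',
+    'fenodes' = 'doris-fe:8030',
+    'table.identifier' = 'ticket_db.orders',
+    'username' = 'user',
+    'password' = 'pass',
+    'sink.label-prefix' = 'orders_sync'
+);
+
+INSERT INTO orders_sink SELECT * FROM orders_src;
+```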
+
+
+
+## Layering of Data Warehouse
+
+Data flows from various sources into the data warehouse, where it is cleaned
and organized before it is ready for queries and analysis. The data processing
here is divided into five typical layers. Such layering simplifies the data
cleaning process because it provides a clear division of labor and makes things
easier to locate and comprehend.
+
+- **ODS**: This is the prep zone of the data warehouse. The unprocessed
original data is put in the [Unique Key
Model](https://doris.apache.org/docs/dev/data-table/data-model/#unique-model)
of Apache Doris, which can avoid duplication of data.
+- **DWD**: This layer cleans, formats, and de-identifies data to produce fact
tables. Every detailed data record is preserved. Data in this layer is also put
into the Unique Key Model.
+- **DWS**: This layer produces flat tables of a certain theme (order, user,
etc.) based on data from the DWD layer.
+- **ADS**: This layer auto-aggregates data, which is implemented by the
[Aggregate Key
Model](https://doris.apache.org/docs/dev/data-table/data-model/#aggregate-model)
of Apache Doris.
+- **DIM**: The DIM layer accommodates dimension data (in this case, data about
the theaters, projects, and show sessions, etc.), which is used in combination
with the order details.
+
+After the original data goes through these layers, it is available for queries
via one data export interface.
+
+## Reporting
+
+Like many non-tech businesses, the ticketing service provider needs a data warehouse mainly for reporting. They derive trends and patterns from all kinds of data reports, and then figure out ways to improve management efficiency and increase sales. Specifically, this is the information they observe in their reports:
+
+- **Statistical Reporting**: These are the most frequently used reports,
including sales reports by theater, distribution channel, sales representative,
and show.
+- **Agile Reporting**: These are reports developed for specific purposes, such
as daily and weekly project data reports, sales summary reports, GMV reports,
and settlement reports.
+- **Data Analysis**: This involves data such as membership orders, attendance
rates, and user portraits.
+- **Dashboarding**: This is to visually display sales data.
+
+
+
+These are all entry-level tasks in data analytics. One of the biggest burdens for the data engineers was to quickly develop new reports as the internal analysts required them. The [Aggregate Key Model](https://doris.apache.org/docs/dev/data-table/data-model#aggregate-model) of Apache Doris is designed for this.
+
+### Quick aggregation to produce reports on demand
+
+For example, suppose that analysts want a sales report by sales representative. Data engineers can produce that through simple configuration:
+
+1. Put the original data in the Aggregate Key Model
+2. Specify the sales representative ID column and the payment date column as
the Key columns, and the order amount column as the Value column
+
+Then, order amounts of the same sales representative within the specified period of time will be auto-aggregated. Bam! That's the report you need!
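+
+As a hedged sketch of what that configuration boils down to (the table and column names below are made up, not the user's actual schema):
+
+```SQL
+-- A minimal sketch of the Aggregate Key Model for this report; names are hypothetical.
+CREATE TABLE sales_report
+(
+    rep_id BIGINT,                        -- sales representative ID (Key)
+    pay_date DATE,                        -- payment date (Key)
+    order_amount BIGINT SUM DEFAULT "0"   -- auto-summed upon ingestion (Value)
+)
+AGGREGATE KEY(rep_id, pay_date)
+DISTRIBUTED BY HASH(rep_id) BUCKETS 10;
+```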
+
+According to the user, this whole process only takes them 10~30 minutes, depending on the complexity of the report required. So the Aggregate Key Model largely relieves data engineers of the pressure of report development.
+
+### Quick response to data queries
+
+Most data analysts just want their target data to be returned the second they need it. In this case, the user often leverages two capabilities of Apache Doris to achieve quick query response.
+
+Firstly, Apache Doris is famously fast in Join queries. So if you need to extract information across multiple tables, you are in good hands. Secondly, in data analysis, it often happens that analysts issue the same request repeatedly. For example, they frequently want to check the sales data of different theaters. In this scenario, Apache Doris allows you to create a [Materialized View](https://doris.apache.org/docs/dev/query-acceleration/materialized-view/), which means you pre-aggregat [...]
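+
+To make that concrete, here is a hedged example assuming a hypothetical `sales` table: the materialized view below pre-aggregates sales by theater, and Doris can transparently serve matching aggregation queries from it.
+
+```SQL
+-- A minimal sketch; the sales table and its columns are hypothetical.
+CREATE MATERIALIZED VIEW theater_sales AS
+SELECT theater_id, SUM(order_amount)
+FROM sales
+GROUP BY theater_id;
+```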
+
+## Conclusion
+
+This is an overview of a simple data architecture and how it can provide the data services you need. It ensures data ingestion stability and quality with Flink CDC, and quick data analysis with Apache Doris. The deployment of this architecture is simple, too. If you plan a data analytics upgrade for your business, you might refer to this case. If you need advice and help, you may join our [community here](https://t.co/ZxJuNJHXb2).
diff --git a/blog/Zhihu.md b/blog/Zhihu.md
new file mode 100644
index 00000000000..9bef297d53a
--- /dev/null
+++ b/blog/Zhihu.md
@@ -0,0 +1,140 @@
+---
+{
+ 'title': 'Database Dissection: How Fast Data Queries Are Implemented',
+ 'summary': "What's more important than quick performance itself is the
architectural design and mechanism that enable it.",
+ 'date': '2023-07-16',
+ 'author': 'Rong Hou',
+ 'tags': ['Best Practice'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+
+
+In data analytics, fast query performance is more of a result than a guarantee. What's more important than the result itself is the architectural design and mechanism that enable quick performance. This is exactly what this post is about. I will put you into context with a typical use case of Apache Doris, an open-source MPP-based analytic database.
+
+The user in this case is an all-category Q&A website. As a billion-dollar listed company, they have their own data management platform. What Doris does is support the data filtering, packaging, analyzing, and monitoring workloads of that platform. Given their huge data size, the user demands quick data loading and quick response to queries.
+
+## How to Enable Quick Queries on a Huge Dataset
+
+- **Scenario**: user segmentation for the website
+- **Data size**: 100 billion data objects, 2.4 million tags
+- **Requirements**: query response time < 1 second; result packaging < 10
seconds
+
+For these goals, the engineers have made three critical changes in their data
processing pipeline.
+
+### 1. Distribute the data
+
+User segmentation is when analysts pick out a group of website users that
share certain characteristics (tags). In the database system, this process is
implemented by a bunch of set operations (union, intersection, and difference).
+
+**Narration from the engineers:**
+
+We realize that instead of executing set operations on one big dataset, we can
divide our dataset into smaller ones, execute set operations on each of them,
and then merge all the results. In this way, each small dataset is computed by
one thread/queue. Then we have a queue to do the final merging. It's simple
distributed computing thinking.
+
+
+
+Example (see the SQL sketch after this list):
+
+1. Every 1 million users are put into one group with a `group_id`.
+2. All user tags in that same group will relate to the corresponding
`group_id`.
+3. Calculate the union/intersection/difference within each group. (Enable
multi-thread mode to increase computation efficiency.)
+4. Merge the results from the groups.
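+
+In SQL terms, the per-group set operation might look like the hedged sketch below, assuming a hypothetical `user_tag_bitmap` table that stores one user bitmap per (tag, group):
+
+```SQL
+-- A minimal sketch: count users carrying both tag 1 and tag 2, per group, then merge.
+-- The user_tag_bitmap table and its columns are hypothetical.
+SELECT SUM(bitmap_and_count(a.users, b.users)) AS matched_users
+FROM user_tag_bitmap a
+JOIN user_tag_bitmap b ON a.group_id = b.group_id
+WHERE a.tag_id = 1 AND b.tag_id = 2;
+```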
+
+The problem here is that since user tags are randomly distributed across various machines, the computation entails multiple rounds of data shuffling, which brings huge network overhead. That leads to the second change.
+
+### 2. Pre-bind a data group to a machine
+
+This is enabled by the Colocate mechanism of Apache Doris. The idea of
Colocate is to place data chunks that are often accessed together onto the same
node, so as to reduce cross-node data transfer and thus, get lower latency.
+
+
+
+The implementation is simple: Bind one group key to one machine. Then
naturally, data corresponding to that group key will be pre-bound to that
machine.
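+
+In Apache Doris, this is done with the `colocate_with` table property: tables in the same colocation group keep buckets with the same distribution key on the same node. A hedged sketch (the table, column, and group names are made up):
+
+```SQL
+-- A minimal sketch; names are hypothetical.
+CREATE TABLE user_tag_bitmap
+(
+    tag_id BIGINT,
+    group_id BIGINT,
+    users BITMAP BITMAP_UNION
+)
+AGGREGATE KEY(tag_id, group_id)
+DISTRIBUTED BY HASH(group_id) BUCKETS 32
+PROPERTIES ("colocate_with" = "tag_group");
+```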
+
+The following is the query plan before we adopted Colocate: it is complicated, with a lot of data shuffling.
+
+
+
+This is the query plan after. It is much simpler, which is why queries are
much faster and less costly.
+
+
+
+### 3. Merge the operators
+
+In data queries, the engineers realized that they often use a couple of functions in combination, so they decided to develop compound functions to further improve execution efficiency. They came to the Doris [community](https://t.co/XD4uUSROft) and talked about their thoughts. The Doris developers provided support for them, and soon the compound functions were ready for use in Doris. These are a few examples:
+
+```
+bitmap_and_count == bitmap_count(bitmap_and(bitmap1, bitmap2))
+bitmap_and_not_count == bitmap_count(bitmap_not(bitmap1, bitmap_and(bitmap1, bitmap2)))
+orthogonal_bitmap_union_count == bitmap_and(bitmap1, bitmap_and(bitmap2, bitmap3))
+```
+
+Query execution with one compound function is much faster than that with a
chain of simple functions, as you can tell from the lengths of the flow charts:
+
+
+
+- **Multiple simple functions**: This involves three function executions and two intermediate storage steps. It's a long and slow process.
+- **One compound function**: Simple in and out.
+
+## How to Quickly Ingest Large Amounts of Data
+
+This is about putting the right workload on the right component. Apache Doris supports a variety of data loading methods. After trial and error, the user settled on Spark Load and thus decreased their data loading time by 90%.
+
+**Narration from the engineers:**
+
+In offline data ingestion, we used to perform most computation in Apache Hive, write the data files to HDFS, and pull data regularly from HDFS into Apache Doris. However, after Doris obtains Parquet files from HDFS, it performs a series of operations on them before it can turn them into segment files: decompressing, bucketing, sorting, aggregating, and compressing. These workloads are borne by the Doris backends, which have to undertake a few bitmap operations at the same time. So there is [...]
+
+
+
+So we decided on the Spark Load method. It allows us to split the ingestion
process into two parts: computation and storage, so we can move all the
bucketing, sorting, aggregating, and compressing to Spark clusters. Then Spark
writes the output to HDFS, from which Doris pulls data and flushes it to the
local disks.
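+
+For reference, a Spark Load job is submitted with a `LOAD LABEL` statement that points to a pre-created Spark resource; here is a hedged sketch with made-up label, path, table, and resource names:
+
+```SQL
+-- A minimal sketch; the label, HDFS path, table, and resource are hypothetical.
+LOAD LABEL example_db.user_tags_20230716
+(
+    DATA INFILE("hdfs://namenode:8020/user/hive/warehouse/user_tags/*")
+    INTO TABLE user_tag_bitmap
+    FORMAT AS "parquet"
+)
+WITH RESOURCE 'spark_cluster0'
+PROPERTIES ("timeout" = "3600");
+```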
+
+
+
+When ingesting 1.2 TB of data (that's 110 billion rows), the Spark Load method only took 55 minutes.
+
+## A Vectorized Execution Engine
+
+In addition to the above changes, a large part of the performance of a
database relies on its execution engine. In the case of Apache Doris, it has
fully vectorized its storage and computation layers since version 1.1. The
longtime user also witnessed this revolution, so we invited them to test how
the vectorized engine worked.
+
+They compared query response time before and after vectorization in seven of their frequent scenarios:
+
+- Scenario 1: Simple user segmentation (hundreds of filtering conditions),
data packaging of a multi-million user group.
+- Scenario 2: Complicated user segmentation (thousands of filtering
conditions), data packaging of a tens-of-million user group.
+- Scenario 3: Multi-dimensional filtering (6 dimensions), single-table query,
**single-date flat table**, data aggregation, 180 million rows per day.
+- Scenario 4: Multi-dimensional filtering (6 dimensions), single-table query,
**multi-date flat table**, data aggregation, 180 million rows per day.
+- Scenario 5: **Single-table query**, COUNT, 180 million rows per day.
+- Scenario 6: **Multi-table query**, (Table A: 180 million rows, SUM, COUNT;
Table B: 1.5 million rows, bitmap aggregation), aggregate Table A and Table B,
join them with Table C, and then join the sub-tables, six joins in total.
+- Scenario 7: Single-table query, 500 million rows of itemized data
+
+The results are as follows:
+
+
+
+## Conclusion
+
+In short, what contributed to the fast data loading and data queries in this
case?
+
+- The Colocate mechanism that's designed for distributed computing
+- Collaboration between database users and [developers](https://t.co/ZxJuNJHXb2) that enabled the operator merging
+- Support for a wide range of data loading methods to choose from
+- A vectorized engine that brings an overall performance increase
+
+It takes efforts from both the database developers and users to make fast
performance possible. The user's experience and knowledge of their own status
quo will allow them to figure out the quickest path, while a good database
design will help pave the way and make users' life easier.
\ No newline at end of file
diff --git a/static/images/Dataupdate_1.png b/static/images/Dataupdate_1.png
new file mode 100644
index 00000000000..065b36a48a1
Binary files /dev/null and b/static/images/Dataupdate_1.png differ
diff --git a/static/images/Dataupdate_2.png b/static/images/Dataupdate_2.png
new file mode 100644
index 00000000000..ec817a7327a
Binary files /dev/null and b/static/images/Dataupdate_2.png differ
diff --git a/static/images/Dataupdate_3.png b/static/images/Dataupdate_3.png
new file mode 100644
index 00000000000..62c0d37797d
Binary files /dev/null and b/static/images/Dataupdate_3.png differ
diff --git a/static/images/Poly_1.png b/static/images/Poly_1.png
new file mode 100644
index 00000000000..6906d77b767
Binary files /dev/null and b/static/images/Poly_1.png differ
diff --git a/static/images/Poly_2.png b/static/images/Poly_2.png
new file mode 100644
index 00000000000..7076b6122b9
Binary files /dev/null and b/static/images/Poly_2.png differ
diff --git a/static/images/Poly_3.png b/static/images/Poly_3.png
new file mode 100644
index 00000000000..d9c417b8a47
Binary files /dev/null and b/static/images/Poly_3.png differ
diff --git a/static/images/Zhihu_1.png b/static/images/Zhihu_1.png
new file mode 100644
index 00000000000..3f83fa7c2a3
Binary files /dev/null and b/static/images/Zhihu_1.png differ
diff --git a/static/images/Zhihu_2.png b/static/images/Zhihu_2.png
new file mode 100644
index 00000000000..50e9534e0f0
Binary files /dev/null and b/static/images/Zhihu_2.png differ
diff --git a/static/images/Zhihu_3.png b/static/images/Zhihu_3.png
new file mode 100644
index 00000000000..6a5f5e7c5f6
Binary files /dev/null and b/static/images/Zhihu_3.png differ
diff --git a/static/images/Zhihu_4.png b/static/images/Zhihu_4.png
new file mode 100644
index 00000000000..6f103e3cb76
Binary files /dev/null and b/static/images/Zhihu_4.png differ
diff --git a/static/images/Zhihu_5.png b/static/images/Zhihu_5.png
new file mode 100644
index 00000000000..c391ab19b8b
Binary files /dev/null and b/static/images/Zhihu_5.png differ
diff --git a/static/images/Zhihu_6.png b/static/images/Zhihu_6.png
new file mode 100644
index 00000000000..61e7e54dcd1
Binary files /dev/null and b/static/images/Zhihu_6.png differ
diff --git a/static/images/Zhihu_7.png b/static/images/Zhihu_7.png
new file mode 100644
index 00000000000..256d5ec177d
Binary files /dev/null and b/static/images/Zhihu_7.png differ
diff --git a/static/images/Zhihu_8.png b/static/images/Zhihu_8.png
new file mode 100644
index 00000000000..c5ad192c382
Binary files /dev/null and b/static/images/Zhihu_8.png differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]