(doris-website) branch master updated: Add blog of QIFU tech (#374)

luzhijing Tue, 26 Dec 2023 21:15:39 -0800

This is an automated email from the ASF dual-hosted git repository.

luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git



The following commit(s) were added to refs/heads/master by this push:
     new 2d251e074c5 Add blog of QIFU tech (#374)
2d251e074c5 is described below

commit 2d251e074c5c555b37ff34055930edb7ae72689c
Author: Hu Yanjun <[email protected]>
AuthorDate: Wed Dec 27 13:15:28 2023 +0800

    Add blog of QIFU tech (#374)
---
 ...ta-reporting-tagging-and-data-lake-analytics.md | 93 ++++++++++++++++++++++
 1 file changed, 93 insertions(+)

diff --git a/blog/apache-doris-speeds-up-data-reporting-tagging-and-data-lake-analytics.md b/blog/apache-doris-speeds-up-data-reporting-tagging-and-data-lake-analytics.md
new file mode 100644
index 00000000000..1713a546cfe
--- /dev/null
+++ b/blog/apache-doris-speeds-up-data-reporting-tagging-and-data-lake-analytics.md
@@ -0,0 +1,93 @@
+---
+{
+    'title': 'Apache Doris speeds up data reporting, tagging, and data lake analytics',
+    'summary': "The user leverages the capabilities of Apache Doris in reporting, customer tagging, and data lake analytics and achieves high performance.",
+    'date': '2023-12-27',
+    'author': 'Apache Doris',
+    'tags': ['Best Practice'],
+    "image": 'https://cdn.selectdb.com/static/apache_doris_speeds_up_data_reporting_tagging_and_data_lake_analytics_87a6746df5.png'
+}
+
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+As much as we say [Apache Doris](https://doris.apache.org/) is an all-in-one 
data platform that is capable of various analytics workloads, it is always 
more compelling to demonstrate that with real use cases. That's why I would like to 
share this user story with you. It is about how a fintech user leverages the 
capabilities of Apache Doris in reporting, customer tagging, and data lake 
analytics and achieves high performance.
+
+This fintech service provider is a long-term user of Apache Doris. They have 
almost 10 clusters for production, hundreds of Doris backend nodes, and 
thousands of CPU cores. The total data size is nearly 1 PB. Every day, they have 
hundreds of workflows running simultaneously, receive almost 10 billion new 
data records, and respond to millions of data queries.
+
+Before migrating to Apache Doris, they used ClickHouse, MySQL, and 
Elasticsearch. Friction then arose as their data size kept growing. They 
found it hard to scale out the ClickHouse clusters because there were too many 
dependencies. As for MySQL, they had to switch between multiple MySQL instances 
because a single instance had its limits and cross-instance queries were not 
supported.
+
+## Reporting
+
+### From ClickHouse + MySQL to Apache Doris
+
+Data reporting is one of the major services they provide to their customers, 
and it is bound by an SLA. They used to support this service with a 
combination of ClickHouse and MySQL, but the duration of their data 
synchronization fluctuated significantly, making it hard to meet the 
service levels outlined in the SLA. Diagnosis showed that the 
multiple components added to the complexity and instability of the data 
synchronization tasks. To fix that, they have u [...]
+
+<div style={{textAlign:'center'}}><img src="https://cdn.selectdb.com/static/from_clickhouse_mysql_to_apache_doris_6387c0363a.png" alt="from-clickhouse-mysql-to-apache-doris" width="840"/></div >
+
+### Performance improvements
+
+With Apache Doris, they ingest data via the [Broker 
Load](https://doris.apache.org/docs/1.2/data-operate/import/import-way/broker-load-manual)
 method and reach an SLA compliance rate of over 99% in terms of data 
synchronization performance.
+
+<div style={{textAlign:'center'}}><img src="https://cdn.selectdb.com/static/data_synchronization_size_and_duration_327e4dc1fe.png" alt="data-synchronization-size-and-duration" width="640"/></div >
+
+As for data queries, the Doris-based architecture maintains an **average query 
response time** of less than **10s** and a **P90 response time** of less than 
**30s**. This is a 50% speedup compared to the old architecture. 
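+
+To make the two metrics above concrete, here is a minimal sketch of how an average and a P90 response time could be computed from a batch of query latencies. The sample values, the nearest-rank percentile method, and the helper names are assumptions for illustration only; they are not the user's data or their monitoring setup.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical query response times in seconds (illustrative only).
latencies_s = [2, 3, 4, 5, 6, 7, 8, 9, 12, 28]

avg = mean(latencies_s)
p90 = percentile_nearest_rank(latencies_s, 90)

# The thresholds quoted in the post: average under 10s, P90 under 30s.
meets_targets = avg < 10 and p90 < 30
```

+Note how a single slow query (28s here) barely moves the P90, which is why the post reports both the average and a percentile.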
+
+<div style={{textAlign:'center'}}><img src="https://cdn.selectdb.com/static/average_query_response_time_372d71ef16.png" alt="average-query-response-time" width="840"/></div >
+
+<div style={{textAlign:'center'}}><img src="https://cdn.selectdb.com/static/query_response_time_percentile_756c6f6a71.png" alt="query-response-time-percentile" width="840"/></div >
+
+## Tagging
+
+Tagging is a common operation in customer analytics. You assign labels to 
customers based on their behaviors and characteristics, so that you can divide 
them into groups and work out a targeted marketing strategy for each group.
+
+In the old processing architecture where Elasticsearch was the processing 
engine, raw data was ingested and tagged properly. Then it was merged into 
JSON files and imported into Elasticsearch, which provided data services for 
analysts and marketers. In this process, the merging step was meant to reduce updates 
and relieve the load on Elasticsearch, but it turned out to be a troublemaker:
+
+- Any problematic data in any of the tags could spoil the entire merging 
operation and thus interrupt the data services.
+- The merging operation was implemented based on Spark and MapReduce and took 
up to 4 hours. Such a long time frame could encroach on marketing opportunities 
and lead to unseen losses.
+
+<div style={{textAlign:'center'}}><img src="https://cdn.selectdb.com/static/tagging_services_3263e21c36.png" alt="tagging-services" width="840"/></div >
+
+Apache Doris has since taken over this workload. It organizes tag data with its 
data models, which process data fast and smoothly. The aforementioned merging 
step can be done by the [Aggregate Key 
model](https://doris.apache.org/docs/data-table/data-model#aggregate-model), 
which aggregates tag data based on the specified Aggregate Key upon data 
ingestion. The [Unique Key 
model](https://doris.apache.org/docs/data-table/data-model#unique-model) is 
handy for partial column updates. Again, all yo [...]
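+
+The merge-on-ingest idea behind the Aggregate Key model can be pictured with a small sketch. This is plain Python that imitates the semantics (rows sharing a key are combined column by column as they arrive), not how Doris implements it. The column names, sample rows, and the `ingest` helper are hypothetical; the three merge rules loosely mirror the REPLACE, SUM, and MAX aggregation types.

```python
# Per-column merge rules, loosely mirroring REPLACE, SUM, and MAX.
# Column names are hypothetical tag columns chosen for illustration.
AGGREGATORS = {
    "last_city": lambda old, new: new,           # REPLACE: keep the newest value
    "order_count": lambda old, new: old + new,   # SUM: accumulate
    "max_spend": max,                            # MAX: keep the larger value
}


def ingest(table, row):
    """Merge one incoming row into the table, keyed by customer_id."""
    key = row["customer_id"]
    if key not in table:
        # First row for this key: store its value columns as-is.
        table[key] = {col: row[col] for col in AGGREGATORS}
        return
    current = table[key]
    for col, merge in AGGREGATORS.items():
        current[col] = merge(current[col], row[col])


table = {}
incoming_rows = [
    {"customer_id": 1, "last_city": "Shanghai", "order_count": 2, "max_spend": 120},
    {"customer_id": 2, "last_city": "Beijing", "order_count": 1, "max_spend": 80},
    {"customer_id": 1, "last_city": "Hangzhou", "order_count": 3, "max_spend": 95},
]
for row in incoming_rows:
    ingest(table, row)
```

+Because each row is merged as it arrives, a bad row for one tag only affects that key instead of a whole batch merge job, which is the contrast with the old JSON-merging step.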
+
+In terms of query performance, Doris is equipped with well-developed bitmap 
indexes and techniques tailored to high-concurrency queries, so in this case, 
it can finish **customer segmentation within seconds** and reach over **700 QPS 
in user-facing queries**.
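+
+Why bitmaps make segmentation fast is easier to see in code. In the sketch below, each tag is a bitmap in which bit `i` is set when customer `i` carries the tag, so combining tags becomes bitwise AND/NOT instead of row-by-row filtering. Plain Python integers stand in for the bitmaps here; the tag names, customer IDs, and helpers are made up, and this says nothing about Doris's internal bitmap format.

```python
def to_bitmap(customer_ids):
    """Build a bitmap (as a Python int) with one bit set per customer ID."""
    bitmap = 0
    for cid in customer_ids:
        bitmap |= 1 << cid
    return bitmap


def from_bitmap(bitmap):
    """Return the sorted customer IDs whose bits are set in the bitmap."""
    ids = []
    position = 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids


# Hypothetical tags and members (illustrative only).
tag_bitmaps = {
    "high_value": to_bitmap([1, 3, 5, 8]),
    "active_30d": to_bitmap([1, 2, 3, 8, 9]),
    "opted_out": to_bitmap([3]),
}

# Segment: high-value AND recently active AND NOT opted out.
segment = (
    tag_bitmaps["high_value"]
    & tag_bitmaps["active_30d"]
    & ~tag_bitmaps["opted_out"]
)
segment_ids = from_bitmap(segment)
segment_size = bin(segment).count("1")  # number of customers in the segment
```

+The segment is computed with three bitwise operations regardless of how the tags were produced, which is the property that keeps segmentation queries short.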
+
+## Data lake analytics
+
+In data lake scenarios, the data size you need to handle tends to be huge, but 
the data processing volume in each query tends to vary. To ensure fast data 
ingestion and high query performance on huge data sets, you need more 
resources. On the other hand, during off-peak hours, you want to scale down your 
cluster for more efficient resource management. How do you handle this dilemma?
+
+Apache Doris has a few features that are designed for data lake analytics, 
including Multi-Catalog and Compute Node. The former shields you from the 
headache of data ingestion in data lake analytics while the latter enables 
elastic cluster scaling.
+
+The 
[Multi-Catalog](https://doris.apache.org/docs/lakehouse/multi-catalog/?_highlight=multi&_highlight=catalog)
 mechanism allows you to connect Doris to a variety of external data sources so 
you can use Doris as a unified query gateway without worrying about bulky data 
ingestion into Doris.
+
+The [Compute Node](https://doris.apache.org/docs/advanced/compute-node/) of 
Apache Doris is a backend role that is designed for remote federated query 
workloads, like those in data lake analytics. Normal Doris backend nodes are 
responsible for both SQL query execution and data management, while the Compute 
Nodes in Doris, as the name implies, only perform computation. Compute Nodes 
are stateless, making them elastic enough for cluster scaling.
+
+The user introduces Compute Nodes into their cluster and deploys them with 
other components in a hybrid configuration. As a result, the cluster 
automatically scales down during the night, when there are fewer query 
requests, and scales out during the daytime to handle the massive query 
workload. This is more resource-efficient.
+
+For easier deployment, they have also optimized their Deploy on YARN process 
via Skein. As is shown below, they define the number of Compute Nodes and the 
required resources in the YAML file, and then pack the installation file, 
configuration file, and startup script into the distributed file system. In 
this way, they can start or stop the entire cluster of over 100 nodes within 
minutes using one simple line of code.
+
+<div style={{textAlign:'center'}}><img src="https://cdn.selectdb.com/static/skein_3516ba1a83.png" alt="skein" width="560"/></div >
+
+## Conclusion
+
+For data reporting and customer tagging, Apache Doris smooths the data ingestion 
and merging steps, and delivers high query performance based on its own design 
and functionality. For data lake analytics, the user improves resource 
efficiency by scaling clusters elastically with Compute Nodes. Along their 
journey with Apache Doris, they have also developed a data ingestion task 
prioritizing mechanism and contributed it to the Doris project. A gesture to 
facilitate their use case ends up  [...]
+
+Check out the Apache Doris [repo](https://github.com/apache/doris) on GitHub.
\ No newline at end of file


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
