This is an automated email from the ASF dual-hosted git repository.
luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new b44ce6d2135 add blog xiaoe-tech (#332)
b44ce6d2135 is described below
commit b44ce6d2135092e5d3573372c4fd0cb160899987
Author: Hu Yanjun <[email protected]>
AuthorDate: Mon Oct 30 12:14:17 2023 +0800
add blog xiaoe-tech (#332)
* Add files via upload
* Add files via upload
---
...appens-in-real-time-is-analyzed-in-real-time.md | 222 +++++++++++++++++++++
static/images/xiaoe-tech-1.png | Bin 0 -> 578583 bytes
static/images/xiaoe-tech-2.png | Bin 0 -> 1088752 bytes
3 files changed, 222 insertions(+)
diff --git
a/blog/data-analysis-for-live-streaming-what-happens-in-real-time-is-analyzed-in-real-time.md
b/blog/data-analysis-for-live-streaming-what-happens-in-real-time-is-analyzed-in-real-time.md
new file mode 100644
index 00000000000..0ead2920ef8
--- /dev/null
+++
b/blog/data-analysis-for-live-streaming-what-happens-in-real-time-is-analyzed-in-real-time.md
@@ -0,0 +1,222 @@
+---
+{
+ 'title': 'Data Analysis for Live Streaming: What Happens in Real Time is
Analyzed in Real Time',
+ 'summary': "As live streaming emerges as a way of doing business, the need
for data analysis follows up. This post is about how a live streaming service
provider with 800 million end users found the right database to support its
analytic solution.",
+ 'date': '2023-10-30',
+ 'author': 'He Gong',
+ 'tags': ['Best Practice'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+
+
+## What's different about data analytics in live streaming?
+
+Live streaming is a typical use case for real-time data analysis because it is all about speed. Livestream organizers need to keep abreast of the latest data to see what is happening and maximize effectiveness, and that requires high efficiency in every step of data processing:
+
+- **Data writing**: A live event churns out huge amounts of data every second,
so the database should be able to ingest such high throughput stably.
+- **Data update**: Like life itself, live streaming entails a lot of data changes, so there should be a quick and reliable data updating mechanism to absorb them.
+- **Data queries**: Data should be ready and accessible as soon as analysts
want it. Mostly that means real-time visibility.
+- **Maintenance**: What's special about live streaming is that the data stream
has prominent highs and lows. The analytic system should be able to ensure
stability during peak times, and allow scaling down in off-peak times in order
to improve resource utilization. If possible, it should also provide disaster
recovery services to guarantee system availability, since the worst case in
live streaming is interruption.
+
+The rest of this post is about how a live streaming service provider with 800
million end users found the right database to support its analytic solution.
+
+## Simplify the Components
+
+In this case, the live streaming data analytic platform adopted the Lambda architecture, which consists of a batch processing pipeline and a streaming pipeline: the former for user profile information, the latter for data generated in real time, including metrics like subscriptions, visitor counts, comments, and responses.
+
+- **Batch processing**: Basic user information stored in HDFS is written into HBase to form a table.
+- **Streaming**: Data generated in real time is collected from MySQL via Flink CDC and goes into Apache Kafka; Flink works as the computation engine, and the results are stored in Redis (a sketch of this pipeline follows the list).
+
+![Lambda architecture of the old platform](/images/xiaoe-tech-1.png)
+
+The real-time metrics will be combined with the user profile information to
form a flat table, and Elasticsearch will work as the query engine.
+
+As their business burgeons, the expanding data size becomes a heavy burden on this platform, leading to problems like:
+
+- **Delayed data writing**: The multiple components entail multiple steps in the writing pipeline, which inevitably prolongs data writing, especially during peak times.
+- **Complicated updating mechanism**: Every time there is a data change, such as in user subscription information, it must be applied to the main tables and dimension tables, which are then correlated to generate a new flat table. This long process has to be executed across multiple components, so just imagine the complexity.
+- **Slow queries**: As the query engine, Elasticsearch struggles with concurrent query requests and data access. It is also not flexible enough to handle join queries.
+- **Time-consuming maintenance**: All engineers developing or maintaining this
platform need to master all the components. That's a lot of training. And
adding new metrics to the data pool is labor-intensive.
+
+To sum up, the main problem with this architecture is its complexity. Reducing the components means finding a database that is not only capable of most workloads, but also performant in data writing and queries. After 6 months of testing, they finally upgraded their live streaming analytic platform with [Apache Doris](https://doris.apache.org/).
+
+They converged the streaming and batch processing pipelines on Apache Doris. It undertakes the analytic workloads and also provides a storage layer, so data doesn't have to shuffle back to Elasticsearch and HBase as it did in the old architecture.
+
+With Apache Doris as the data warehouse, the platform architecture becomes
neater.
+
+![Simplified architecture with Apache Doris as the data warehouse](/images/xiaoe-tech-2.png)
+
+- **Smooth data writing**: Raw data is processed by Flink and written into
Apache Doris in real time. The Doris community provides a
[Flink-Doris-Connector](https://github.com/apache/doris-flink-connector) with
built-in Flink CDC.
+- **Flexible data update**: For data changes, Apache Doris implements [Merge-on-Write](https://doris.apache.org/docs/data-table/data-model/#merge-on-write). This is especially useful in small-batch real-time writing because you don't have to renew the entire flat table. It also supports partial update of columns, which is another way to make data updates more lightweight (see the sketch after this list). In this case, Apache Doris is able to finish Upsert or Insert Overwrite operations at **200,000 rows per second**, a [...]
+- **Faster queries**: For join queries, Apache Doris can easily join multiple
large tables (10 billion rows). It can respond to a rich variety of queries
within seconds or even milliseconds, including tag retrievals, fuzzy queries,
ranking, and paginated queries.
+- **Easier maintenance**: As for Apache Doris itself, the frontend and backend nodes are both flexibly scalable, and it is compatible with the MySQL protocol. What took the developers a month can now be finished within a week, which allows for more agile iteration of metrics.
+
+The above shows how Apache Doris speeds up the entire data processing pipeline
with its all-in-one capabilities. Beyond that, it has some delightful features
that can increase query efficiency and ensure service reliability in the case
of live streaming.
+
+## Disaster Recovery
+
+The last thing you want in live streaming is service breakdown, so disaster
recovery is necessary.
+
+Before the live streaming platform had Apache Doris in place, they only backed
up their data to object storage. It took an hour from when a failure was
reported to when it was fixed. That one-hour window is fatal for live commerce
because viewers will leave immediately. Thus, disaster recovery must be quick.
+
+Now, with Apache Doris, they have a dual-cluster solution: a primary cluster and a backup cluster. This is for hot backup. Besides that, they keep a cold backup plan, which is the same as before: backing up their everyday data to object storage via the Backup and Freeze policies (a sketch follows).
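+
+In Doris SQL, such a cold backup could look like the sketch below. The repository name, bucket, and table list are assumptions, and the exact S3 property keys vary by Doris version.
+
+```SQL
+-- Hypothetical repository on object storage for cold backup.
+create repository cold_backup_repo
+with s3
+on location "s3://backup-bucket/doris"
+properties (
+"AWS_ENDPOINT" = "",
+"AWS_ACCESS_KEY" = "",
+"AWS_SECRET_KEY" = "",
+"AWS_REGION" = ""
+);
+
+-- Snapshot the tag tables (defined later in this post) into the repository.
+backup snapshot db.snapshot_20231030
+to cold_backup_repo
+on (tags, tags_version);
+```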
+
+This is how they did hot backup before [Apache Doris 2.0](https://doris.apache.org/zh-CN/blog/release-note-2.0.0):
+
+- **Data dual-write**: Write data to both the primary cluster and backup
cluster.
+- **Load balancing**: In case there is something wrong with one cluster, query
requests can be directed to the other cluster via reverse proxy.
+- **Monitoring**: Regularly check data consistency between the two clusters, for example with a lightweight probe like the one sketched below.
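+
+A minimal consistency probe, assuming the tag version table from the snippets later in this post: run the same aggregate on both clusters and alert when the results diverge.
+
+```SQL
+-- Run on both clusters; mismatched counts or versions indicate drift.
+select
+count(*) as row_cnt,
+max(version) as latest_version
+from db.tags_version;
+```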
+
+Apache Doris 2.0 supports [Cross Cluster Replication (CCR)](https://doris.apache.org/zh-CN/blog/release-note-2.0.0#support-for-cross-cluster-replication-ccr), which can automate the above processes to reduce maintenance costs and the risk of inconsistency caused by human error.
+
+## Data Visualization
+
+In addition to reporting, dashboarding, and ad-hoc queries, the platform also
allows analysts to configure various data sources to produce their own
visualized data lists.
+
+Apache Doris is compatible with most BI tools on the market, so the platform developers can tap into that and provide a broader set of functionalities for live streamers.
+
+Also, built on the real-time capabilities and quick computation of Apache Doris, live streamers can view data and see what happens in real time, instead of waiting a day for data analysis.
+
+## Bitmap Index to Accelerate Tag Queries
+
+A big part of data analysis in live streaming is viewer profiling. Viewers are divided into groups based on their online footprint and given tags like "watched for over one minute" and "visited during the past minute". As the show goes on, viewers are constantly tagged and untagged; in the data warehouse, that means frequent data insertion and deletion. Plus, one viewer is given multiple tags, and gaining an overall understanding of users entails join queries, which is why the join perfor [...]
+
+The following snippets give you a general idea of how to tag users and conduct
tag queries in Apache Doris.
+
+**Create a Tag Table**
+
+A tag table lists all the tags given to the viewers and maps them to the corresponding viewer IDs. The snippet below defines a Flink SQL table backed by the Doris connector; the connection parameters are left blank.
+
+```SQL
+-- Flink SQL sink table for tags. The tag columns arrive as comma-separated
+-- ID strings and are converted to Doris BITMAP values on the way in.
+create table db.tags (
+u_id string,
+version string,
+a_tags string,
+m_tags string
+) with (
+'connector' = 'doris',
+'fenodes' = '',
+'table.identifier' = 'db.tags',
+'username' = '',
+'password' = '',
+'sink.properties.format' = 'json',
+'sink.properties.strip_outer_array' = 'true',
+'sink.properties.fuzzy_parse' = 'true',
+'sink.properties.columns' = 'u_id,version,a_tags,m_tags,a_tags=bitmap_from_string(a_tags),m_tags=bitmap_from_string(m_tags)',
+'sink.batch.interval' = '10s',
+'sink.batch.size' = '100000'
+);
+```
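+
+For reference, the Doris table behind this sink would need BITMAP columns to receive the converted values. Below is a possible schema inferred from the column mapping above; the key choice, bucket count, and replication settings are assumptions.
+
+```SQL
+-- Hypothetical Doris DDL (executed in Doris, not Flink). BITMAP columns
+-- live on the Aggregate model with an aggregation such as BITMAP_UNION.
+create table db.tags (
+u_id varchar(64),
+version varchar(64),
+a_tags bitmap bitmap_union,
+m_tags bitmap bitmap_union
+)
+aggregate key(u_id, version)
+distributed by hash(u_id) buckets 10
+properties ("replication_num" = "3");
+```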
+
+**Create a Tag Version Table**
+
+The tag table is constantly changing, so there are different versions of it as
time goes by.
+
+```SQL
+-- Flink SQL sink table for tag versions (connection parameters left blank).
+-- The unused id column from the original draft is dropped; downstream
+-- queries only join on u_id and version.
+create table db.tags_version (
+u_id string,
+version string
+) with (
+'connector' = 'doris',
+'fenodes' = '',
+'table.identifier' = 'db.tags_version',
+'username' = '',
+'password' = '',
+'sink.properties.format' = 'json',
+'sink.properties.strip_outer_array' = 'true',
+'sink.properties.fuzzy_parse' = 'true',
+'sink.properties.columns' = 'u_id,version',
+'sink.batch.interval' = '10s',
+'sink.batch.size' = '100000'
+);
+```
+
+**Write Data into Tag Table and Tag Version Table**
+
+```SQL
+-- Route each source record into the tag table and the version table.
+-- (a_tags and m_tags are assumed to exist in db.source, matching the
+-- sink column mapping above.)
+insert into db.tags
+select
+u_id,
+last_timestamp as version,
+a_tags,
+m_tags
+from db.source;
+
+insert into db.tags_version
+select
+u_id,
+last_timestamp as version
+from db.source;
+```
+
+**Tag Queries Accelerated by Bitmap Index**
+
+For example, analysts need to find the latest tags of a certain viewer with the last name Thomas. Apache Doris runs the LIKE operator against the user information table to find all the "Thomas" entries, filters the tags with the help of bitmap indexes and bitmap functions, and lastly joins the user information table, tag table, and tag version table to return the result.
+
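+A bitmap index can be built on a low-cardinality column ahead of time. A minimal sketch, assuming a tag column on the user table:
+
+```SQL
+-- Hypothetical bitmap index; Doris builds it on existing data and
+-- applies it to subsequent filters on the column.
+create index idx_user_tag on db.user (tag) using bitmap comment 'viewer tag index';
+```
+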
+**With almost a billion viewers, each carrying over a thousand tags, the bitmap index helps reduce the query response time to less than one second.**
+
+```SQL
+with t_user as (
+ select
+ u_id,
+ name
+ from db.user
+ where partition_id = 1
+ and name like '%Thomas%'
+),
+
+ t_tags as (
+ select
+ u_id,
+ version
+ from db.tags
+ where (
+ bitmap_and_count(a_tags,
bitmap_from_string("123,124,125,126,333")) > 0
+ )
+ ),
+
+  t_tag_version as (
+    select u_id, version
+    from db.tags_version
+  )
+
+select
+  t1.u_id,
+  t1.name
+from t_user t1
+join t_tags t2 on t1.u_id = t2.u_id
+join t_tag_version t3 on t2.u_id = t3.u_id and t2.version = t3.version
+order by t1.u_id desc
+limit 1,10;
+```
+
+## Conclusion
+
+Data analysis in live streaming is challenging for the underlying database, but it is also where the key competitiveness of Apache Doris comes into play. First of all, Apache Doris can handle most data processing workloads, so platform builders don't have to worry about putting many components together and the consequent maintenance issues. Secondly, it has a lot of query-accelerating features, including but not limited to indexes. After tackling the speed issues, the [Apache Doris develope [...]
+
+
+
+
+
+
+
diff --git a/static/images/xiaoe-tech-1.png b/static/images/xiaoe-tech-1.png
new file mode 100644
index 00000000000..fc099d3db51
Binary files /dev/null and b/static/images/xiaoe-tech-1.png differ
diff --git a/static/images/xiaoe-tech-2.png b/static/images/xiaoe-tech-2.png
new file mode 100644
index 00000000000..1a91671f74e
Binary files /dev/null and b/static/images/xiaoe-tech-2.png differ
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]