prodriguezdefino commented on code in PR #29619:
URL: https://github.com/apache/beam/pull/29619#discussion_r1440769425
##########
website/www/site/content/en/blog/scaling-streaming-workload.md:
##########
@@ -0,0 +1,291 @@
+---
+layout: post
+title: "Scaling a streaming workload on Apache Beam, 1 million events per second and beyond"
+date: 2023-12-01 00:00:01 -0800
+categories:
+  - blog
+authors:
+  - pabs
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Scaling a streaming workload on Apache Beam
+
+<img class="center-block"
+    src="/images/blog/scaling-streaming-workload/0-intro.png"
+    alt="Streaming Processing">
+
+Scaling a streaming workload is critical for ensuring that a pipeline can process large volumes of data while minimizing latency and executing efficiently. Without proper scaling, a pipeline may suffer performance problems or fail entirely, delaying the time to insights for the business.
+
+Given Apache Beam's support for the sources and sinks a workload needs, developing a streaming pipeline can be easy. You can focus on the processing (transformations, enrichments, or aggregations) and on setting the right configurations for each case.
+
+However, you need to identify the key performance bottlenecks and make sure that the pipeline has the resources it needs to handle the load efficiently. This can involve right-sizing the number of workers, understanding the settings needed for the source and sinks of the pipeline, optimizing the processing logic, and even determining the transport formats.
+
+This article illustrates how to scale and optimize a streaming workload developed in Apache Beam and run on Google Cloud using Dataflow. The goal is to reach one million events per second while minimizing latency and resource use during execution. The workload uses Pub/Sub as the streaming source and BigQuery as the sink. We describe the reasoning behind the configuration settings and code changes that helped the workload achieve the desired scale and beyond.
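+
+To make the workload concrete before diving into the tuning, the following is a minimal sketch of the shape of such a pipeline. The project, subscription, and table names, the payload decoding, and the Storage Write API settings are illustrative placeholders, not the exact code used in the test suite:
+
+```java
+import com.google.api.services.bigquery.model.TableRow;
+import java.nio.charset.StandardCharsets;
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
+import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
+import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.transforms.MapElements;
+import org.apache.beam.sdk.values.TypeDescriptor;
+import org.joda.time.Duration;
+
+public class IngestionPipeline {
+  public static void main(String[] args) {
+    Pipeline pipeline =
+        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
+
+    pipeline
+        // Read raw messages from the Pub/Sub subscription.
+        .apply("ReadFromPubSub",
+            PubsubIO.readMessages()
+                .fromSubscription("projects/my-project/subscriptions/my-subscription"))
+        // Decode each payload into a BigQuery row (placeholder logic).
+        .apply("ToTableRow",
+            MapElements.into(TypeDescriptor.of(TableRow.class))
+                .via((PubsubMessage msg) ->
+                    new TableRow().set(
+                        "payload", new String(msg.getPayload(), StandardCharsets.UTF_8))))
+        // Append rows to the destination table through the Storage Write API.
+        .apply("WriteToBigQuery",
+            BigQueryIO.writeTableRows()
+                .to("my-project:my_dataset.my_table")
+                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
+                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
+                .withMethod(Method.STORAGE_WRITE_API)
+                // Illustrative values; streaming writes with STORAGE_WRITE_API
+                // need a triggering frequency and a stream count.
+                .withTriggeringFrequency(Duration.standardSeconds(10))
+                .withNumStorageWriteApiStreams(50));
+
+    pipeline.run();
+  }
+}
+```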
+
+Finally, it is worth mentioning that the progression described in this article maps to the evolution of a real-life workload, with some simplifications. After the initial business requirements for the pipeline were met, the focus shifted to optimizing performance and reducing the resources needed for the pipeline execution.
+
+## Execution setup
+
+For this article, we created a test suite that provisions the components the pipelines need to execute. You can find the code in [this GitHub repository](https://github.com/prodriguezdefino/apache-beam-streaming-tests). The changes introduced on each run are available [in the scaling-streaming-workload-blog folder](https://github.com/prodriguezdefino/apache-beam-streaming-tests/tree/main/scaling-streaming-workload-blog) as scripts that you can run to achieve similar results.
+
+Each of the execution scripts can also run a Terraform-based automation to create a Pub/Sub topic and subscription as well as a BigQuery dataset and table for the workload. Each script then launches two pipelines: a data generation pipeline that pushes events to the Pub/Sub topic, and an ingestion pipeline that demonstrates the potential improvement points.
+
+In all cases, the pipelines start with an empty Pub/Sub topic and subscription and an empty BigQuery table. The plan is to generate one million events per second and, after a few minutes, review how the ingestion pipeline scales over time. The autogenerated data is based on the provided schemas or IDL (depending on the configuration), with messages ranging between 800 bytes and 2 KB, adding up to a volume throughput of approximately 1 GB/s. The ingestion pipelines use the same worker type on all runs (`n2d-standard-4` GCE machines) and cap the maximum number of workers to avoid very large fleets.
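+
+As a rough sketch, this worker sizing translates to Dataflow pipeline options along the following lines. The cap of 50 workers is an illustrative value, not the exact limit used in the runs:
+
+```java
+import org.apache.beam.runners.dataflow.DataflowRunner;
+import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+
+public class WorkerConfig {
+  static DataflowPipelineOptions workerOptions(String[] args) {
+    DataflowPipelineOptions options =
+        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
+    options.setRunner(DataflowRunner.class);
+    // Same worker type on every run.
+    options.setWorkerMachineType("n2d-standard-4");
+    // Cap autoscaling to avoid a very large fleet (illustrative value).
+    options.setMaxNumWorkers(50);
+    return options;
+  }
+}
+```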

Review Comment:
   Done.