prodriguezdefino commented on code in PR #29619:
URL: https://github.com/apache/beam/pull/29619#discussion_r1440770388
##########
website/www/site/content/en/blog/scaling-streaming-workload.md:
##########
@@ -0,0 +1,291 @@
+---
+layout: post
+title: "Scaling a streaming workload on Apache Beam, 1 million events per second and beyond"
+date: 2023-12-01 00:00:01 -0800
+categories:
+  - blog
+authors:
+  - pabs
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Scaling a streaming workload on Apache Beam
+
+<img class="center-block"
+    src="/images/blog/scaling-streaming-workload/0-intro.png"
+    alt="Streaming Processing">
+
+Scaling a streaming workload is critical for ensuring that a pipeline can process large amounts of data while also minimizing latency and executing efficiently. Without proper scaling, a pipeline may experience performance issues or even fail entirely, delaying the time to insights for the business.
+
+Given Apache Beam's support for the sources and sinks needed by the workload, developing a streaming pipeline can be easy. You can focus on the processing (transformations, enrichments, or aggregations) and on setting the right configurations for each case.
+
+However, you need to identify the key performance bottlenecks and make sure that the pipeline has the resources it needs to handle the load efficiently. This can involve right-sizing the number of workers, understanding the settings needed for the source and sinks of the pipeline, optimizing the processing logic, and even determining the transport formats.
+
+This article illustrates how to scale and optimize a streaming workload developed in Apache Beam and run on Google Cloud using Dataflow. The goal is to reach one million events per second while minimizing latency and resource use during execution. The workload uses Pub/Sub as the streaming source and BigQuery as the sink. We describe the reasoning behind the configuration settings and code changes we used to help the workload achieve the desired scale and beyond.
+
+The progression described in this article maps to the evolution of a real-life workload, with simplifications. After the initial business requirements for the pipeline were met, the focus shifted to optimizing performance and reducing the resources needed for the pipeline execution.
+
+## Execution setup
+
+For this article, we built a test suite that creates the components the pipelines need to execute. You can find the code in [this GitHub repository](https://github.com/prodriguezdefino/apache-beam-streaming-tests). The changes introduced in each run are available as scripts in the [`scaling-streaming-workload-blog`](https://github.com/prodriguezdefino/apache-beam-streaming-tests/tree/main/scaling-streaming-workload-blog) folder; you can run them to reproduce similar results.

Review Comment:
   Fixed.
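For context on the workload shape the post describes (Pub/Sub source, processing stages, BigQuery sink), a minimal sketch of such a pipeline in the Beam Java SDK might look like the following. This is illustrative only: the subscription and table names are placeholders, and the actual pipeline lives in the linked apache-beam-streaming-tests repository.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class MinimalStreamingPipeline {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    pipeline
        // Read UTF-8 string payloads from a Pub/Sub subscription (placeholder name).
        .apply("ReadFromPubSub",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-subscription"))
        // Trivial processing stage standing in for the real transformations,
        // enrichments, or aggregations the post discusses.
        .apply("ToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((String payload) -> new TableRow().set("payload", payload)))
        // Stream the rows into an existing BigQuery table (placeholder reference).
        .apply("WriteToBigQuery",
            BigQueryIO.writeTableRows()
                .to("my-project:my_dataset.my_table")
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }
}
```

Running a sketch like this on Dataflow also requires the usual runner options, for example `--runner=DataflowRunner --project=<project> --region=<region>`.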
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
