prodriguezdefino commented on code in PR #29619:
URL: https://github.com/apache/beam/pull/29619#discussion_r1440769425
##########
website/www/site/content/en/blog/scaling-streaming-workload.md:
##########
@@ -0,0 +1,291 @@
+---
+layout: post
+title: "Scaling a streaming workload on Apache Beam, 1 million events per second and beyond"
+date: 2023-12-01 00:00:01 -0800
+categories:
+  - blog
+authors:
+  - pabs
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Scaling a streaming workload on Apache Beam
+
+<img class="center-block"
+    src="/images/blog/scaling-streaming-workload/0-intro.png"
+    alt="Streaming Processing">
+
+Scaling a streaming workload is critical for ensuring that a pipeline can process large volumes of data while minimizing latency and executing efficiently. Without proper scaling, a pipeline may suffer performance problems or fail entirely, delaying the time to insights for the business.
+
+Given Apache Beam's support for the sources and sinks a workload needs, developing a streaming pipeline can be easy. You can focus on the processing (transformations, enrichments, or aggregations) and on setting the right configurations for each case.
+
+However, you need to identify the key performance bottlenecks and make sure that the pipeline has the resources it needs to handle the load efficiently. This can involve right-sizing the number of workers, understanding the settings needed for the source and sinks of the pipeline, optimizing the processing logic, and even determining the transport formats.
+
+This article illustrates how to scale and optimize a streaming workload developed in Apache Beam and run on Google Cloud using Dataflow. The goal is to reach one million events per second while minimizing latency and resource use during execution. The workload uses Pub/Sub as the streaming source and BigQuery as the sink. We describe the reasoning behind the configuration settings and code changes that helped the workload achieve the desired scale and beyond.
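+
+To make the workload concrete before diving into the tuning, the following is a minimal sketch of the shape of such a pipeline. The project, subscription, and table names, the payload decoding, and the Storage Write API settings are illustrative placeholders, not the exact code used in the test suite:
+
+```java
+import com.google.api.services.bigquery.model.TableRow;
+import java.nio.charset.StandardCharsets;
+import org.apache.beam.sdk.Pipeline;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
+import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
+import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
+import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.transforms.MapElements;
+import org.apache.beam.sdk.values.TypeDescriptor;
+import org.joda.time.Duration;
+
+public class IngestionPipeline {
+  public static void main(String[] args) {
+    Pipeline pipeline =
+        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
+
+    pipeline
+        // Read raw messages from the Pub/Sub subscription.
+        .apply("ReadFromPubSub",
+            PubsubIO.readMessages()
+                .fromSubscription("projects/my-project/subscriptions/my-subscription"))
+        // Decode each payload into a BigQuery row (placeholder logic).
+        .apply("ToTableRow",
+            MapElements.into(TypeDescriptor.of(TableRow.class))
+                .via((PubsubMessage msg) ->
+                    new TableRow().set(
+                        "payload", new String(msg.getPayload(), StandardCharsets.UTF_8))))
+        // Append rows to the destination table through the Storage Write API.
+        .apply("WriteToBigQuery",
+            BigQueryIO.writeTableRows()
+                .to("my-project:my_dataset.my_table")
+                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
+                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
+                .withMethod(Method.STORAGE_WRITE_API)
+                // Illustrative values; streaming writes with STORAGE_WRITE_API
+                // need a triggering frequency and a stream count.
+                .withTriggeringFrequency(Duration.standardSeconds(10))
+                .withNumStorageWriteApiStreams(50));
+
+    pipeline.run();
+  }
+}
+```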
+
+Finally, it is worth mentioning that the progression described in this article maps to the evolution of a real-life workload, with some simplifications. After the initial business requirements for the pipeline were met, the focus shifted to optimizing performance and reducing the resources needed for the pipeline execution.
+
+## Execution setup
+
+For this article, we created a test suite that provisions the components the pipelines need to execute. You can find the code in [this GitHub repository](https://github.com/prodriguezdefino/apache-beam-streaming-tests). The changes introduced on each run are available [in the scaling-streaming-workload-blog folder](https://github.com/prodriguezdefino/apache-beam-streaming-tests/tree/main/scaling-streaming-workload-blog) as scripts that you can run to achieve similar results.
+
+Each of the execution scripts can also run a Terraform-based automation to create a Pub/Sub topic and subscription as well as a BigQuery dataset and table for the workload. Each script then launches two pipelines: a data generation pipeline that pushes events to the Pub/Sub topic, and an ingestion pipeline that demonstrates the potential improvement points.
+
+In all cases, the pipelines start with an empty Pub/Sub topic and subscription and an empty BigQuery table. The plan is to generate one million events per second and, after a few minutes, review how the ingestion pipeline scales over time. The autogenerated data is based on the provided schemas or IDL (depending on the configuration), with messages ranging between 800 bytes and 2 KB, adding up to a volume throughput of approximately 1 GB/s. The ingestion pipelines use the same worker type on all runs (`n2d-standard-4` GCE machines) and cap the maximum number of workers to avoid very large fleets.
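+
+As a rough sketch, this worker sizing translates to Dataflow pipeline options along the following lines. The cap of 50 workers is an illustrative value, not the exact limit used in the runs:
+
+```java
+import org.apache.beam.runners.dataflow.DataflowRunner;
+import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+
+public class WorkerConfig {
+  static DataflowPipelineOptions workerOptions(String[] args) {
+    DataflowPipelineOptions options =
+        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
+    options.setRunner(DataflowRunner.class);
+    // Same worker type on every run.
+    options.setWorkerMachineType("n2d-standard-4");
+    // Cap autoscaling to avoid a very large fleet (illustrative value).
+    options.setMaxNumWorkers(50);
+    return options;
+  }
+}
+```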

Review Comment:
   Done.