talatuyarer commented on code in PR #29794: URL: https://github.com/apache/beam/pull/29794#discussion_r1432969802
########## website/www/site/content/en/blog/apache-beam-flink-and-kubernetes-part2.md:
##########
@@ -0,0 +1,161 @@
+---
+title: "Build a scalable, self-managed streaming infrastructure with Beam and Flink: Tackling Autoscaling Challenges - Part 2"
+date: 2023-12-18 09:00:00 -0400
+categories:
+  - blog
+authors:
+  - talat
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+Welcome back to Part 2 of our in-depth series on building and managing a service for Apache Beam Flink on Kubernetes. In this segment, we're taking a closer look at the hurdles we encountered while implementing autoscaling.
+
+<!--more-->
+
+# Build a scalable, self-managed streaming infrastructure with Flink: Tackling Autoscaling Challenges - Part 2
+
+## Introduction
+
+Welcome back to Part 2 of our in-depth series on building and managing a service for Apache Beam Flink on Kubernetes. In this segment, we're taking a closer look at the hurdles we encountered while implementing autoscaling. These challenges weren't just roadblocks; they were opportunities for us to innovate and enhance our system. Let's break down these issues, understand their context, and explore the solutions we developed.
+
+## Understanding Apache Beam Backlog Metrics in the Flink Runner Environment
+
+**The Challenge:** In our current setup, we are using Apache Flink for processing data streams. However, we've encountered a puzzling issue: our Flink job isn't showing the backlog metrics from Apache Beam. These metrics are critical for understanding the state and performance of our data pipelines.
+
+**What We Found:** Interestingly, we noticed that the metrics are actually being generated in KafkaIO, which is a part of our data pipeline handling Kafka streams. But when we try to monitor these metrics through the Apache Flink Metric system, they're nowhere to be found. This led us to suspect there might be an issue with the integration (or 'wiring') between Apache Beam and Apache Flink.
+
+**Digging Deeper:** On closer inspection, we found that the metrics should be emitted during the 'Checkpointing' phase of the data stream processing. This is a crucial step where the system takes a snapshot of the stream's state, and it's typically where metrics are generated for 'Unbounded' sources (sources that continuously stream data, like Kafka).
+
+**A Potential Solution:** We believe the root of the problem lies in how the metric context is set during this checkpointing phase. There seems to be a disconnect that's preventing the Beam metrics from being properly captured in the Flink Metric system. We've proposed a fix for this issue, which you can review and contribute to on our GitHub pull request here: [Apache Beam PR #29793](https://github.com/apache/beam/pull/29793).
+
+
+<img class="center-block"
+src="/images/blog/apache-beam-flink-and-kubernetes-part2/flink-backlog-metrics.png"
+alt="Apache Flink Beam Backlog Metrics">
+
+
+## Overcoming Challenges in Checkpoint Size Reduction for Autoscaling Beam Jobs

Review Comment: Done
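For readers skimming the excerpt above: the backlog metric the post is concerned with is, conceptually, the distance between the newest offset in each Kafka partition and the offset the pipeline has consumed so far. The following is a minimal, stdlib-only sketch of that idea; the function names are hypothetical and this is not Beam's or KafkaIO's actual implementation:

```python
# Illustrative sketch of the backlog concept a source like KafkaIO reports:
# for each partition, backlog is the gap between the latest available offset
# and the offset the pipeline has consumed. An autoscaler watches the total.
# Hypothetical names; NOT Beam's actual code.

def partition_backlog(latest_offset: int, consumed_offset: int) -> int:
    """Number of unconsumed records remaining in one partition."""
    return max(0, latest_offset - consumed_offset)

def total_backlog(latest: dict, consumed: dict) -> int:
    """Sum of per-partition backlogs across the topic."""
    return sum(
        partition_backlog(latest[p], consumed.get(p, 0)) for p in latest
    )

# Example: three partitions, with the reader lagging on two of them.
latest = {0: 1_000, 1: 500, 2: 750}
consumed = {0: 990, 1: 500, 2: 700}
print(total_backlog(latest, consumed))  # -> 60
```

If this value is never surfaced to the runner's metric system (the wiring gap the post describes), the autoscaler has no signal that the pipeline is falling behind.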
