damccorm commented on code in PR #29860: URL: https://github.com/apache/beam/pull/29860#discussion_r1480057309
##########
website/www/site/content/en/blog/apache-beam-flink-and-kubernetes-part3.md:
##########
@@ -173,6 +173,18 @@ In the kubernetes world, the call will look like this for a scale up:
 
 Apache Flink will handle the rest of the work needed to scale up.
 
+## Maintaining State for Stateful Streaming Application with Autoscaling
+
+Adapting Apache Flink's state recovery mechanisms for autoscaling involves leveraging its robust features like max parallelism, checkpointing, and the Adaptive Scheduler to ensure efficient and resilient stream processing, even as the system dynamically adjusts to varying loads. Here's how these components work together in an autoscaling context:
+
+1. **Max Parallelism** sets an upper limit on how much a job can scale out, ensuring that state can be redistributed across a larger or smaller number of nodes without exceeding predefined boundaries. This is crucial for autoscaling because it allows Flink to manage state effectively, even as the number of task slots changes to accommodate varying workloads.
+2. **Checkpointing** is at the heart of Flink's fault tolerance mechanism, periodically saving the state of each job to a durable storage(In our case it is GCS bucket). In an autoscaling scenario, checkpointing enables Flink to recover to a consistent state after scaling operations. When the system scales out (adds more resources) or scales in (removes resources), Flink can restore the state from these checkpoints, ensuring data integrity and processing continuity without losing critical information. In scale down or up situations there could be a moment to reprocess data from last checkpoint. To reduce that amount we reduce checkpointing interval to 10 seconds.

Review Comment:
```suggestion
2. **Checkpointing** is at the heart of Flink's fault tolerance mechanism, periodically saving the state of each job to a durable storage (in our case it is GCS bucket). In an autoscaling scenario, checkpointing enables Flink to recover to a consistent state after scaling operations. When the system scales out (adds more resources) or scales in (removes resources), Flink can restore the state from these checkpoints, ensuring data integrity and processing continuity without losing critical information. In scale down or up situations there could be a moment to reprocess data from last checkpoint. To reduce that amount we reduce the checkpointing interval to 10 seconds.
```
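
For context, the knobs this section of the post describes map onto a handful of standard Flink configuration options. Below is a minimal sketch of what such a setup could look like; the GCS bucket path and the max-parallelism value are illustrative assumptions, not values taken from the post:

```yaml
# Illustrative Flink configuration for stateful autoscaling (values are assumptions).

# Upper bound on rescaling; keyed state is partitioned so it can be
# redistributed across any parallelism up to this limit.
pipeline.max-parallelism: 120

# Frequent checkpoints to durable storage shorten the window of data that
# must be reprocessed after a scale-up or scale-down.
execution.checkpointing.interval: 10s
state.backend: rocksdb
state.checkpoints.dir: gs://example-bucket/flink/checkpoints   # hypothetical bucket

# Adaptive Scheduler lets the running job absorb added or removed task slots.
jobmanager.scheduler: adaptive
```

With the Flink Kubernetes Operator, these options would typically be set under the `flinkConfiguration` section of the deployment spec.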
