[ 
https://issues.apache.org/jira/browse/FLINK-33324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777755#comment-17777755
 ] 

dongwoo.kim edited comment on FLINK-33324 at 10/20/23 1:25 PM:
---------------------------------------------------------------

Hi, [~pnowojski] 

Thanks for your feedback. 
First, about the code: I simply wrapped the main restore logic 
[here|https://github.com/apache/flink/blob/72e302310ba55bb5f35966ed448243aae36e193e/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/BackendRestorerProcedure.java#L94]
 in a Callable and combined it with future.get(timeout). Please note that this 
was only an initial feasibility check, done without a deep dive into the Flink 
code.
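
To make the idea concrete, here is a minimal sketch of that wrapping (the class 
and method names are placeholders of my own; the actual prototype puts this 
inside BackendRestorerProcedure rather than a standalone helper):
{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.flink.util.FlinkRuntimeException;

public final class RestoreTimeoutSketch {

    /**
     * Runs the given restore logic (conceptually, the main restore logic linked
     * above) and fails if it does not finish within the given timeout.
     */
    public static <T> T callWithTimeout(Callable<T> restoreLogic, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<T> future = executor.submit(restoreLogic);
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Surface the hang as a hard failure instead of staying in INITIALIZING forever.
            throw new FlinkRuntimeException("State backend restore timed out", e);
        } catch (Exception e) {
            throw new FlinkRuntimeException("State backend restore failed", e);
        } finally {
            executor.shutdownNow(); // interrupts a stuck restore attempt, if any
        }
    }
}
{code}
The only point is that future.get(timeout) turns a hanging restore into a 
TimeoutException, which then fails the task like any other restore error.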

When manual human intervention is an option, I agree that solving this issue 
with an alerting system seems practical. 
However, our goal for handling the failover loop was to automate operations 
using the failure-rate restart strategy and a CronJob that monitors the Flink 
job's status. 
Instead of adding ambiguous conditions to the CronJob, treating an unusually 
long restore operation as a failure simplifies our process. 
That said, I understand from your feedback that this approach may be more 
suited to our team's specific needs and might not be as helpful for everyone 
else.
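
For context, this is roughly how the failure-rate restart strategy side of that 
setup looks when configured programmatically (the thresholds below are 
illustrative, not our actual values):
{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FailureRateRestartSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Give up (fail the job) once more than 3 failures occur within a
        // 5 minute window, waiting 10 seconds between restart attempts.
        env.setRestartStrategy(
                RestartStrategies.failureRateRestart(
                        3,                               // max failures per interval
                        Time.of(5, TimeUnit.MINUTES),    // failure-rate measurement interval
                        Time.of(10, TimeUnit.SECONDS))); // delay between restart attempts
    }
}
{code}
With the job failing outright once the rate is exceeded, the CronJob only needs 
to check the job status and resubmit, rather than guessing whether a long 
initializing phase is stuck.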



> Add flink managed timeout mechanism for backend restore operation
> -----------------------------------------------------------------
>
>                 Key: FLINK-33324
>                 URL: https://issues.apache.org/jira/browse/FLINK-33324
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / State Backends
>            Reporter: dongwoo.kim
>            Priority: Minor
>         Attachments: image-2023-10-20-15-16-53-324.png, 
> image-2023-10-20-17-42-11-504.png
>
>
> Hello community, I would like to share an issue our team recently faced and 
> propose a feature to mitigate similar problems in the future.
> h2. Issue
> Our Flink streaming job encountered consecutive checkpoint failures and 
> subsequently attempted a restart. 
> This failure occurred due to timeouts in two subtasks located within the same 
> task manager. 
> The restore operation for this particular task manager also got stuck, 
> resulting in an "initializing" state lasting over an hour. 
> Once we realized the hang during the restore operation, we terminated the 
> task manager pod, resolving the issue.
> !image-2023-10-20-15-16-53-324.png|width=683,height=604!
> The sequence of events was as follows:
> 1. Checkpoints timed out for subtasks within one task manager, referred to as 
> tm-32.
> 2. The Flink job failed and initiated a restart.
> 3. Restoration was successful for 282 subtasks, but got stuck for the 2 
> subtasks in tm-32.
> 4. While the Flink tasks weren't yet fully in the RUNNING state, checkpointing 
> was still being triggered, leading to consecutive checkpoint failures.
> 5. These checkpoint failures seemed to be ignored and did not count toward the 
> execution.checkpointing.tolerable-failed-checkpoints configuration (see the 
> sketch after this list). 
>      As a result, the job remained in the initialization phase for a very long 
> period.
> 6. Once we found this, we terminated the tm-32 pod, leading to a successful 
> Flink job restart.
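> For reference, a minimal sketch of how that limit is configured 
> programmatically (the value is illustrative; the observed problem was that 
> checkpoint failures during restore did not count toward it):
> {code:java}
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
> public class TolerableCheckpointFailuresSketch {
>     public static void main(String[] args) {
>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>         env.enableCheckpointing(60_000); // checkpoint every 60 seconds
>
>         // Programmatic equivalent of
>         // execution.checkpointing.tolerable-failed-checkpoints: 3
>         env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
>     }
> }
> {code}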
> h2. Suggestion
> I feel that a Flink job remaining in the initializing state indefinitely is 
> not ideal. 
> To enhance resilience, I think it would be helpful to add a timeout feature 
> for the restore operation. 
> If the restore operation exceeds a specified duration, an exception should be 
> thrown, causing the job to fail. 
> This way, we can address restore-related issues similarly to how we handle 
> checkpoint failures.
> h2. Notes
> Just to add, I've made a basic version of this feature to see if it works as 
> expected. 
> I've attached a picture from the Flink UI that shows the timeout exception 
> thrown during the restore operation. 
> It's just a start, but I hope it helps with our discussion. 
> (I simulated network chaos using the 
> [litmus|https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-network-latency/#destination-ips-and-destination-hosts]
>  chaos engineering tool.)
> !image-2023-10-20-17-42-11-504.png|width=940,height=317!
>  
> Thank you for considering my proposal. I'm looking forward to hearing your 
> thoughts. 
> If there's agreement on this, I'd be happy to work on implementing this 
> feature.



