[ https://issues.apache.org/jira/browse/FLINK-33324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777755#comment-17777755 ]

dongwoo.kim commented on FLINK-33324:
-------------------------------------

Hi, [~pnowojski] 



Thanks for your feedback.
First, about the code: I simply wrapped the main restore logic
[here|https://github.com/apache/flink/blob/72e302310ba55bb5f35966ed448243aae36e193e/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/BackendRestorerProcedure.java#L94]
in a Callable and combined it with future.get(timeout). Please keep in mind that
this was just an initial feasibility check, done without a deep dive into the
Flink code.
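
To make the idea concrete, the feasibility check was roughly shaped like the sketch below. This is only an illustration of the Callable + future.get(timeout) wrapping, not the actual change; the helper name, the timeout parameter and the plain Exception are placeholders rather than real Flink API.

{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Rough sketch of the feasibility check: run the restore attempt
// (the logic linked above in BackendRestorerProcedure) inside a Callable
// and bound it with future.get(timeout). All names here are placeholders.
final class RestoreWithTimeout {

    static <T> T runWithTimeout(Callable<T> restoreAttempt, long restoreTimeoutMs)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(restoreAttempt);
        try {
            // Fail fast instead of staying in INITIALIZING indefinitely.
            return future.get(restoreTimeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck restore attempt
            throw new Exception(
                    "Backend restore did not finish within " + restoreTimeoutMs + " ms", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
{code}

Hitting the timeout then surfaces as a restore failure, which is what the second screenshot attached to the issue shows.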

If manual human intervention is acceptable, solving this issue with an alert
system seems practical.
However, our goal for handling the failover loop was to automate operations
using the failure-rate restart strategy together with a CronJob that monitors
the Flink job's status.
Instead of adding complex conditions to the CronJob, treating an unusually long
restore operation as a failure simplifies our process.
That said, I understand from the feedback that this approach may be tailored to
our team's particular needs and might not be as helpful for everyone else.
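
For reference, the failure-rate restart strategy mentioned above is the standard Flink one; a minimal sketch of such a setup is below (the threshold and interval values are illustrative only, not our production settings).

{code:java}
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public final class FailureRateRestartSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Failure-rate restart strategy: give up once the job fails more than
        // 3 times within a 5-minute window, waiting 10 seconds between restarts.
        env.setRestartStrategy(RestartStrategies.failureRateRestart(
                3,                // max failures per interval
                Time.minutes(5),  // measurement interval
                Time.seconds(10)  // delay between restart attempts
        ));

        // ... job definition and env.execute() would follow here ...
    }
}
{code}

With a restore timeout in place, a hanging restore would also count as a failure here, so the CronJob only has to check whether the job has exceeded the failure rate instead of tracking how long it has been initializing.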

> Add flink managed timeout mechanism for backend restore operation
> -----------------------------------------------------------------
>
>                 Key: FLINK-33324
>                 URL: https://issues.apache.org/jira/browse/FLINK-33324
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / State Backends
>            Reporter: dongwoo.kim
>            Priority: Minor
>         Attachments: image-2023-10-20-15-16-53-324.png, 
> image-2023-10-20-17-42-11-504.png
>
>
> Hello community, I would like to share an issue our team recently faced and 
> propose a feature to mitigate similar problems in the future.
> h2. Issue
> Our Flink streaming job encountered consecutive checkpoint failures and 
> subsequently attempted a restart. 
> These failures occurred due to timeouts in two subtasks located within the 
> same task manager. 
> The restore operation for this particular task manager also got stuck, 
> leaving the job in an "initializing" state for over an hour. 
> Once we noticed the restore operation was hanging, we terminated the task 
> manager pod, which resolved the issue.
> !image-2023-10-20-15-16-53-324.png|width=683,height=604!
> The sequence of events was as follows:
> 1. Checkpoint timed out for subtasks within the task manager, referred to as 
> tm-32.
> 2. The Flink job failed and initiated a restart.
> 3. Restoration was successful for 282 subtasks, but got stuck for the 2 
> subtasks in tm-32.
> 4. While the Flink tasks were not yet fully in the running state, 
> checkpointing was still being triggered, leading to consecutive checkpoint 
> failures.
> 5. These checkpoint failures seemed to be ignored and did not count toward the 
> execution.checkpointing.tolerable-failed-checkpoints configuration. 
>      As a result, the job remained in the initialization phase for a very long 
> period.
> 6. Once we noticed this, we terminated the tm-32 pod, leading to a successful 
> Flink job restart.
> h2. Suggestion
> I feel that a Flink job remaining in the initializing state indefinitely is 
> not ideal. 
> To enhance resilience, I think it would be helpful to add a timeout feature 
> for the restore operation.
> If the restore operation exceeds a specified duration, an exception should be 
> thrown, causing the job to fail. 
> This way, we can address restore-related issues similarly to how we handle 
> checkpoint failures.
> h2. Notes
> Just to add, I've made a basic version of this feature to see if it works as 
> expected. 
> I've attached a picture from the Flink UI that shows the timeout exception 
> thrown during the restore operation. 
> It's just a start, but I hope it helps with our discussion. 
> (I simulated network chaos using the 
> [litmus|https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-network-latency/#destination-ips-and-destination-hosts]
>  chaos engineering tool.)
> !image-2023-10-20-17-42-11-504.png|width=940,height=317!
>  
> Thank you for considering my proposal. I'm looking forward to hearing your 
> thoughts. 
> If there's agreement on this, I'd be happy to work on implementing this 
> feature.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
