[
https://issues.apache.org/jira/browse/FLINK-33324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17777641#comment-17777641
]
Piotr Nowojski commented on FLINK-33324:
----------------------------------------
Could you share the code for the change, so we can see how complicated it is?
This issue might also be handled via monitoring and alerts in whatever system
you are using to monitor Flink's state/metrics. You could simply implement an
alert that fires if a job is stuck in initializing for too long. That alert
would require manual action from a human, the same way you would respond to a
job stuck in a failover loop (I would guess you already have alerting/monitoring
for that). A narrow advantage of the timeout is that Flink could retry the
restore once more if this is an intermittent issue caused by a silently broken
connection.
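For reference, an external check along those lines could look roughly like the
sketch below: it polls the Flink REST API and alerts when subtasks stay in
INITIALIZING/DEPLOYING beyond a threshold. The JobManager address, polling
interval, threshold, and the exact per-state task-count fields of the
/jobs/<job-id> response are assumptions for illustration, not something
prescribed by this issue.
{code:java}
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;

public class StuckInitializingAlert {

    private static final String JOB_MANAGER_URL = "http://jobmanager:8081"; // hypothetical address
    private static final Duration MAX_INITIALIZING = Duration.ofMinutes(10); // example threshold
    private static final Duration POLL_INTERVAL = Duration.ofSeconds(30);    // example interval

    public static void main(String[] args) throws Exception {
        String jobId = args[0];
        HttpClient client = HttpClient.newHttpClient();
        ObjectMapper mapper = new ObjectMapper();

        Instant stuckSince = null;
        while (true) {
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create(JOB_MANAGER_URL + "/jobs/" + jobId)).GET().build();
            JsonNode job = mapper.readTree(
                    client.send(request, HttpResponse.BodyHandlers.ofString()).body());

            // Sum subtasks that have not reached RUNNING yet across all vertices.
            int notRunning = 0;
            for (JsonNode vertex : job.path("vertices")) {
                JsonNode tasks = vertex.path("tasks");
                notRunning += tasks.path("INITIALIZING").asInt(0)
                        + tasks.path("DEPLOYING").asInt(0);
            }

            if (notRunning > 0) {
                if (stuckSince == null) {
                    stuckSince = Instant.now();
                } else if (Duration.between(stuckSince, Instant.now())
                        .compareTo(MAX_INITIALIZING) > 0) {
                    // Hook up a real notification channel here (pager, chat, ...).
                    System.err.println("ALERT: " + notRunning
                            + " subtask(s) not RUNNING for over " + MAX_INITIALIZING);
                }
            } else {
                stuckSince = null;
            }
            Thread.sleep(POLL_INTERVAL.toMillis());
        }
    }
}
{code}
Whether such an alert pages a human or triggers an automated restart is then an
operational choice.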
> Add flink managed timeout mechanism for backend restore operation
> -----------------------------------------------------------------
>
> Key: FLINK-33324
> URL: https://issues.apache.org/jira/browse/FLINK-33324
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing, Runtime / State Backends
> Reporter: dongwoo.kim
> Priority: Minor
> Attachments: image-2023-10-20-15-16-53-324.png,
> image-2023-10-20-17-42-11-504.png
>
>
> Hello community, I would like to share an issue our team recently faced and
> propose a feature to mitigate similar problems in the future.
> h2. Issue
> Our Flink streaming job encountered consecutive checkpoint failures and
> subsequently attempted a restart.
> This failure occurred due to timeouts in two subtasks located within the same
> task manager.
> The restore operation for this particular task manager also got stuck,
> resulting in an "initializing" state lasting over an hour.
> Once we realized the restore operation was hanging, we terminated the task
> manager pod, which resolved the issue.
> !image-2023-10-20-15-16-53-324.png|width=683,height=604!
> The sequence of events was as follows:
> 1. Checkpoints timed out for subtasks within one task manager, referred to as
> tm-32.
> 2. The Flink job failed and initiated a restart.
> 3. Restoration was successful for 282 subtasks, but got stuck for the 2
> subtasks in tm-32.
> 4. While the Flink tasks weren't all in the running state yet, checkpointing
> was still being triggered, leading to consecutive checkpoint failures.
> 5. These checkpoint failures seemed to be ignored and did not count toward the
> execution.checkpointing.tolerable-failed-checkpoints limit (a short example of
> how this limit is usually configured follows this list).
> As a result, the job remained in the initialization phase for a very long
> period.
> 6. Once we found this, we terminated the tm-32 pod, leading to a successful
> Flink job restart.
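> For context, the limit mentioned in step 5 is the one that is normally
> configured like this (the interval and threshold values below are examples
> only):
> {code:java}
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>
> public class CheckpointToleranceExample {
>     public static void main(String[] args) {
>         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>         // Checkpoint every 60 seconds (example interval).
>         env.enableCheckpointing(60_000);
>         // Tolerate up to 3 consecutive checkpoint failures before the job is failed;
>         // equivalent to execution.checkpointing.tolerable-failed-checkpoints: 3.
>         env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
>     }
> }
> {code}
> In our case the checkpoints triggered while the two subtasks were still
> initializing failed, but those failures did not increase this counter, so the
> threshold was never reached.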
> h2. Suggestion
> I feel that a Flink job remaining in the initializing state indefinitely is
> not ideal.
> To enhance resilience, I think it would be helpful to add a timeout feature
> for the restore operation.
> If the restore operation exceeds a specified duration, an exception should be
> thrown, causing the job to fail.
> This way, we can address restore-related issues similarly to how we handle
> checkpoint failures.
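> To illustrate the general idea, a minimal sketch could look like the
> following. This is only an illustration of the proposal, not the attached
> prototype; the class and method names are hypothetical and do not correspond
> to Flink's actual restore code path, and the timeout value would come from a
> new configuration option.
> {code:java}
> import java.time.Duration;
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.TimeoutException;
>
> public class RestoreWithTimeout {
>
>     /**
>      * Runs the given restore action and fails if it does not complete within
>      * the timeout, so the regular failover handling can kick in instead of the
>      * task hanging in INITIALIZING forever.
>      */
>     public static void restoreOrFail(Runnable restoreAction, Duration timeout) throws Exception {
>         ExecutorService executor = Executors.newSingleThreadExecutor();
>         try {
>             CompletableFuture<Void> restore = CompletableFuture.runAsync(restoreAction, executor);
>             restore.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
>         } catch (TimeoutException e) {
>             throw new Exception("State backend restore did not finish within " + timeout, e);
>         } finally {
>             executor.shutdownNow();
>         }
>     }
> }
> {code}
> In a real implementation the timeout exception would surface through the
> task's normal failure handling, so the failure is handled similarly to a
> checkpoint failure, as described above.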
> h2. Notes
> Just to add, I've made a basic version of this feature to see if it works as
> expected.
> I've attached a screenshot from the Flink UI that shows the timeout exception
> thrown during the restore operation.
> It's just a start, but I hope it helps with our discussion.
> (I simulated network chaos using the
> [litmus|https://litmuschaos.github.io/litmus/experiments/categories/pods/pod-network-latency/#destination-ips-and-destination-hosts]
> chaos engineering tool.)
> !image-2023-10-20-17-42-11-504.png|width=940,height=317!
>
> Thank you for considering my proposal. I'm looking forward to hearing your
> thoughts.
> If there's agreement on this, I'd be happy to work on implementing this
> feature.