[
https://issues.apache.org/jira/browse/FLINK-36717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyula Fora reassigned FLINK-36717:
----------------------------------
Assignee: Swapna Marru (was: Maximilian Michels)
> Add health check to detect tasks stuck in DEPLOYING state
> ---------------------------------------------------------
>
> Key: FLINK-36717
> URL: https://issues.apache.org/jira/browse/FLINK-36717
> Project: Flink
> Issue Type: New Feature
> Components: Kubernetes Operator
> Reporter: Maximilian Michels
> Assignee: Swapna Marru
> Priority: Major
>
> The operator has an opt-in feature for monitoring Flink cluster health.
> To enable it, set kubernetes.operator.cluster.health-check.enabled:
> true.
> If enabled, the ClusterHealthObserver, triggered by the
> ApplicationReconciler, collects various health-related metrics from the Flink
> cluster, such as the number of restarts, the last restart timestamp, the
> number of completed checkpoints, and the last completed checkpoint timestamp.
> The ClusterHealthEvaluator then analyzes this information to determine
> whether the Flink cluster is healthy or not.
> Recently, users have reported an issue where tasks on some TaskManagers get
> stuck in the DEPLOYING state because a faulty network connection causes
> extremely slow TCP reads while fetching the user jar from S3. Restarting the
> TaskManager pods resolves this issue.
> The goal of this ticket is to add a feature to the operator to automatically
> restart TaskManagers that have tasks stuck in the DEPLOYING state. To achieve
> this, we can monitor how long tasks remain in the DEPLOYING state and decide
> to restart the TaskManagers after a configured timeout. We must be careful to
> ensure that we don't include jobs with large state restores, which can take a
> long time. Fortunately, a task is in the INITIALIZING state during state
> restoration, which makes it easy to distinguish from DEPLOYING, when the
> task is still being set up.
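
The timeout check described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the operator's implementation: DeployingWatchdog, TaskSnapshot, and observe are hypothetical names, and the sketch assumes the observer can obtain, on each pass, a list of tasks with their current state and hosting TaskManager. Only DEPLOYING starts or continues a timer, so a task that has moved on to INITIALIZING (state restore) is never flagged.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: tracks how long tasks stay in DEPLOYING. Not operator code. */
final class DeployingWatchdog {

    /** Assumed minimal view of one task as reported by the cluster. */
    static final class TaskSnapshot {
        final String taskId;
        final String taskManagerId;
        final String state;

        TaskSnapshot(String taskId, String taskManagerId, String state) {
            this.taskId = taskId;
            this.taskManagerId = taskManagerId;
            this.state = state;
        }
    }

    private final Duration timeout;
    // Time at which each task was first observed in DEPLOYING (current attempt only).
    private final Map<String, Instant> deployingSince = new HashMap<>();

    DeployingWatchdog(Duration timeout) {
        this.timeout = timeout;
    }

    /**
     * Records one observation pass and returns the TaskManagers hosting at
     * least one task that has been in DEPLOYING for the timeout or longer.
     */
    Set<String> observe(List<TaskSnapshot> tasks, Instant now) {
        Set<String> stillDeploying = new HashSet<>();
        Set<String> stuckTaskManagers = new TreeSet<>();
        for (TaskSnapshot task : tasks) {
            if (!"DEPLOYING".equals(task.state)) {
                continue; // INITIALIZING, RUNNING, etc. never count as stuck
            }
            stillDeploying.add(task.taskId);
            Instant previous = deployingSince.putIfAbsent(task.taskId, now);
            Instant since = previous == null ? now : previous;
            if (Duration.between(since, now).compareTo(timeout) >= 0) {
                stuckTaskManagers.add(task.taskManagerId);
            }
        }
        // Forget tasks that left DEPLOYING or disappeared, so their timers reset.
        deployingSince.keySet().retainAll(stillDeploying);
        return stuckTaskManagers;
    }
}
```

In this sketch the caller would invoke observe once per health-check pass and restart the pods of any returned TaskManagers. The timer is based on when the watchdog first saw the task in DEPLOYING, so the effective delay also depends on how often the check runs.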
--
This message was sent by Atlassian Jira
(v8.20.10#820010)