Dennis-Mircea Ciupitu created FLINK-39618:
---------------------------------------------
Summary: FlinkDeployment deletion deadlocks when FlinkSessionJobs
are running with default block-on-* options
Key: FLINK-39618
URL: https://issues.apache.org/jira/browse/FLINK-39618
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.14.0
Reporter: Dennis-Mircea Ciupitu
Fix For: kubernetes-operator-1.15.0
h1. Summary
{{FlinkDeployment}} deletion deadlocks when {{FlinkSessionJob}}s are running
and both {{block-on-session-jobs}} and {{block-on-unmanaged-jobs}} are enabled
(the defaults).
h1. Symptom
After {{kubectl delete flinkdeployment <name>}} against a session-mode
deployment that still has {{FlinkSessionJob}} resources attached, the
deployment remains stuck indefinitely in {{LIFECYCLE STATE: DELETING}}. To
recover, the user must either manually cancel the underlying Flink jobs (e.g.
via the JobManager REST API) or flip a config flag and restart the operator.
h1. Reproduction
With the operator's default configuration (a consolidated {{kubectl}} sequence
is sketched after the list):
# Apply a session {{FlinkDeployment}} and one or more {{FlinkSessionJob}}
resources targeting it.
# Wait for the jobs to reach {{RUNNING}}.
# Run {{kubectl delete flinkdeployment <name>}} (without first deleting the
{{FlinkSessionJob}} CRs).
# Run {{kubectl delete flinksessionjob <names...>}} to satisfy the operator's
complaint about "session jobs should be deleted first".
# Observe: the {{FlinkSessionJob}} CRs are gone from the API server, but the
{{FlinkDeployment}} stays in {{DELETING}} forever, and the operator log keeps
emitting:
{noformat}
Event[Job] | Warning | CLEANUPFAILED | The session cluster has non terminated
jobs [<jobIds>] that should be cancelled first
{noformat}
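For convenience, the steps above as a single command sequence. This is a hedged
sketch: the manifest file names and resource names ({{session}}, {{job}}) are
placeholders, and the {{jsonpath}} expressions assume the CRDs' usual status
fields ({{status.jobStatus.state}}, {{status.lifecycleState}}).
{code:bash}
# Placeholder manifests: a session-mode FlinkDeployment named "session"
# and a FlinkSessionJob named "job" that targets it.
kubectl apply -f session-deployment.yaml
kubectl apply -f session-job.yaml

# Wait until the session job is RUNNING (status field assumed).
kubectl wait flinksessionjob/job \
  --for=jsonpath='{.status.jobStatus.state}'=RUNNING --timeout=5m

# Delete in the "wrong" order; --wait=false so the commands return
# instead of blocking on the stuck finalizer.
kubectl delete flinkdeployment session --wait=false
kubectl delete flinksessionjob job --wait=false

# The FlinkSessionJob disappears, but the FlinkDeployment never does:
kubectl get flinkdeployment session -o jsonpath='{.status.lifecycleState}'
# -> DELETING, forever
{code}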
h1. Root cause
{{SessionJobReconciler.cleanupInternal}} (introduced in FLINK-39271, 1.15)
takes a "skip cancellation when the cluster is being deleted" bypass on the
assumption that the cluster will tear itself down and the jobs will die with it:
{code:java}
if (sessionLifecycleState == ResourceLifecycleState.DELETING
        || sessionLifecycleState == ResourceLifecycleState.DELETED) {
    LOG.info("Session cluster is being deleted, skipping job cancellation");
    return DeleteControl.defaultDelete();
}
{code}
That assumption is invalidated by
{{kubernetes.operator.session.deletion.block-on-unmanaged-jobs}} (introduced
earlier in FLINK-28648, 1.13), which makes
{{SessionReconciler.cleanupInternal}} poll the JobManager REST API and refuse
to remove the {{FlinkDeployment}} finalizer while any non-terminal Flink jobs
exist on the cluster. Because {{SessionJobReconciler}} skipped the cancel,
those jobs are still running, so the finalizer is held forever. The session-job
CRs are already gone, so there is no controller left that will ever issue the
cancel. Both options default to {{true}}, so the deadlock occurs out of the box.
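For orientation, the shape of the blocking check on the session side. This is
a paraphrased sketch, not the verbatim operator source; the option constant,
service accessor, and interval names are approximations:
{code:java}
// Sketch of SessionReconciler.cleanupInternal's blocking branch (names approximate).
if (conf.get(BLOCK_ON_UNMANAGED_JOBS)) { // ...session.deletion.block-on-unmanaged-jobs
    // Poll the JobManager REST API for all jobs on the session cluster.
    boolean nonTerminalJobsExist =
            flinkService.listJobs(conf).stream()
                    .anyMatch(job -> !job.getJobState().isTerminalState());
    if (nonTerminalJobsExist) {
        // Keep the finalizer and retry later. Because SessionJobReconciler
        // skipped the cancel, this branch is taken on every reconcile, forever.
        return DeleteControl.noFinalizerRemoval().rescheduleAfter(retryInterval);
    }
}
{code}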
h1. Workarounds (until fixed)
* Cancel the jobs directly on the JM:
{code:bash}
kubectl port-forward svc/<rest-svc> 8081:8081
# List jobs to find the non-terminal <jobId>s:
curl http://localhost:8081/jobs/overview
# Cancel each one:
curl -X PATCH "http://localhost:8081/jobs/<jobId>?mode=cancel"
{code}
The next reconcile will then release the finalizer.
* Or set {{kubernetes.operator.session.deletion.block-on-unmanaged-jobs:
"false"}} in the operator ConfigMap and restart the operator pod (less
surgical, weakens an unrelated safety guard).
h1. Suggested fix
Gate the bypass on {{BLOCK_ON_SESSION_JOBS}} so that when the user has opted
into strong delete-ordering guarantees the Flink job is still cancelled
explicitly. This restores 1.14 behaviour for the deadlocking case while
preserving the optimization for users who have opted out.
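A minimal sketch of the gated bypass, assuming {{BLOCK_ON_SESSION_JOBS}} is the
{{ConfigOption}} constant backing the {{block-on-session-jobs}} key (the exact
constant and accessor names may differ from the codebase):
{code:java}
// Sketch: only take the bypass when the user has opted OUT of strict
// delete ordering; otherwise fall through and cancel the job explicitly,
// as 1.14 did.
boolean blockOnSessionJobs = conf.get(BLOCK_ON_SESSION_JOBS); // assumed constant
if (!blockOnSessionJobs
        && (sessionLifecycleState == ResourceLifecycleState.DELETING
                || sessionLifecycleState == ResourceLifecycleState.DELETED)) {
    LOG.info("Session cluster is being deleted, skipping job cancellation");
    return DeleteControl.defaultDelete();
}
// Fall through: cancel the running job, then release the finalizer.
{code}
With this gate in place, {{SessionReconciler}}'s unmanaged-jobs check
eventually observes no non-terminal jobs and releases the {{FlinkDeployment}}
finalizer.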
--
This message was sent by Atlassian Jira
(v8.20.10#820010)