Dennis-Mircea Ciupitu created FLINK-39618:
---------------------------------------------

             Summary: FlinkDeployment deletion deadlocks when FlinkSessionJobs 
are running with default block-on-* options
                 Key: FLINK-39618
                 URL: https://issues.apache.org/jira/browse/FLINK-39618
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.15.0
            Reporter: Dennis-Mircea Ciupitu
             Fix For: kubernetes-operator-1.15.0


h1. Summary
{{FlinkDeployment}} deletion deadlocks when {{FlinkSessionJob}}s are running 
and both {{block-on-session-jobs}} and {{block-on-unmanaged-jobs}} are enabled 
(the defaults).

h1. Symptom

After {{kubectl delete flinkdeployment <name>}} against a session-mode 
deployment that has {{FlinkSessionJob}} resources attached, the deployment 
remains stuck indefinitely in lifecycle state {{DELETING}}; the user must 
manually cancel the underlying Flink jobs (e.g. via the JobManager REST API) or 
flip a config flag and restart the operator to recover.

h1. Reproduction

With the operator's default configuration:

# Apply a session {{FlinkDeployment}} and one or more {{FlinkSessionJob}} 
resources targeting it.
# Wait for the jobs to reach {{RUNNING}}.
# Run {{kubectl delete flinkdeployment <name>}} (without first deleting the 
{{FlinkSessionJob}} CRs).
# Run {{kubectl delete flinksessionjob <names...>}} to satisfy the operator's 
complaint about "session jobs should be deleted first".
# Observe: the {{FlinkSessionJob}} CRs are gone from the API server, but the 
{{FlinkDeployment}} stays in {{DELETING}} forever, and the operator log keeps 
emitting:

{noformat}
Event[Job] | Warning | CLEANUPFAILED | The session cluster has non terminated 
jobs [<jobIds>] that should be cancelled first
{noformat}

h1. Root cause

{{SessionJobReconciler.cleanupInternal}} (introduced in FLINK-39271, 1.15) 
takes a "skip cancellation when the cluster is being deleted" bypass on the 
assumption that the cluster will tear itself down and the jobs will die with it:

{code:java}
if (sessionLifecycleState == ResourceLifecycleState.DELETING
        || sessionLifecycleState == ResourceLifecycleState.DELETED) {
    LOG.info("Session cluster is being deleted, skipping job cancellation");
    return DeleteControl.defaultDelete();
}
{code}

That assumption is invalidated by 
{{kubernetes.operator.session.deletion.block-on-unmanaged-jobs}} (introduced 
earlier in FLINK-28648, 1.13), which makes 
{{SessionReconciler.cleanupInternal}} poll the JobManager REST API and refuse 
to remove the {{FlinkDeployment}} finalizer while any non-terminal Flink jobs 
exist on the cluster. Because {{SessionJobReconciler}} skipped the cancel, 
those jobs are still running, so the finalizer is held forever. The session-job 
CRs are already gone, so there is no controller left that will ever issue the 
cancel. Both options default to {{true}}, so the deadlock occurs out of the box.
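
For context, the blocking side looks roughly like the following. This is a 
paraphrased sketch of {{SessionReconciler.cleanupInternal}}, not a verbatim 
quote; {{listNonTerminalJobs}}, {{observeConfig}} and {{reconcileInterval}} are 
stand-ins for whatever names the actual code uses:

{code:java}
// Paraphrased sketch, not the actual source. With block-on-unmanaged-jobs
// enabled, the reconciler polls the JobManager REST API and refuses to release
// the FlinkDeployment finalizer while any job on the cluster is non-terminal.
var nonTerminalJobs = listNonTerminalJobs(flinkService, observeConfig); // hypothetical helper
if (!nonTerminalJobs.isEmpty()) {
    // This is the CLEANUPFAILED warning seen in the operator log. The
    // FlinkSessionJob CRs are already gone, so no controller will ever cancel
    // these jobs, and every subsequent reconcile lands here again: the deadlock.
    return DeleteControl.noFinalizerRemoval().rescheduleAfter(reconcileInterval);
}
{code}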

h1. Workarounds (until fixed)

* Cancel the jobs directly on the JobManager:
{code:bash}
kubectl port-forward svc/<rest-svc> 8081:8081
curl -X PATCH "http://localhost:8081/jobs/<jobId>?mode=cancel"
{code}
The next reconcile will then release the finalizer.

* Or set {{kubernetes.operator.session.deletion.block-on-unmanaged-jobs: 
"false"}} in the operator ConfigMap and restart the operator pod (less 
surgical: it disables the unmanaged-jobs safety guard for every session cluster 
the operator manages).

h1. Suggested fix

Gate the bypass on {{BLOCK_ON_SESSION_JOBS}} so that when the user has opted 
into strong delete-ordering guarantees the Flink job is still cancelled 
explicitly. This restores 1.14 behaviour for the deadlocking case while 
preserving the optimization for users who have opted out.
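
A minimal sketch of the gated check, assuming the flag can be read from the 
observe config as {{BLOCK_ON_SESSION_JOBS}} (the exact constant and accessor 
are guesses, not the final patch):

{code:java}
// Only take the "cluster is going away anyway" shortcut when the user has NOT
// asked deletion to block on session jobs; otherwise cancel explicitly, as 1.14 did.
if (!observeConfig.get(KubernetesOperatorConfigOptions.BLOCK_ON_SESSION_JOBS)
        && (sessionLifecycleState == ResourceLifecycleState.DELETING
                || sessionLifecycleState == ResourceLifecycleState.DELETED)) {
    LOG.info("Session cluster is being deleted, skipping job cancellation");
    return DeleteControl.defaultDelete();
}
// Fall through to the normal cleanup path, which cancels the Flink job before
// releasing the FlinkSessionJob finalizer.
{code}

With the gate in place, the session cluster's unmanaged-jobs check only ever 
waits on genuinely unmanaged jobs, so the {{FlinkDeployment}} finalizer can be 
released.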



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
