Xin Hao created FLINK-29566:
-------------------------------
Summary: Reschedule the cleanup logic if cancel job failed
Key: FLINK-29566
URL: https://issues.apache.org/jira/browse/FLINK-29566
Project: Flink
Issue Type: Improvement
Components: Kubernetes Operator
Reporter: Xin Hao
Currently, when a FlinkSessionJob object is deleted,
we always remove the object, even if the Flink job was not canceled
successfully.
This is semantically inconsistent: the FlinkSessionJob has been removed but the
Flink job is still running.
One such scenario arises when a FlinkDeployment is deployed in HA mode.
If we remove the FlinkSessionJob and change the FlinkDeployment at the same
time,
or if the TMs are restarting because of a bug such as an OOM,
the cancellation of the Flink job fails because the
TMs are not available.
We should reschedule the cleanup logic if the FlinkDeployment is present.
We could also add a new ReconciliationState, DELETING, to indicate the
FlinkSessionJob's status.
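
A minimal sketch of the proposed decision, to make the intended behavior concrete. It is not the operator's actual code: the class CleanupSketch, the CleanupAction enum, the decide method, and the 10 second retry delay are all hypothetical names and values chosen for illustration. Only DELETING comes from the proposal above.

```java
import java.time.Duration;

class CleanupSketch {

    /** Hypothetical outcome of one cleanup attempt for a FlinkSessionJob. */
    enum CleanupAction {
        /** Cleanup is done (or cannot be retried): let the object be removed. */
        REMOVE_OBJECT,
        /** Cancel failed but the session cluster still exists: keep the object and retry. */
        RESCHEDULE
    }

    /** Assumed retry interval; the issue does not specify one. */
    static final Duration RETRY_DELAY = Duration.ofSeconds(10);

    /**
     * Decides what to do after a cancel attempt.
     *
     * @param deploymentPresent whether the FlinkDeployment still exists
     * @param cancelSucceeded   whether the Flink job was canceled successfully
     */
    static CleanupAction decide(boolean deploymentPresent, boolean cancelSucceeded) {
        if (cancelSucceeded) {
            // The job is gone, so removing the object is consistent.
            return CleanupAction.REMOVE_OBJECT;
        }
        if (!deploymentPresent) {
            // No cluster left to talk to, so retrying the cancel cannot succeed.
            return CleanupAction.REMOVE_OBJECT;
        }
        // Cancel failed (for example, TMs unavailable) and the cluster exists:
        // keep the object, mark it DELETING, and try again after RETRY_DELAY.
        return CleanupAction.RESCHEDULE;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true));
        System.out.println(decide(true, false));
        System.out.println(decide(false, false));
    }
}
```

While the action is RESCHEDULE, the FlinkSessionJob would stay visible with the proposed DELETING state, so users can see that the Flink job may still be running.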
--
This message was sent by Atlassian Jira
(v8.20.10#820010)