[
https://issues.apache.org/jira/browse/FLINK-29566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xin Hao updated FLINK-29566:
----------------------------
Description:
Currently, when a FlinkSessionJob object is removed,
the object is always deleted, even if the Flink job was not canceled
successfully.
This is *semantically inconsistent*: the FlinkSessionJob has been removed but
the Flink job is still running.
One such scenario is a FlinkDeployment deployed in HA mode,
where we remove the FlinkSessionJob and change the FlinkDeployment at the same
time,
or where the TMs are restarting because of a problem such as OOM.
In both cases the cancelation of the Flink job fails because the
TMs are not available.
We should *reschedule* the cleanup logic if the FlinkDeployment is present.
We can also add a new ReconciliationState, DELETING, to indicate the
FlinkSessionJob's status.
The logic would be:
{code:java}
if the FlinkDeployment is not present
    delete the FlinkSessionJob object
else
    if the JM is not available
        reschedule
    else
        if cancel job successfully
            delete the FlinkSessionJob object
        else
            reschedule
{code}
When we cancel the Flink job, we need to verify that all jobs with the same name
have been deleted, in case the job ID changed after the JM restarted.
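The decision logic above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class SessionJobCleanupSketch, the Action enum, and the decide method are hypothetical names and are not part of the Flink Kubernetes Operator. The inputs are plain values standing in for whatever the operator would actually observe from the cluster.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these names come from the Flink Kubernetes Operator.
class SessionJobCleanupSketch {

    // Outcome of one cleanup attempt for a FlinkSessionJob.
    enum Action { DELETE_RESOURCE, RESCHEDULE }

    /**
     * Decides what to do with the FlinkSessionJob object during cleanup.
     *
     * @param deploymentPresent   whether the FlinkDeployment still exists
     * @param jobManagerAvailable whether the JM can be reached
     * @param cancelSucceeded     whether the cancel request was reported as successful
     * @param runningJobNames     names of jobs still running after the cancel attempt
     * @param jobName             name of the job owned by the FlinkSessionJob
     */
    static Action decide(boolean deploymentPresent,
                         boolean jobManagerAvailable,
                         boolean cancelSucceeded,
                         List<String> runningJobNames,
                         String jobName) {
        if (!deploymentPresent) {
            // No session cluster left, so there is nothing to cancel.
            return Action.DELETE_RESOURCE;
        }
        if (!jobManagerAvailable || !cancelSucceeded) {
            // Keep the object and retry the cleanup later.
            return Action.RESCHEDULE;
        }
        // Check by name rather than by job ID, since the ID may have
        // changed after a JM restart.
        return runningJobNames.contains(jobName)
                ? Action.RESCHEDULE
                : Action.DELETE_RESOURCE;
    }

    public static void main(String[] args) {
        // JM unreachable while the deployment is still present.
        System.out.println(decide(true, false, false,
                Collections.<String>emptyList(), "wordcount"));
        // Cancel reported success, but a job with the same name is still running.
        System.out.println(decide(true, true, true,
                Arrays.asList("wordcount"), "wordcount"));
    }
}
```

In this sketch a RESCHEDULE result would correspond to keeping the FlinkSessionJob in the proposed DELETING state until a later attempt returns DELETE_RESOURCE.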
was:
Currently, when a FlinkSessionJob object is removed,
the object is always deleted, even if the Flink job was not canceled
successfully.
This is semantically inconsistent: the FlinkSessionJob has been removed but the
Flink job is still running.
One such scenario is a FlinkDeployment deployed in HA mode,
where we remove the FlinkSessionJob and change the FlinkDeployment at the same
time,
or where the TMs are restarting because of a problem such as OOM.
In both cases the cancelation of the Flink job fails because the
TMs are not available.
We should reschedule the cleanup logic if the FlinkDeployment is present.
We can also add a new ReconciliationState, DELETING, to indicate the
FlinkSessionJob's status.
> Reschedule the cleanup logic if cancel job failed
> -------------------------------------------------
>
> Key: FLINK-29566
> URL: https://issues.apache.org/jira/browse/FLINK-29566
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Reporter: Xin Hao
> Priority: Minor