[ https://issues.apache.org/jira/browse/FLINK-29566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xin Hao updated FLINK-29566:
----------------------------
    Description: 
Currently, when we remove the FlinkSessionJob object,

we always remove it, even if the Flink job was not canceled 
successfully.

 

This is *not semantically consistent*: the FlinkSessionJob has been removed but 
the Flink job is still running.

 

One scenario is a FlinkDeployment deployed in HA mode.

If we remove the FlinkSessionJob and change the FlinkDeployment at the same 
time,

or if the TMs are restarting because of a bug such as an OOM,

the cancelation of the Flink job will fail because the 
TMs are not available.

 

We should *reschedule* the cleanup logic if the FlinkDeployment is present.

We can also add a new ReconciliationState, DELETING, to indicate the 
FlinkSessionJob's status.

 

The logic will be:


{code:java}
if the FlinkDeployment is not present
    delete the FlinkSessionJob object
else
    if the JM is not available
        reschedule
    else
        if the job is canceled successfully
            delete the FlinkSessionJob object
        else
            reschedule{code}
When we cancel the Flink job, we need to verify that all jobs with the same name 
have been deleted, in case the job ID changed after a JM restart.
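
The decision logic and the name-based check above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class SessionJobCleanupSketch, the Action enum, the JobInfo type and both method names are hypothetical and are not part of the operator's existing API, and "deleted" is modeled here as a simple terminated flag per job.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the proposed cleanup decision; not the operator's real API. */
final class SessionJobCleanupSketch {

    /** What the cleanup step should do with the FlinkSessionJob resource. */
    enum Action { DELETE_RESOURCE, RESCHEDULE }

    /** Minimal stand-in (assumption) for a job as listed by the JM. */
    static final class JobInfo {
        final String id;
        final String name;
        final boolean terminated;

        JobInfo(String id, String name, boolean terminated) {
            this.id = id;
            this.name = name;
            this.terminated = terminated;
        }
    }

    /** Mirrors the pseudocode: delete only when nothing can still be running. */
    static Action decide(boolean deploymentPresent, boolean jmAvailable, boolean cancelVerified) {
        if (!deploymentPresent) {
            // The session cluster is gone, so there is no job left to cancel.
            return Action.DELETE_RESOURCE;
        }
        if (!jmAvailable) {
            // The job cannot be canceled right now; try again later.
            return Action.RESCHEDULE;
        }
        return cancelVerified ? Action.DELETE_RESOURCE : Action.RESCHEDULE;
    }

    /** True when no job with the given name is still active, whatever its job ID. */
    static boolean noActiveJobNamed(List<JobInfo> jobs, String jobName) {
        return jobs.stream()
                .filter(job -> job.name.equals(jobName))
                .allMatch(job -> job.terminated);
    }

    public static void main(String[] args) {
        // Same name under two job IDs, as could happen after a JM restart.
        List<JobInfo> jobs = Arrays.asList(
                new JobInfo("a1", "wordcount", true),
                new JobInfo("b2", "wordcount", false));
        boolean verified = noActiveJobNamed(jobs, "wordcount");
        System.out.println(decide(true, true, verified));
    }
}
```

In the example, one job named wordcount is still active under a different job ID, so the check fails and the cleanup is rescheduled instead of deleting the FlinkSessionJob object.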

 

 

  was:
Currently, when we remove the FlinkSessionJob object,

we always remove it, even if the Flink job was not canceled 
successfully.

 

This is not semantically consistent: the FlinkSessionJob has been removed but the 
Flink job is still running.

 

One scenario is a FlinkDeployment deployed in HA mode.

If we remove the FlinkSessionJob and change the FlinkDeployment at the same 
time,

or if the TMs are restarting because of a bug such as an OOM,

the cancelation of the Flink job will fail because the 
TMs are not available.

 

We should reschedule the cleanup logic if the FlinkDeployment is present.

We can also add a new ReconciliationState, DELETING, to indicate the 
FlinkSessionJob's status.


> Reschedule the cleanup logic if cancel job failed
> -------------------------------------------------
>
>                 Key: FLINK-29566
>                 URL: https://issues.apache.org/jira/browse/FLINK-29566
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Xin Hao
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
