[
https://issues.apache.org/jira/browse/FLINK-30444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651568#comment-17651568
]
David Morávek commented on FLINK-30444:
---------------------------------------
> I think the JobManager shutting down is inconsistent with the
> execution.shutdown-on-application-finish configuration.
Right now, it depends on when the fatal error happens. Suppose the _fatal error
handler_ catches it in the dispatcher. In that case, we might be limited in
what we can do, because we can no longer guarantee that the APIs are accessible
or returning correct results. There might be cases where we can do something,
as with savepoint recovery, because the error is only tied to a particular job
and not the whole cluster.
Another consideration is that _execution.shutdown-on-application-finish_
should only take effect once the job reaches a terminal state (FAILED,
FINISHED, CANCELLED), which might not be the case when we encounter a fatal
error (which usually leads to a process restart and the recovery mechanism
kicking in).
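To make the terminal-state gating concrete, here is a minimal standalone sketch. The `JobStatus` enum and `shouldShutDown` helper are hypothetical illustrations modelled on the states named above; this is not Flink's actual dispatcher code.

```java
// Hypothetical sketch: gate shutdown-on-application-finish on a globally
// terminal job state. Standalone illustration, not Flink code.
public class ShutdownGatingSketch {

    // Modelled on the states mentioned in the comment above.
    enum JobStatus {
        RUNNING, RESTARTING, FAILED, FINISHED, CANCELLED;

        boolean isGloballyTerminal() {
            return this == FAILED || this == FINISHED || this == CANCELLED;
        }
    }

    // Only honour the config once the job has actually reached a terminal
    // state; a fatal error can leave the job non-terminal, in which case the
    // process restart and recovery mechanism take over instead.
    static boolean shouldShutDown(boolean shutdownOnFinish, JobStatus status) {
        return shutdownOnFinish && status.isGloballyTerminal();
    }
}
```

Under this sketch, a fatal error that leaves the job in RESTARTING would never trigger the shutdown path, regardless of the config value.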
> All other fatal job errors simply leave the jobmanager there.
It might be a bug if the job doesn't reach the terminal state.
I'm off for the holidays. I'll give this more thought next year.
> State recovery error not handled correctly and always causes JM failure
> -----------------------------------------------------------------------
>
> Key: FLINK-30444
> URL: https://issues.apache.org/jira/browse/FLINK-30444
> Project: Flink
> Issue Type: Bug
> Components: Client / Job Submission
> Affects Versions: 1.16.0, 1.14.6, 1.15.3
> Reporter: Gyula Fora
> Assignee: David Morávek
> Priority: Critical
>
> When you submit a job in Application mode and you try to restore from an
> incompatible savepoint, there is a very unexpected behaviour.
> Even with the following config:
> {noformat}
> execution.shutdown-on-application-finish: false
> execution.submit-failed-job-on-application-error: true{noformat}
> The job goes into a FAILED state, and the JobManager fails. In a Kubernetes
> environment (when using the native Kubernetes integration), this means that
> the JobManager is restarted automatically.
> This means that if you have the JobResultStore enabled, you will end up with
> an empty application cluster after the JM comes back.
> I think the correct behaviour would be, depending on the above-mentioned config:
> 1. If there is a job recovery error and
> execution.submit-failed-job-on-application-error is configured, the job
> should show up as FAILED and the JM should not exit (if
> execution.shutdown-on-application-finish is false).
> 2. If execution.shutdown-on-application-finish is true, the JobManager
> should exit cleanly, as on a normal job terminal state, and thus stop the
> deployment in Kubernetes, preventing a JM restart cycle.
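The two desired outcomes in the quoted description can be sketched as a small decision function; the `Action` enum and `onRecoveryError` helper below are hypothetical pseudologic for illustration, not Flink's dispatcher code.

```java
// Hypothetical sketch of the reporter's desired handling of a savepoint
// recovery error, driven by the two config options under discussion.
public class RecoveryErrorPolicySketch {

    enum Action {
        SUBMIT_FAILED_JOB_AND_KEEP_JM, // case 1: job shows as FAILED, JM stays up
        SHUT_DOWN_CLEANLY,             // case 2: clean exit stops the deployment
        PROPAGATE_FATAL_ERROR          // fallback: today's behaviour
    }

    static Action onRecoveryError(boolean submitFailedJobOnError,
                                  boolean shutdownOnFinish) {
        if (shutdownOnFinish) {
            // Exit cleanly as on a normal terminal state, so Kubernetes stops
            // the deployment instead of restarting the JM in a loop.
            return Action.SHUT_DOWN_CLEANLY;
        }
        if (submitFailedJobOnError) {
            // Surface the job as FAILED but keep the JobManager alive.
            return Action.SUBMIT_FAILED_JOB_AND_KEEP_JM;
        }
        // Otherwise fall back to the existing fatal-error path.
        return Action.PROPAGATE_FATAL_ERROR;
    }
}
```

With the config shown in the description (shutdown-on-finish false, submit-failed-job true), this sketch keeps the JM alive and reports the job as FAILED, avoiding the restart cycle.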
--
This message was sent by Atlassian Jira
(v8.20.10#820010)