[ https://issues.apache.org/jira/browse/FLINK-30444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649416#comment-17649416 ]

David Morávek commented on FLINK-30444:
---------------------------------------

This is a deeper issue with how savepoints are recovered when the JobMaster 
starts up. If anything goes sideways during execution graph restore, we fail 
the JobMaster, which is ultimately handled as a fatal exception by the 
dispatcher 
(_DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint_ is the 
relevant code path).

Since the job has already been registered for execution by the dispatcher, we 
should handle the exception and fail the job accordingly, because we can't 
recover from this.
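The proposed change could be sketched roughly as follows. This is a plain-Java illustration of the idea only — all class and method names here are hypothetical and do not reflect Flink's actual internal API:

```java
// Hypothetical sketch: instead of escalating a savepoint-restore failure as a
// fatal JobMaster/dispatcher error, catch it and transition the already
// registered job into a terminal FAILED state, keeping the process alive.
// All names are illustrative, not Flink's real API.
import java.util.Optional;

public class SavepointRestoreSketch {

    enum JobStatus { RUNNING, FAILED }

    static class RestoreResult {
        final JobStatus status;
        final Optional<Throwable> failureCause;

        RestoreResult(JobStatus status, Throwable cause) {
            this.status = status;
            this.failureCause = Optional.ofNullable(cause);
        }
    }

    /** Simulates restoring the execution graph from a savepoint. */
    static void tryRestoreExecutionGraphFromSavepoint(boolean compatible)
            throws Exception {
        if (!compatible) {
            throw new Exception("Incompatible savepoint state");
        }
    }

    /**
     * The job is already registered with the dispatcher, so a restore
     * failure should fail the job rather than the whole JobMaster.
     */
    static RestoreResult startJob(boolean savepointCompatible) {
        try {
            tryRestoreExecutionGraphFromSavepoint(savepointCompatible);
            return new RestoreResult(JobStatus.RUNNING, null);
        } catch (Exception e) {
            // Previously: rethrown and handled as a fatal exception.
            // Proposed: record the failure and mark the job FAILED.
            return new RestoreResult(JobStatus.FAILED, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(startJob(true).status + " " + startJob(false).status);
    }
}
```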

> This is also not consistent with some other startup errors, such as a missing 
> application jar. That causes a JobManager restart loop, but does not put the 
> job into a terminal FAILED state. This behaviour is more desirable, as it 
> doesn't lead to empty application clusters on Kubernetes.

This is a special class of errors that could be qualified as "pre-main method 
errors"; I think this is orthogonal to this issue and should be discussed 
separately.

> State recovery error not handled correctly and always causes JM failure
> -----------------------------------------------------------------------
>
>                 Key: FLINK-30444
>                 URL: https://issues.apache.org/jira/browse/FLINK-30444
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission
>    Affects Versions: 1.16.0, 1.14.6, 1.15.3
>            Reporter: Gyula Fora
>            Assignee: David Morávek
>            Priority: Critical
>
> When you submit a job in Application mode and you try to restore from an 
> incompatible savepoint, there is a very unexpected behaviour.
> Even with the following config:
> {noformat}
> execution.shutdown-on-application-finish: false
> execution.submit-failed-job-on-application-error: true{noformat}
> The job goes into a FAILED state, and the JobManager fails. In a Kubernetes 
> environment (when using the native Kubernetes integration), this means that 
> the JobManager is restarted automatically.
> This means that if you have the job result store enabled, after the JM comes 
> back you will end up with an empty application cluster.
> I think the correct behaviour would be, depending on the above-mentioned config:
> 1. If there is a job recovery error and you have 
> (execution.submit-failed-job-on-application-error) configured, then the job 
> should show up as failed, and the JM should not exit (if 
> execution.shutdown-on-application-finish is false)
> 2. If (execution.shutdown-on-application-finish is true), then the JobManager 
> should exit cleanly, as on a normal job terminal state, and thus stop the 
> deployment in Kubernetes, preventing a JM restart cycle



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
