[
https://issues.apache.org/jira/browse/FLINK-29415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias Pohl closed FLINK-29415.
---------------------------------
Resolution: Invalid
A solution for this specific case already exists:
{{execution.submit-failed-job-on-application-error=true}} was introduced with
FLINK-25715 in Flink 1.15, as [~gyfora] pointed out in the ML thread. I'm
closing this issue again.
> InitializationFailure when recovering from a checkpoint in Application Mode
> leads to the cleanup of all HA data
> ---------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-29415
> URL: https://issues.apache.org/jira/browse/FLINK-29415
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.16.0, 1.17.0, 1.15.2, 1.14.6
> Reporter: Matthias Pohl
> Priority: Major
>
> This issue was raised in the user ML thread [JobManager restarts on job
> failure|https://lists.apache.org/thread/qkmcty3h4gkkx5g09m19gwqrf8z8d383].
> Recovering from an external checkpoint is handled differently than recovering
> from an internal state (see
> [Dispatcher#handleJobManagerRunner|https://github.com/apache/flink/blob/41ac1ba13679121f1ddf14b26a36f4f4a3cc73e4/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L651]).
> For the latter case, we explicitly do a local cleanup (i.e. no HA data is
> cleaned up). For the case described in the ML thread, a global cleanup is
> performed. That's not a problem in session mode, where a new job ID is used:
> the new job ID results in a new namespace for the HA data, so data from
> previous runs is not touched during a cleanup. In Application mode, we use
> the default job ID `0`, which is reused across runs. All the HA data is
> therefore "namespaced" under the default job ID, and a global cleanup after
> a failure removes the HA data of the previous run as well.
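
The namespacing argument in the description can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Flink code: the class {{HaCleanupSketch}}, its methods, the {{<jobId>/<entry>}} key layout and the session-mode job IDs are all hypothetical and merely mimic the idea that HA data is grouped per job ID and that a global cleanup drops everything in that job's namespace.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model only: none of these names exist in Flink.
public class HaCleanupSketch {
    // Stand-in for the HA store; keys are "<jobId>/<entry>" (assumed layout).
    private final Map<String, String> haData = new TreeMap<>();

    void put(String jobId, String entry, String value) {
        haData.put(jobId + "/" + entry, value);
    }

    // Global cleanup: drops every HA entry in the given job's namespace.
    void globalCleanup(String jobId) {
        haData.keySet().removeIf(key -> key.startsWith(jobId + "/"));
    }

    int size() {
        return haData.size();
    }

    // A previous run wrote HA data under previousRunJobId; a later run with
    // failedRunJobId fails during initialization and triggers a global cleanup.
    // Returns how many HA entries of the previous run are still present.
    static int survivingEntries(String previousRunJobId, String failedRunJobId) {
        HaCleanupSketch store = new HaCleanupSketch();
        store.put(previousRunJobId, "checkpoint-pointer", "placeholder");
        store.put(previousRunJobId, "job-graph", "placeholder");
        store.globalCleanup(failedRunJobId);
        return store.size();
    }

    public static void main(String[] args) {
        // Session mode: every submission gets a fresh (here: made-up) job ID.
        System.out.println("session mode: " + survivingEntries("job-a", "job-b"));
        // Application mode: the default job ID "0" is reused across runs.
        System.out.println("application mode: " + survivingEntries("0", "0"));
    }
}
```

In this model the session-mode call leaves both entries of the previous run in place, while the application-mode call leaves none, because the failed run shares its namespace with the run whose data should have been kept.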