[jira] [Updated] (FLINK-29415) InitializationFailure when recovering from a checkpoint in Application Mode leads to the cleanup of all HA data

Matthias Pohl (Jira) Mon, 26 Sep 2022 03:00:14 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-29415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matthias Pohl updated FLINK-29415:
----------------------------------
    Description: This issue was raised in the user ML thread [JobManager 
restarts on job 
failure|https://lists.apache.org/thread/qkmcty3h4gkkx5g09m19gwqrf8z8d383]. 
Recovering from an external checkpoint is handled differently than recovering 
from an internal state (see 
[Dispatcher#handleJobManagerRunner|https://github.com/apache/flink/blob/41ac1ba13679121f1ddf14b26a36f4f4a3cc73e4/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L651]).
 For the latter case, we explicitly do a local cleanup (i.e. no HA data is 
cleaned up). For the case described in the ML thread, a global cleanup is 
performed. That's not a problem in session mode, where a new job ID is used: 
the new job ID results in a new namespace for the HA data, so data from 
previous runs is not touched during a cleanup. In Application mode, we use the 
default job ID `0`, which is reused across runs. All the HA data is therefore 
"namespaced" under the default job ID, and the global cleanup that follows a 
failure removes it.  (was: This issue was raised in the user ML thread 
[JobManager restarts on job 
failure|https://lists.apache.org/thread/qkmcty3h4gkkx5g09m19gwqrf8z8d383]. 
Recovering from a external checkpoint is handled differently than recovering 
from an internal state (see 
[Dispatcher#handleJobManagerRunner|https://github.com/apache/flink/blob/41ac1ba13679121f1ddf14b26a36f4f4a3cc73e4/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L651]).
 For the latter case, we explicitly do a local cleanup (i.e. no HA data is 
cleaned up). For the case, described in the ML thread, a global cleanup is 
performed. That's not a problem in session mode where a new job ID is used. But 
in Application mode, we use the default job ID `0` which would be reused. In 
case of a failure, all the HA data will be "namespaced" using the default job 
id. As a consequence, the related data is cleaned up.)

> InitializationFailure when recovering from a checkpoint in Application Mode 
> leads to the cleanup of all HA data
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-29415
>                 URL: https://issues.apache.org/jira/browse/FLINK-29415
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.16.0, 1.17.0, 1.15.2, 1.14.6
>            Reporter: Matthias Pohl
>            Priority: Major
>
> This issue was raised in the user ML thread [JobManager restarts on job 
> failure|https://lists.apache.org/thread/qkmcty3h4gkkx5g09m19gwqrf8z8d383]. 
> Recovering from an external checkpoint is handled differently than recovering 
> from an internal state (see 
> [Dispatcher#handleJobManagerRunner|https://github.com/apache/flink/blob/41ac1ba13679121f1ddf14b26a36f4f4a3cc73e4/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L651]).
>  For the latter case, we explicitly do a local cleanup (i.e. no HA data is 
> cleaned up). For the case described in the ML thread, a global cleanup is 
> performed. That's not a problem in session mode, where a new job ID is used: 
> the new job ID results in a new namespace for the HA data, so data from 
> previous runs is not touched during a cleanup. In Application mode, we use 
> the default job ID `0`, which is reused across runs. All the HA data is 
> therefore "namespaced" under the default job ID, and the global cleanup that 
> follows a failure removes it.
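
The namespace collision described above can be illustrated with a small toy 
model. This is only a sketch under stated assumptions, not Flink's HA services 
API: HaNamespaceSketch, HaStore, put, hasData, globalCleanup, localCleanup and 
newSessionJobId are hypothetical names, the store is a plain in-memory map, and 
the string "0" merely stands in for the default job ID mentioned in the 
description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Toy model only; none of these types exist in Flink. */
public class HaNamespaceSketch {

    /** Assumption: stand-in for the default job ID `0` used in Application mode. */
    public static final String DEFAULT_JOB_ID = "0";

    /** Hypothetical HA store that keeps one namespace of entries per job ID. */
    public static final class HaStore {
        private final Map<String, Map<String, String>> namespaces = new HashMap<>();

        public void put(String jobId, String key, String value) {
            namespaces.computeIfAbsent(jobId, id -> new HashMap<>()).put(key, value);
        }

        public boolean hasData(String jobId) {
            return namespaces.containsKey(jobId);
        }

        /** Global cleanup: drops everything stored under the job's namespace. */
        public void globalCleanup(String jobId) {
            namespaces.remove(jobId);
        }

        /** Local cleanup: HA data is left untouched. */
        public void localCleanup(String jobId) {
            // Intentionally empty: only process-local resources would be released.
        }
    }

    /** Session mode in this model: every submission gets a fresh job ID. */
    public static String newSessionJobId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // Session mode: the failing run has its own namespace.
        HaStore session = new HaStore();
        String firstRun = newSessionJobId();
        session.put(firstRun, "checkpoint", "chk-42");
        String secondRun = newSessionJobId();
        session.put(secondRun, "jobGraph", "graph-of-second-run");
        session.globalCleanup(secondRun); // initialization failure of the second run
        System.out.println("session mode, earlier data kept: " + session.hasData(firstRun));

        // Application mode: both runs share the namespace of the default job ID.
        HaStore application = new HaStore();
        application.put(DEFAULT_JOB_ID, "checkpoint", "chk-42");
        application.put(DEFAULT_JOB_ID, "jobGraph", "graph-of-second-run");
        application.globalCleanup(DEFAULT_JOB_ID); // same failure, same cleanup
        System.out.println("application mode, earlier data kept: " + application.hasData(DEFAULT_JOB_ID));
    }
}
```

In this model the first line printed reports true and the second reports 
false: the same global cleanup is harmless when each run has its own job ID, 
but removes the earlier checkpoint entry when the job ID is shared. A local 
cleanup would leave the entries in place in both cases.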



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
