[
https://issues.apache.org/jira/browse/FLINK-21928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann closed FLINK-21928.
---------------------------------
Release Note:
The fix for this problem only works if the ApplicationMode is used with a
single job submission and if the user code does not access the
`JobExecutionResult`. If either of these conditions is violated, then Flink
cannot guarantee that the whole Flink application is executed.
Additionally, users must still clean up the corresponding HA entries for the
running jobs registry, because these entries won't be reliably cleaned up when
encountering the situation described in FLINK-21928.
Resolution: Fixed
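As a rough illustration of the fixed behavior (resubmitting a job that already reached a globally terminal state triggers the regular termination sequence instead of a DuplicateJobSubmissionException), here is a hypothetical, heavily simplified model. The class and method names only mirror Flink's; this is not the actual Dispatcher code.

```java
import java.util.HashMap;
import java.util.Map;

public class DispatcherSketch {
    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    // Stands in for the HA-backed RunningJobsRegistry.
    private final Map<String, JobSchedulingStatus> registry = new HashMap<>();

    void markDone(String jobId) {
        registry.put(jobId, JobSchedulingStatus.DONE);
    }

    /**
     * Returns true if the job was (re)started, false if the job had already
     * reached a globally terminal state, in which case the regular
     * termination sequence runs instead of a submission failure.
     */
    boolean submitJob(String jobId) {
        JobSchedulingStatus status =
                registry.getOrDefault(jobId, JobSchedulingStatus.PENDING);
        if (status == JobSchedulingStatus.DONE) {
            // Fixed behavior: the job is known to be globally terminal, so
            // shut the application cluster down normally instead of throwing
            // DuplicateJobSubmissionException.
            runTerminationSequence(jobId);
            return false;
        }
        registry.put(jobId, JobSchedulingStatus.RUNNING);
        return true;
    }

    private void runTerminationSequence(String jobId) {
        System.out.println("Shutting down application cluster for job " + jobId);
    }

    public static void main(String[] args) {
        DispatcherSketch dispatcher = new DispatcherSketch();
        String fixedJobId = "application-fixed-job-id";

        // First submission runs the job; it then reaches a terminal state.
        System.out.println("started: " + dispatcher.submitJob(fixedJobId));
        dispatcher.markDone(fixedJobId);

        // After JobManager failover, the same fixed job ID is resubmitted:
        // instead of a duplicate-submission error, the cluster shuts down.
        System.out.println("restarted: " + dispatcher.submitJob(fixedJobId));
    }
}
```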
> DuplicateJobSubmissionException after JobManager failover
> ---------------------------------------------------------
>
> Key: FLINK-21928
> URL: https://issues.apache.org/jira/browse/FLINK-21928
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.10.3, 1.11.3, 1.12.2, 1.13.0
> Environment: StandaloneApplicationClusterEntryPoint using a fixed job
> ID, High Availability enabled
> Reporter: Ufuk Celebi
> Assignee: David Morávek
> Priority: Critical
> Labels: pull-request-available, stale-critical
> Fix For: 1.14.0
>
>
> Consider the following scenario:
> * Environment: StandaloneApplicationClusterEntryPoint using a fixed job ID,
> high availability enabled
> * Flink job reaches a globally terminal state
> * Flink job is marked as finished in the high-availability service's
> RunningJobsRegistry
> * The JobManager fails over
> On recovery, the [Dispatcher throws DuplicateJobSubmissionException, because
> the job is marked as done in the
> RunningJobsRegistry|https://github.com/apache/flink/blob/release-1.12.2/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L332-L340].
> When this happens, users cannot get out of the situation without manually
> redeploying the JobManager process and changing the job ID^1^.
> The desired semantics are that we don't want to re-execute a job that has
> reached a globally terminal state. In this particular case, we know that the
> job has already reached such a state (as it has been marked in the registry).
> Therefore, we could handle this case by executing the regular termination
> sequence instead of throwing a DuplicateJobSubmissionException.
> ---
> ^1^ With ZooKeeper HA, the respective node is not ephemeral. In Kubernetes
> HA, there is, as far as I know, no notion of ephemeral data tied to a
> session in the first place.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)