Till Rohrmann created FLINK-21979:
-------------------------------------
Summary: Job can be restarted from the beginning after it reached
a terminal state
Key: FLINK-21979
URL: https://issues.apache.org/jira/browse/FLINK-21979
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.12.2, 1.11.3, 1.13.0
Reporter: Till Rohrmann
Fix For: 1.14.0
Currently, the {{JobMaster}} removes all checkpoints after a job reaches a
globally terminal state. Then it notifies the {{Dispatcher}} about the
termination of the job. The {{Dispatcher}} then removes the job from the
{{SubmittedJobGraphStore}}. If the {{Dispatcher}} process fails before doing
that it might get restarted. In this case, the {{Dispatcher}} would still find
the job in the {{SubmittedJobGraphStore}} and recover it. Since the
{{CompletedCheckpointStore}} is empty, it would start executing this job from
the beginning.
I think we must not remove job state before the job has not been marked as done
or made inaccessible for any restarted processes. Concretely, we should first
remove the job from the {{SubmittedJobGraphStore}} and only then delete the
checkpoints. Ideally all the job related cleanup operation happens atomically.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)