Till Rohrmann created FLINK-22502:
-------------------------------------

             Summary: DefaultCompletedCheckpointStore drops unrecoverable 
checkpoints silently
                 Key: FLINK-22502
                 URL: https://issues.apache.org/jira/browse/FLINK-22502
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing, Runtime / Coordination
    Affects Versions: 1.12.2, 1.11.3, 1.13.0, 1.14.0
            Reporter: Till Rohrmann
             Fix For: 1.14.0, 1.13.1, 1.12.4


{{DefaultCompletedCheckpointStore.recover()}} tries to be resilient when it 
cannot recover a checkpoint (e.g. due to a transient storage outage or a 
corrupted checkpoint). This behaviour was introduced with FLINK-7783.
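
For illustration, the current skip-on-failure logic roughly looks like the 
following sketch (the type and method names are placeholders, not the actual 
Flink code):

{code:java}
// Illustration only; RetrievableCheckpointHandle and retrieveCheckpoint()
// are placeholder names, not the actual Flink classes.
List<CompletedCheckpoint> completedCheckpoints = new ArrayList<>();
for (RetrievableCheckpointHandle handle : checkpointHandles) {
    try {
        completedCheckpoints.add(handle.retrieveCheckpoint());
    } catch (Exception e) {
        // A transient storage outage or a corrupted checkpoint ends up here:
        // the failure is only logged and the checkpoint is silently skipped.
        LOG.warn("Could not retrieve checkpoint {}. Skipping it.", handle, e);
    }
}
{code}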

The problem is that this behaviour might cause us to ignore the latest valid 
checkpoint if there is a transient problem while restoring it. That might be 
acceptable for at-least-once processing guarantees, but it clearly violates 
exactly-once processing guarantees. On top of that, it is very hard to spot.

I propose to change this behaviour so that 
{{DefaultCompletedCheckpointStore.recover()}} fails if it cannot read the 
checkpoints it is supposed to read. If the {{recover}} method fails during a 
recovery, it will kill the process. This will usually restart the process, 
which will then retry the checkpoint recovery. If the problem is of a transient 
nature, the retry should eventually succeed. If the problem occurs during the 
initial job submission, the job will directly transition to the {{FAILED}} 
state.
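
A minimal sketch of the proposed fail-fast behaviour, again with placeholder 
names; the only change is that the exception is rethrown instead of swallowed:

{code:java}
// Illustration only, same placeholder names as above.
for (RetrievableCheckpointHandle handle : checkpointHandles) {
    try {
        completedCheckpoints.add(handle.retrieveCheckpoint());
    } catch (Exception e) {
        // Fail the recovery instead of silently skipping the checkpoint.
        // The surrounding failover handling will then either retry the
        // recovery (transient problem) or fail the job submission
        // (permanent problem).
        throw new FlinkException(
            "Could not retrieve checkpoint " + handle + " during recovery.", e);
    }
}
{code}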

The proposed behaviour entails that if there is a permanent problem with the 
checkpoint (e.g. a corrupted checkpoint), Flink won't be able to recover 
without user intervention. I believe this is the right decision because Flink 
can no longer give exactly-once guarantees in this situation, and the user 
needs to resolve it explicitly.


