Joyce.Li created FLINK-25401:
--------------------------------
Summary: DefaultCompletedCheckpointStore may not return the latest
CompletedCheckpoint after JM failover.
Key: FLINK-25401
URL: https://issues.apache.org/jira/browse/FLINK-25401
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Reporter: Joyce.Li
At present, when {{DefaultCompletedCheckpointStore}} is recovered, the
{{CompletedCheckpoint}} entries are sorted by name in lexicographic
(character) order:
{code:java}
// Get all there is first.
final List<Tuple2<RetrievableStateHandle<CompletedCheckpoint>, String>> initialCheckpoints =
        checkpointStateHandleStore.getAllAndLock();
// Sort checkpoints by name.
initialCheckpoints.sort(Comparator.comparing(o -> o.f1));
{code}
But consider the following situation: we retain 3 {{CompletedCheckpoint}}
entries with IDs 99, 100 and 101. After a JM failover,
{{DefaultCompletedCheckpointStore}} restores these three checkpoints, but
their order becomes 100, 101, 99. When the state of the job is restored, the
{{CompletedCheckpoint}} with ID 99 is treated as the latest one and used for
recovery, which causes an error.
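The ordering problem can be reproduced outside Flink with plain strings. The snippet below is only an illustration: the name format "checkpoint-<id>" and the class and method names are assumptions made for this example, and the real names depend on the {{CheckpointStoreUtil}} implementation in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class LexicographicSortDemo {

    // Sorts names the same way the recovery code does: by String order.
    // The "checkpoint-<id>" format used in main is a hypothetical stand-in.
    static List<String> sortByName(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names =
                Arrays.asList("checkpoint-99", "checkpoint-100", "checkpoint-101");
        // '1' < '9', so the three-digit IDs sort before the two-digit one.
        System.out.println(sortByName(names));
        // prints [checkpoint-100, checkpoint-101, checkpoint-99]
    }
}
```

With this ordering, whatever reads the last element as the latest checkpoint picks ID 99 instead of 101.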
I think we should use {{CheckpointStoreUtil#nameToCheckpointID}} to convert the
{{String}} to {{long}} before sorting.
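
A minimal sketch of the proposed change, written against plain strings rather than the Flink types. The helper {{nameToCheckpointId}} below is a hypothetical stand-in for {{CheckpointStoreUtil#nameToCheckpointID}}, and the "checkpoint-" prefix is assumed for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class NumericSortSketch {

    // Hypothetical name prefix; the real format is store-specific.
    static final String PREFIX = "checkpoint-";

    // Stand-in for CheckpointStoreUtil#nameToCheckpointID (assumption):
    // strips the prefix and parses the remainder as a long.
    static long nameToCheckpointId(String name) {
        return Long.parseLong(name.substring(PREFIX.length()));
    }

    // Sorts by the numeric checkpoint ID instead of by the raw name.
    static List<String> sortByCheckpointId(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.comparingLong(NumericSortSketch::nameToCheckpointId));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> names =
                Arrays.asList("checkpoint-100", "checkpoint-101", "checkpoint-99");
        System.out.println(sortByCheckpointId(names));
        // prints [checkpoint-99, checkpoint-100, checkpoint-101]
    }
}
```

In the recovery code this would presumably amount to replacing the comparator on {{o.f1}} with one that compares the converted {{long}} IDs, so that the last element is again the checkpoint with the highest ID.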
--
This message was sent by Atlassian Jira
(v8.20.1#820001)