1996fanrui commented on code in PR #20233:
URL: https://github.com/apache/flink/pull/20233#discussion_r920260335
##########
flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorImpl.java:
##########
@@ -316,6 +316,10 @@ public void checkpointState(
// broadcast cancel checkpoint marker to avoid downstream
back-pressure due to
// checkpoint barrier align.
operatorChain.broadcastEvent(new
CancelCheckpointMarker(metadata.getCheckpointId()));
+ channelStateWriter.abort(
+ metadata.getCheckpointId(),
+ new CancellationException("checkpoint aborted via
notification"),
+ true);
Review Comment:
Thanks for your clarification and reminder, this is really a bug.
I have updated this PR that fixed two other places where remove aborted
checkpoint.
> What exact issue was this bug causing in
[FLINK-26803](https://issues.apache.org/jira/browse/FLINK-26803) and what were
the symptoms?
The bug will affect the new feature
[FLINK-26803](https://issues.apache.org/jira/browse/FLINK-26803), because the
channel state file can be closed only after the Checkpoints of all tasks of the
shared file are complete or abort. So when the checkpoint of some tasks fails,
if abort is not called, the file cannot be closed and all tasks sharing the
file cannot execute `inputChannelStateHandles.completeExceptionally(e);` and
`resultSubpartitionStateHandles.completeExceptionally(e);` , these
AsyncCheckpointRunnable of shared tasks will wait forever.
If the channel state file is not shared between tasks, the current task will
not create an AsyncCheckpointRunnable, so the current task and other tasks will
not be affected.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]