[GitHub] [flink] 1996fanrui commented on a diff in pull request #20233: [FLINK-28474][checkpoint] Fix the bug ChannelStateWriteResult might not fail after checkpoint abort

GitBox Wed, 13 Jul 2022 09:07:30 -0700


1996fanrui commented on code in PR #20233:
URL: https://github.com/apache/flink/pull/20233#discussion_r920260335



##########
flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorImpl.java:
##########
@@ -316,6 +316,10 @@ public void checkpointState(
             // broadcast cancel checkpoint marker to avoid downstream 
back-pressure due to
             // checkpoint barrier align.
             operatorChain.broadcastEvent(new 
CancelCheckpointMarker(metadata.getCheckpointId()));
+            channelStateWriter.abort(
+                    metadata.getCheckpointId(),
+                    new CancellationException("checkpoint aborted via 
notification"),
+                    true);

Review Comment:
   Thanks for your clarification and reminder, this is really a bug. 
   
   I have updated this PR that fixed two other places where remove aborted 
checkpoint.
   
   
   > What exact issue was this bug causing in 
[FLINK-26803](https://issues.apache.org/jira/browse/FLINK-26803) and what were 
the symptoms?
   
   The bug will affect the new feature 
[FLINK-26803](https://issues.apache.org/jira/browse/FLINK-26803), because the 
channel state file can be closed only after the Checkpoints of all tasks of the 
shared file are complete or abort. So when the checkpoint of some tasks fails, 
if abort is not called, the file cannot be closed and all tasks sharing the 
file cannot execute `inputChannelStateHandles.completeExceptionally(e);` and 
`resultSubpartitionStateHandles.completeExceptionally(e);` , these 
AsyncCheckpointRunnable of shared tasks will wait forever.
   
   If the channel state file is not shared between tasks, the current task will 
not create an AsyncCheckpointRunnable, so the current task and other tasks will 
not be affected.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [flink] 1996fanrui commented on a diff in pull request #20233: [FLINK-28474][checkpoint] Fix the bug ChannelStateWriteResult might not fail after checkpoint abort

Reply via email to