[ 
https://issues.apache.org/jira/browse/FLINK-28474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Nowojski updated FLINK-28474:
-----------------------------------
    Description: 
After a checkpoint is aborted, its ChannelStateWriteResult should fail.

But if _channelStateWriter.start(id, checkpointOptions);_ is executed after 
the checkpoint abort, the ChannelStateWriteResult will not fail.

 
h2. Cause Analysis:

When a checkpoint is aborted, channelStateWriter.start(id, checkpointOptions); 
may not have been executed yet. The IDs of such checkpoints are stored in the 
abortedCheckpointIds of SubtaskCheckpointCoordinatorImpl, and when 
checkpointState is called, it checks whether the checkpointId should be aborted.

_ChannelStateWriter.abort(checkpointId, exception, true) should also be 
executed here._

The unit test can reproduce this bug.

!image-2022-07-09-22-21-24-417.png|width=803,height=307!

 

Note: channelStateWriter.abort is only called in notifyCheckpointAborted; it 
does not account for channelStateWriter.start being called after 
notifyCheckpointAborted.
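
The ordering from the cause analysis can be sketched with a toy model. This is 
only an illustration under stated assumptions: ToyWriter and ToyCoordinator are 
hypothetical stand-ins, not Flink's ChannelStateWriter or 
SubtaskCheckpointCoordinatorImpl, and the flag abortWriterOnLateStart exists 
only to compare the behaviour with and without the proposed extra abort call.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

class AbortBeforeStartSketch {

    /** Hypothetical stand-in for a channel state writer (not Flink's class). */
    static class ToyWriter {
        private final Map<Long, CompletableFuture<Void>> results = new HashMap<>();

        /** Registers a pending write result for the checkpoint. */
        CompletableFuture<Void> start(long checkpointId) {
            return results.computeIfAbsent(checkpointId, id -> new CompletableFuture<>());
        }

        /** Fails the result if the checkpoint was started; otherwise a no-op. */
        void abort(long checkpointId, Throwable cause) {
            CompletableFuture<Void> result = results.get(checkpointId);
            if (result != null) {
                result.completeExceptionally(cause);
            }
        }
    }

    /** Hypothetical stand-in for the subtask checkpoint coordinator. */
    static class ToyCoordinator {
        private final ToyWriter writer = new ToyWriter();
        private final Set<Long> abortedCheckpointIds = new HashSet<>();
        private final boolean abortWriterOnLateStart;

        ToyCoordinator(boolean abortWriterOnLateStart) {
            this.abortWriterOnLateStart = abortWriterOnLateStart;
        }

        void notifyCheckpointAborted(long checkpointId) {
            abortedCheckpointIds.add(checkpointId);
            // Has no effect when start(...) has not run yet for this checkpoint.
            writer.abort(checkpointId, new IllegalStateException("checkpoint aborted"));
        }

        CompletableFuture<Void> startChannelState(long checkpointId) {
            return writer.start(checkpointId);
        }

        void checkpointState(long checkpointId) {
            if (abortedCheckpointIds.remove(checkpointId)) {
                if (abortWriterOnLateStart) {
                    // The extra abort proposed in this issue.
                    writer.abort(checkpointId, new IllegalStateException("checkpoint aborted"));
                }
                return; // the checkpoint is skipped either way
            }
            // A real coordinator would snapshot state here.
        }
    }

    /** Returns true if the write result fails when start happens after the abort. */
    static boolean resultFailsAfterLateStart(boolean abortWriterOnLateStart) {
        ToyCoordinator coordinator = new ToyCoordinator(abortWriterOnLateStart);
        coordinator.notifyCheckpointAborted(1L);   // abort arrives first
        CompletableFuture<Void> result = coordinator.startChannelState(1L); // late start
        coordinator.checkpointState(1L);
        return result.isCompletedExceptionally();
    }

    public static void main(String[] args) {
        System.out.println("without extra abort: " + resultFailsAfterLateStart(false));
        System.out.println("with extra abort:    " + resultFailsAfterLateStart(true));
    }
}
```

In this model the result stays pending forever when only 
notifyCheckpointAborted calls abort, and it fails as expected once 
checkpointState also calls abort for an already-aborted checkpointId.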

JIRA: FLINK-17869

commit: 
https://github.com/apache/flink/pull/12478/commits/22c99845ef4f863f1753d17b109fd2faecc8201e

 

This bug will affect the new feature FLINK-26803, because the channel state file 
can be closed only after the checkpoints of all tasks sharing the file have 
either completed or been aborted. So when the checkpoint of some tasks fails and 
abort is not called, the file cannot be closed, the tasks sharing the file 
cannot execute inputChannelStateHandles.completeExceptionally(e); and 
resultSubpartitionStateHandles.completeExceptionally(e);, and 
AsyncCheckpointRunnable will wait forever.
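
Why a missing abort blocks every task sharing the file can be sketched as 
follows. This is a rough model under assumptions: ToySharedFile, its methods, 
and the subtask names are hypothetical and do not reflect the actual 
FLINK-26803 design. A single future stands in for inputChannelStateHandles and 
resultSubpartitionStateHandles.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

class SharedChannelStateFileSketch {

    /** Hypothetical model of one channel state file shared by several subtasks. */
    static class ToySharedFile {
        private final Set<String> pendingSubtasks = new HashSet<>();
        private final CompletableFuture<Void> handles = new CompletableFuture<>();
        private Throwable failure;

        ToySharedFile(Collection<String> subtasks) {
            pendingSubtasks.addAll(subtasks);
        }

        void complete(String subtask) {
            finish(subtask, null);
        }

        void abort(String subtask, Throwable cause) {
            finish(subtask, cause);
        }

        private void finish(String subtask, Throwable cause) {
            if (cause != null && failure == null) {
                failure = cause;
            }
            pendingSubtasks.remove(subtask);
            // The file is "closed" only once every subtask has completed or aborted.
            if (pendingSubtasks.isEmpty()) {
                if (failure != null) {
                    handles.completeExceptionally(failure);
                } else {
                    handles.complete(null);
                }
            }
        }

        CompletableFuture<Void> handles() {
            return handles;
        }
    }

    /** Returns true if the shared handles future is resolved (normally or exceptionally). */
    static boolean handlesResolved(boolean failedSubtaskCallsAbort) {
        ToySharedFile file = new ToySharedFile(Arrays.asList("subtask-0", "subtask-1"));
        file.complete("subtask-0");
        if (failedSubtaskCallsAbort) {
            file.abort("subtask-1", new IllegalStateException("checkpoint failed"));
        }
        return file.handles().isDone();
    }

    public static void main(String[] args) {
        System.out.println("abort missing: resolved = " + handlesResolved(false));
        System.out.println("abort called:  resolved = " + handlesResolved(true));
    }
}
```

When the failed subtask never calls abort, the future is never resolved, so 
anything waiting on it (AsyncCheckpointRunnable in the description above) 
would block indefinitely.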

  was:
After Checkpoint abort, ChannelStateWriteResult should fail.

But if _channelStateWriter.start(id, checkpointOptions);_ is executed after 
Checkpoint abort, ChannelStateWriteResult will not fail.

 
h2. Cause Analysis:

When abort checkpoint, channelStateWriter.start(id, checkpointOptions); may not 
be executed yet. These checkpointIds will be stored in the abortedCheckpointIds 
of SubtaskCheckpointCoordinatorImpl, and when checkpointState is called, it 
will check if the checkpointId should be aborted.

_ChannelStateWriter.abort(checkpointId, exception, true) should also be 
executed here._

The unit test can reproduce this bug.

!image-2022-07-09-22-21-24-417.png|width=803,height=307!

 

Note: channelStateWriter.abort is only called in notifyCheckpointAborted, it 
doesn't account for channelStateWriter.start after notifyCheckpointAborted.

JIRA: FLINK-17869

commit: 
https://github.com/apache/flink/pull/12478/commits/22c99845ef4f863f1753d17b109fd2faecc8201e

 

 


> ChannelStateWriteResult may not fail after checkpoint abort
> -----------------------------------------------------------
>
>                 Key: FLINK-28474
>                 URL: https://issues.apache.org/jira/browse/FLINK-28474
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.5, 1.15.1
>            Reporter: fanrui
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.16.0, 1.15.2, 1.14.6
>
>         Attachments: image-2022-07-09-22-21-24-417.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
