[ 
https://issues.apache.org/jira/browse/FLINK-18050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-18050:
-----------------------------
    Description: 
When task finishes, the `CheckpointBarrierUnaligner` will decline the current 
checkpoint, which would write abort request into `ChannelStateWriter`.

The abort request will be executed before other write output request in the 
queue, and close the underlying `CheckpointStateOutputStream`. Then when the 
dispatcher executes the next write output request to access the stream, it will 
throw ClosedByInterruptException to make dispatcher thread exit.

In this process, the underlying buffers for current write output request will 
be recycled twice. 
 * ChannelStateCheckpointWriter#write will recycle all the buffers in finally 
part, which can cover both exception and normal cases.
 * ChannelStateWriteRequestDispatcherImpl#dispatch will call 
`request.cancel(e)`  to recycle the underlying buffers again in the case of 
exception.

The effect of this bug can cause further exception in the network shuffle 
process, which references the same buffer as above, then this exception will 
send to the downstream side to make it failure.

 

This bug can be reproduced easily via running 
UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.

  was:
During ChannelStateWriteRequestDispatcherImpl#dispatch, `request.cancel(e)` is 
called to recycle the internal buffer of request once exception happens.

But for the case of requesting write output, the buffers would be also finally 
recycled inside ChannelStateCheckpointWriter#write no matter exceptions or not. 
So the buffers in request will be recycled twice in the case of exception, 
which would cause further exceptions in the network shuffle process to 
reference the same buffer.

This bug can be reproduced easily via running 
UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.


> Fix the bug of recycling buffer twice once exception in 
> ChannelStateWriteRequestDispatcher#dispatch
> ---------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-18050
>                 URL: https://issues.apache.org/jira/browse/FLINK-18050
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Zhijiang
>            Assignee: Zhijiang
>            Priority: Blocker
>             Fix For: 1.11.0, 1.12.0
>
>
> When task finishes, the `CheckpointBarrierUnaligner` will decline the current 
> checkpoint, which would write abort request into `ChannelStateWriter`.
> The abort request will be executed before other write output request in the 
> queue, and close the underlying `CheckpointStateOutputStream`. Then when the 
> dispatcher executes the next write output request to access the stream, it 
> will throw ClosedByInterruptException to make dispatcher thread exit.
> In this process, the underlying buffers for current write output request will 
> be recycled twice. 
>  * ChannelStateCheckpointWriter#write will recycle all the buffers in finally 
> part, which can cover both exception and normal cases.
>  * ChannelStateWriteRequestDispatcherImpl#dispatch will call 
> `request.cancel(e)`  to recycle the underlying buffers again in the case of 
> exception.
> The effect of this bug can cause further exception in the network shuffle 
> process, which references the same buffer as above, then this exception will 
> send to the downstream side to make it failure.
>  
> This bug can be reproduced easily via running 
> UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to