[
https://issues.apache.org/jira/browse/FLINK-18050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhijiang updated FLINK-18050:
-----------------------------
Description:
When a task finishes, the `CheckpointBarrierUnaligner` declines the current
checkpoint, which writes an abort request into the `ChannelStateWriter`.
The abort request is executed before the other write-output requests in the
queue and closes the underlying `CheckpointStateOutputStream`. When the
dispatcher then executes the next write-output request and accesses the stream,
a ClosedByInterruptException is thrown and the dispatcher thread exits.
In this process, the underlying buffers of the current write-output request are
recycled twice:
* ChannelStateCheckpointWriter#write recycles all the buffers in its finally
block, which covers both the exception and the normal case.
* ChannelStateWriteRequestDispatcherImpl#dispatch calls `request.cancel(e)`,
which recycles the underlying buffers again in the exception case.
As a result, the network shuffle process, which references the same buffers,
can hit a further exception; that exception is then sent to the downstream
side and causes it to fail.
This bug can be reproduced easily by running
UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.
was:
In ChannelStateWriteRequestDispatcherImpl#dispatch, `request.cancel(e)` is
called to recycle the internal buffers of the request once an exception occurs.
But for write-output requests, the buffers are also recycled in the finally
block of ChannelStateCheckpointWriter#write, whether or not an exception occurs.
So the buffers in the request are recycled twice in the exception case, which
can cause further exceptions in the network shuffle process, since it
references the same buffers.
This bug can be reproduced easily by running
UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.
> Fix the bug of recycling buffer twice once exception in
> ChannelStateWriteRequestDispatcher#dispatch
> ---------------------------------------------------------------------------------------------------
>
> Key: FLINK-18050
> URL: https://issues.apache.org/jira/browse/FLINK-18050
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0
> Reporter: Zhijiang
> Assignee: Zhijiang
> Priority: Blocker
> Fix For: 1.11.0, 1.12.0
>
>
> When a task finishes, the `CheckpointBarrierUnaligner` declines the current
> checkpoint, which writes an abort request into the `ChannelStateWriter`.
> The abort request is executed before the other write-output requests in the
> queue and closes the underlying `CheckpointStateOutputStream`. When the
> dispatcher then executes the next write-output request and accesses the
> stream, a ClosedByInterruptException is thrown and the dispatcher thread exits.
> In this process, the underlying buffers of the current write-output request
> are recycled twice:
> * ChannelStateCheckpointWriter#write recycles all the buffers in its finally
> block, which covers both the exception and the normal case.
> * ChannelStateWriteRequestDispatcherImpl#dispatch calls `request.cancel(e)`,
> which recycles the underlying buffers again in the exception case.
> As a result, the network shuffle process, which references the same buffers,
> can hit a further exception; that exception is then sent to the downstream
> side and causes it to fail.
>
> This bug can be reproduced easily by running
> UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel.
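
The double recycle described above can be illustrated with a minimal,
self-contained sketch. This is not Flink code: `DoubleRecycleSketch`, `Buffer`,
`write`, `dispatchBuggy` and `dispatchFixed` are hypothetical stand-ins, and a
plain IOException stands in for the ClosedByInterruptException. The "fixed"
variant only shows one possible ownership rule (the writer alone recycles); it
is an assumption for illustration, not necessarily the change made for this
issue.

```java
import java.io.IOException;

public class DoubleRecycleSketch {

    /** Hypothetical stand-in for a network buffer; it only counts recycle calls. */
    public static final class Buffer {
        private int recycleCount = 0;

        public void recycle() {
            recycleCount++;
        }

        public int recycleCount() {
            return recycleCount;
        }
    }

    /**
     * Mirrors the writer described in the issue: the buffer is recycled in the
     * finally block, so both the normal and the exception case are covered.
     */
    public static void write(Buffer buffer, boolean streamClosed) throws IOException {
        try {
            if (streamClosed) {
                // Stand-in for accessing a stream already closed by the abort request.
                throw new IOException("stream already closed");
            }
            // A real writer would serialize the buffer contents here.
        } finally {
            buffer.recycle();
        }
    }

    /** Mirrors the described bug: the dispatcher recycles again on failure. */
    public static void dispatchBuggy(Buffer buffer, boolean streamClosed) {
        try {
            write(buffer, streamClosed);
        } catch (IOException e) {
            buffer.recycle(); // second recycle of the same buffer
        }
    }

    /** Assumed alternative: the writer owns the buffer, the dispatcher never recycles. */
    public static void dispatchFixed(Buffer buffer, boolean streamClosed) {
        try {
            write(buffer, streamClosed);
        } catch (IOException e) {
            // The writer has already recycled the buffer; only the failure is handled here.
        }
    }

    public static void main(String[] args) {
        Buffer buggy = new Buffer();
        dispatchBuggy(buggy, true);
        System.out.println("buggy dispatch, recycle calls: " + buggy.recycleCount());

        Buffer fixed = new Buffer();
        dispatchFixed(fixed, true);
        System.out.println("fixed dispatch, recycle calls: " + fixed.recycleCount());
    }
}
```

In the failing case the buggy dispatcher ends up with two recycle calls on the
same buffer and the fixed one with a single call; when no exception occurs,
both variants recycle exactly once.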