[ https://issues.apache.org/jira/browse/FLINK-18050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhijiang updated FLINK-18050: ----------------------------- Description: When task finishes, the `CheckpointBarrierUnaligner` will decline the current checkpoint, which would write abort request into `ChannelStateWriter`. The abort request will be executed before other write output request in the queue, and close the underlying `CheckpointStateOutputStream`. Then when the dispatcher executes the next write output request to access the stream, it will throw ClosedByInterruptException to make dispatcher thread exit. In this process, the underlying buffers for current write output request will be recycled twice. * ChannelStateCheckpointWriter#write will recycle all the buffers in finally part, which can cover both exception and normal cases. * ChannelStateWriteRequestDispatcherImpl#dispatch will call `request.cancel(e)` to recycle the underlying buffers again in the case of exception. The effect of this bug can cause further exception in the network shuffle process, which references the same buffer as above, then this exception will send to the downstream side to make it failure. This bug can be reproduced easily via running UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel. was: During ChannelStateWriteRequestDispatcherImpl#dispatch, `request.cancel(e)` is called to recycle the internal buffer of request once exception happens. But for the case of requesting write output, the buffers would be also finally recycled inside ChannelStateCheckpointWriter#write no matter exceptions or not. So the buffers in request will be recycled twice in the case of exception, which would cause further exceptions in the network shuffle process to reference the same buffer. This bug can be reproduced easily via running UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel. > Fix the bug of recycling buffer twice once exception in > ChannelStateWriteRequestDispatcher#dispatch > --------------------------------------------------------------------------------------------------- > > Key: FLINK-18050 > URL: https://issues.apache.org/jira/browse/FLINK-18050 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing > Affects Versions: 1.11.0 > Reporter: Zhijiang > Assignee: Zhijiang > Priority: Blocker > Fix For: 1.11.0, 1.12.0 > > > When task finishes, the `CheckpointBarrierUnaligner` will decline the current > checkpoint, which would write abort request into `ChannelStateWriter`. > The abort request will be executed before other write output request in the > queue, and close the underlying `CheckpointStateOutputStream`. Then when the > dispatcher executes the next write output request to access the stream, it > will throw ClosedByInterruptException to make dispatcher thread exit. > In this process, the underlying buffers for current write output request will > be recycled twice. > * ChannelStateCheckpointWriter#write will recycle all the buffers in finally > part, which can cover both exception and normal cases. > * ChannelStateWriteRequestDispatcherImpl#dispatch will call > `request.cancel(e)` to recycle the underlying buffers again in the case of > exception. > The effect of this bug can cause further exception in the network shuffle > process, which references the same buffer as above, then this exception will > send to the downstream side to make it failure. > > This bug can be reproduced easily via running > UnalignedCheckpointITCase#shouldPerformUnalignedCheckpointOnParallelRemoteChannel. -- This message was sent by Atlassian Jira (v8.3.4#803005)