Zhijiang created FLINK-17992:
--------------------------------
Summary: Exception from RemoteInputChannel#onBuffer should not
fail the whole NetworkClientHandler
Key: FLINK-17992
URL: https://issues.apache.org/jira/browse/FLINK-17992
Project: Flink
Issue Type: Bug
Components: Runtime / Network
Affects Versions: 1.10.1, 1.10.0
Reporter: Zhijiang
Assignee: Zhijiang
Fix For: 1.11.0

RemoteInputChannel#onBuffer is invoked by
CreditBasedPartitionRequestClientHandler while it receives and decodes
network data. #onBuffer can throw an exception, which tags the error in the
client handler and fails all input channels registered with that handler. This
can lead to the subtle issue described below.

If a RemoteInputChannel is being canceled by the canceler thread, the task
thread might exit earlier than the canceler thread terminates. That means the
PartitionRequestClient might not yet be closed (closing is triggered by the
canceler thread) when a new task attempt is already deployed to this
TaskManager. The new task might therefore reuse the previous
PartitionRequestClient when requesting partitions, even though the respective
client handler was already tagged with an error by the earlier
RemoteInputChannel#onBuffer failure. This causes an unnecessary additional
round of failover.
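
The sequence above can be illustrated with a small, self-contained model. This
is only a sketch under simplifying assumptions: Channel, Handler, handlerError
and newAttemptIsRejected are hypothetical names invented for illustration, not
Flink's actual classes or fields, and threading is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the reported behaviour (not Flink code).
final class SharedHandlerProblemSketch {

    /** Stand-in for an input channel; it only records whether it was failed. */
    static final class Channel {
        private final boolean throwOnBuffer;
        Throwable failure;

        Channel(boolean throwOnBuffer) {
            this.throwOnBuffer = throwOnBuffer;
        }

        void onBuffer() {
            if (throwOnBuffer) {
                throw new IllegalStateException("onBuffer failed");
            }
        }

        void onError(Throwable cause) {
            failure = cause;
        }
    }

    /** Stand-in for the client handler shared by all channels of one connection. */
    static final class Handler {
        private final List<Channel> channels = new ArrayList<>();
        private Throwable handlerError; // handler-wide error tag

        void addChannel(Channel channel) {
            // Assumption: a tagged handler rejects every later registration.
            if (handlerError != null) {
                throw new IllegalStateException("handler already tagged with an error", handlerError);
            }
            channels.add(channel);
        }

        void receive(Channel target) {
            try {
                target.onBuffer();
            } catch (RuntimeException e) {
                // Behaviour described in the report: one failing channel
                // tags the whole handler and fails every registered channel.
                handlerError = e;
                for (Channel channel : channels) {
                    channel.onError(e);
                }
                channels.clear();
            }
        }
    }

    /** Returns true if a channel of a new attempt is rejected by the reused handler. */
    static boolean newAttemptIsRejected() {
        Handler handler = new Handler();
        Channel oldAttempt = new Channel(true);
        handler.addChannel(oldAttempt);
        handler.receive(oldAttempt); // tags the handler

        try {
            // The handler was never closed, so the new attempt reuses it.
            handler.addChannel(new Channel(false));
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("new attempt rejected: " + newAttemptIsRejected());
    }
}
```

In this model the new attempt fails although its own channel never threw, which
corresponds to the unnecessary failover described above.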

This issue is hard to notice in production because the job eventually returns
to normal after one or more additional failovers. We found it through
UnalignedCheckpointITCase, because that test defines a precise number of
restarts for the configured failures.

The solution is to fail only the respective task when its
RemoteInputChannel#onBuffer throws an exception, instead of failing all
channels inside the client handler. The client then stays healthy and can
still be reused by other input channels as long as it has not been released.
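
A minimal sketch of this idea, using the same kind of simplified model: the
exception from onBuffer is caught per channel and reported only to that
channel, and no handler-wide error is recorded. Channel, Handler and
healthyChannelsSurvive are hypothetical names, and removing the failed channel
from the handler is an assumption of this sketch, not a statement about the
actual patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the proposed behaviour (not Flink code).
final class PerChannelFailureSketch {

    /** Stand-in for an input channel; it only records whether it was failed. */
    static final class Channel {
        private final boolean throwOnBuffer;
        Throwable failure;

        Channel(boolean throwOnBuffer) {
            this.throwOnBuffer = throwOnBuffer;
        }

        void onBuffer() {
            if (throwOnBuffer) {
                throw new IllegalStateException("onBuffer failed");
            }
        }

        void onError(Throwable cause) {
            failure = cause;
        }
    }

    /** Stand-in for the shared client handler, without a handler-wide error tag. */
    static final class Handler {
        private final List<Channel> channels = new ArrayList<>();

        void addChannel(Channel channel) {
            channels.add(channel);
        }

        int channelCount() {
            return channels.size();
        }

        void receive(Channel target) {
            try {
                target.onBuffer();
            } catch (RuntimeException e) {
                // Fail only the channel whose onBuffer threw.
                target.onError(e);
                // Assumption of this sketch: the failed channel is dropped.
                channels.remove(target);
            }
        }
    }

    /** Returns true if only the throwing channel is failed and the handler stays usable. */
    static boolean healthyChannelsSurvive() {
        Handler handler = new Handler();
        Channel failing = new Channel(true);
        Channel healthy = new Channel(false);
        handler.addChannel(failing);
        handler.addChannel(healthy);

        handler.receive(failing);
        handler.receive(healthy);

        // A channel of a new task attempt can still register with the handler.
        Channel newAttempt = new Channel(false);
        handler.addChannel(newAttempt);

        return failing.failure != null
                && healthy.failure == null
                && newAttempt.failure == null
                && handler.channelCount() == 2;
    }

    public static void main(String[] args) {
        System.out.println("healthy channels survive: " + healthyChannelsSurvive());
    }
}
```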
--
This message was sent by Atlassian Jira
(v8.3.4#803005)