[jira] [Created] (FLINK-17992) Exception from RemoteInputChannel#onBuffer should not fail the whole NetworkClientHandler

Zhijiang (Jira) Wed, 27 May 2020 19:59:24 -0700

Zhijiang created FLINK-17992:
--------------------------------

             Summary: Exception from RemoteInputChannel#onBuffer should not 
fail the whole NetworkClientHandler
                 Key: FLINK-17992
                 URL: https://issues.apache.org/jira/browse/FLINK-17992
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.10.1, 1.10.0
            Reporter: Zhijiang
            Assignee: Zhijiang
             Fix For: 1.11.0



RemoteInputChannel#onBuffer is invoked by 
CreditBasedPartitionRequestClientHandler while it receives and decodes 
network data. #onBuffer can throw an exception, which records the error in the 
client handler and fails all input channels registered with that handler. This 
can lead to the following subtle issue.

If the RemoteInputChannel is being canceled by the canceler thread, the task 
thread might exit before the canceler thread terminates. That means the 
PartitionRequestClient might not yet be closed (closing is triggered by the 
canceler thread) when the new task attempt is already deployed to this 
TaskManager. The new task might therefore reuse the previous 
PartitionRequestClient when requesting partitions, even though the respective 
client handler was already tagged with an error by the earlier 
RemoteInputChannel#onBuffer failure. This causes another, unnecessary round of 
failover.
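
The reuse problem described above can be illustrated with a minimal sketch. 
It is a simplified model, not Flink's actual code: StickyErrorDemo, 
SharedHandler, onBufferFailed and the String channel ids are hypothetical 
names, and the assumption is that a handler remembers its first error and 
rejects every channel added afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; these are not Flink's real classes.
class StickyErrorDemo {
    public static void main(String[] args) {
        // Stands in for the handler of a cached, not yet closed client.
        SharedHandler handler = new SharedHandler();

        // Previous attempt: one channel throws while processing a buffer.
        handler.onBufferFailed(new IllegalStateException("channel released"));

        // New attempt reuses the same client before the canceler thread closed it.
        List<String> failed = new ArrayList<>();
        handler.addInputChannel("new-attempt-channel", failed);

        System.out.println("channels failed on reuse: " + failed);
    }
}

class SharedHandler {
    // Assumption: the first error is kept for the lifetime of the handler.
    private Throwable channelError;

    void onBufferFailed(Throwable cause) {
        if (channelError == null) {
            channelError = cause;
        }
    }

    // A channel added after the error was recorded is failed immediately,
    // although it had nothing to do with the original exception.
    void addInputChannel(String id, List<String> failedChannels) {
        if (channelError != null) {
            failedChannels.add(id);
        }
    }
}
```

In this model the new attempt's channel fails right away, which corresponds 
to the extra failover round described above.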

This issue is hard to notice in production because the job eventually returns 
to normal after one or more additional failovers. We found it through 
UnalignedCheckpointITCase because that test expects a precise number of 
restarts for the configured failures.

The solution is to fail only the respective task when its 
RemoteInputChannel#onBuffer throws an exception, instead of failing all 
channels inside the client handler. The client then stays healthy and can 
still be reused by other input channels as long as it has not been released.
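
A minimal sketch of this idea follows. It is not the actual patch: 
PerChannelFailureDemo, IsolatingHandler, SketchChannel and 
notifyConnectionError are hypothetical names, and buffers are modeled as 
plain strings. The assumption is that the handler catches an exception 
thrown by one channel's onBuffer and reports it to that channel only, while 
handler-wide errors are reserved for connection-level failures.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model; these are not Flink's real classes.
class PerChannelFailureDemo {
    public static void main(String[] args) {
        IsolatingHandler handler = new IsolatingHandler();
        SketchChannel bad = new SketchChannel(true);
        SketchChannel good = new SketchChannel(false);
        handler.addInputChannel("bad", bad);
        handler.addInputChannel("good", good);

        handler.onBuffer("bad", "payload");
        handler.onBuffer("good", "payload");

        System.out.println("bad failed: " + (bad.error != null));
        System.out.println("good failed: " + (good.error != null));
        System.out.println("handler reusable: " + handler.isHealthy());
    }
}

class SketchChannel {
    final boolean throwOnBuffer;
    Throwable error;   // set when only this channel (and its task) is failed
    int received;      // number of buffers processed successfully

    SketchChannel(boolean throwOnBuffer) {
        this.throwOnBuffer = throwOnBuffer;
    }

    void onBuffer(String data) {
        if (throwOnBuffer) {
            throw new IllegalStateException("onBuffer failed");
        }
        received++;
    }

    void onError(Throwable cause) {
        error = cause;
    }
}

class IsolatingHandler {
    private final Map<String, SketchChannel> channels = new LinkedHashMap<>();
    private Throwable handlerError; // only for connection-level failures

    void addInputChannel(String id, SketchChannel channel) {
        if (handlerError != null) {
            throw new IllegalStateException("handler already failed", handlerError);
        }
        channels.put(id, channel);
    }

    void onBuffer(String id, String data) {
        SketchChannel channel = channels.get(id);
        if (channel == null) {
            return; // channel already released
        }
        try {
            channel.onBuffer(data);
        } catch (Throwable t) {
            // Fail only the channel that threw; do not tag the handler.
            channel.onError(t);
        }
    }

    void notifyConnectionError(Throwable cause) {
        handlerError = cause;
    }

    boolean isHealthy() {
        return handlerError == null;
    }
}
```

With this separation, a later addInputChannel call on the same handler still 
succeeds after one channel has failed, so a reused client does not trigger 
another failover.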



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
