Zhijiang Wang created FLINK-4021:
------------------------------------

             Summary: Problem of setting autoread for netty channel when more 
tasks sharing the same Tcp connection
                 Key: FLINK-4021
                 URL: https://issues.apache.org/jira/browse/FLINK-4021
             Project: Flink
          Issue Type: Bug
          Components: Distributed Runtime
    Affects Versions: 1.0.2
            Reporter: Zhijiang Wang
            Assignee: Zhijiang Wang
             Fix For: 1.1.0


Multiple tasks can share the same TCP connection for shuffling data.
If a downstream task, say "A", has no available memory segment into which to 
read a netty buffer from the network, it sets autoread to false for the channel.
When task A fails or has available segments again, the netty handler is 
notified to process the staged buffers first and then reset autoread to true. 
In some scenarios, however, autoread is never set back to true.
When processing the staged buffers, the handler first looks up the 
corresponding input channel for each buffer. If the task for that input channel 
has failed, the decodeMsg method in PartitionRequestClientHandler returns 
false, which means that autoread is never set back to true.
In summary: task "A" sets autoread to false because it has no available 
segments, which leaves some staged buffers. Another task "B", which is the 
receiver of one of those staged buffers, then fails unexpectedly. When task A 
tries to reset autoread to true, it cannot, because task B has failed.
I have fixed this problem in our application by adding a boolean parameter to 
the decodeBufferOrEvent method in PartitionRequestClientHandler to distinguish 
whether the method is invoked by the netty IO thread's channel read or by the 
staged message handler task.
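
The scenario can be illustrated with a minimal sketch. This is a toy model 
under stated assumptions, not Flink's actual code: the class AutoReadSketch, 
its fields, and the distinguishCaller switch are hypothetical, and only the 
names decodeBufferOrEvent and isStagedBuffer echo the description above. It 
assumes that a staged buffer whose receiver has failed can simply be dropped.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy model only: none of these types are Flink or Netty classes.
final class AutoReadSketch {
    // Each staged buffer is represented by the name of its receiving task.
    private final Queue<String> stagedBuffers = new ArrayDeque<>();
    private final Set<String> failedTasks = new HashSet<>();
    // false = behavior described as the bug, true = behavior of the proposed fix.
    private final boolean distinguishCaller;
    private boolean autoRead = true;

    AutoReadSketch(boolean distinguishCaller) {
        this.distinguishCaller = distinguishCaller;
    }

    // A receiver ran out of segments: stage the buffer and stop reading.
    void stage(String receiverTask) {
        stagedBuffers.add(receiverTask);
        autoRead = false;
    }

    void fail(String task) {
        failedTasks.add(task);
    }

    boolean isAutoRead() {
        return autoRead;
    }

    int stagedCount() {
        return stagedBuffers.size();
    }

    // Returns true if the caller may continue with the next buffer.
    private boolean decodeBufferOrEvent(String receiverTask, boolean isStagedBuffer) {
        if (failedTasks.contains(receiverTask)) {
            // Receiver is gone. Without the extra parameter this always
            // reports false; with it, a staged buffer counts as handled.
            return distinguishCaller && isStagedBuffer;
        }
        return true; // buffer handed over to a live receiver
    }

    // Models the staged message handler: drain everything, then resume reading.
    void drainStagedBuffers() {
        while (!stagedBuffers.isEmpty()) {
            if (!decodeBufferOrEvent(stagedBuffers.peek(), true)) {
                return; // stuck: autoRead is never reset
            }
            stagedBuffers.remove();
        }
        autoRead = true;
    }
}
```

With distinguishCaller set to false, staging buffers for A and B, failing B, 
and then draining leaves autoread off with B's buffer still staged. With it 
set to true, the same sequence drains both buffers and turns autoread back on.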



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
