[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036692#comment-17036692
 ] 

Zhijiang commented on FLINK-16030:
----------------------------------

I agree with [~pnowojski]'s concern. I had forgotten the earlier issue that 
the netty thread might get stuck in IO operations while reading data from a 
blocking partition in some severe scenarios. That would delay the response to 
a heartbeat ping message and cause unnecessary failures. The current netty 
handlers in the Flink stack are shared by both pipelined & blocking 
partitions, so we cannot consider only the pipelined case.
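
As a minimal sketch of this pitfall (the handler class, file name, and helper 
below are hypothetical, not Flink's actual code): any handler that blocks the 
netty IO thread also stalls heartbeat traffic multiplexed onto the same event 
loop.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import io.netty.buffer.Unpooled;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    // Illustrative anti-pattern only: blocking the event loop thread starves
    // everything multiplexed on it, including heartbeat replies.
    public class BlockingPartitionReadHandler extends ChannelInboundHandlerAdapter {

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
            // While this blocking read runs, a heartbeat ping arriving on the
            // same event loop cannot be answered, so the peer may observe a
            // false heartbeat timeout.
            byte[] data = readBlockingFromDisk();
            ctx.writeAndFlush(Unpooled.wrappedBuffer(data));
        }

        private byte[] readBlockingFromDisk() throws IOException {
            // Hypothetical stand-in for reading a blocking partition from disk.
            return Files.readAllBytes(Paths.get("partition.data"));
        }
    }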

To answer [~pnowojski]'s question above: the current heartbeat between TM/JM 
cannot cover this case. When the server side becomes aware of the network 
issue (a local machine iptables issue), it closes the channel on its side and 
releases all the partitions. But a channel can also become inactive in the 
normal case, e.g. when the client side explicitly sends a 
`CancelPartition|CloseRequest` message to close the channel, so the server 
side cannot simply throw an exception to report to the JM whenever a channel 
goes inactive. In short, the server side cannot distinguish these cases when 
it becomes aware of an inactive channel.

When the server side closes its local channel, the client side only becomes 
aware of it after two hours (based on the kernel's default keep-alive 
settings), so the whole job is stuck until it fails two hours later.
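
For reference, a minimal sketch of enabling TCP keepalive on a netty client 
bootstrap; note that the probe timing is still governed by the kernel (on 
Linux, net.ipv4.tcp_keepalive_time defaults to 7200 seconds, i.e. exactly 
the two hours above), so this alone does not give fast detection.

    import io.netty.bootstrap.Bootstrap;
    import io.netty.channel.ChannelInitializer;
    import io.netty.channel.ChannelOption;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.SocketChannel;
    import io.netty.channel.socket.nio.NioSocketChannel;

    public class KeepAliveClient {

        public static Bootstrap newBootstrap() {
            return new Bootstrap()
                .group(new NioEventLoopGroup())
                .channel(NioSocketChannel.class)
                // Enables SO_KEEPALIVE; when probes actually fire is decided
                // by the kernel (Linux default: first probe after 7200s).
                .option(ChannelOption.SO_KEEPALIVE, true)
                .handler(new ChannelInitializer<SocketChannel>() {
                    @Override
                    protected void initChannel(SocketChannel ch) {
                        // application handlers would be added here
                    }
                });
        }
    }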

I guess there might be other options to work around this issue. If we can 
make the server side distinguish the different causes of an inactive channel, 
it could take different actions and notify the JM to trigger a job failure 
only for the abnormal ones.
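
As a rough sketch of that direction (the flag and the reporting hook are 
hypothetical, not an existing Flink handler): the server could mark a close 
as expected when it was requested via `CancelPartition|CloseRequest`, and 
report only the unexpected ones.

    import java.net.SocketAddress;

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;

    public class CloseAwareServerHandler extends ChannelInboundHandlerAdapter {

        // Hypothetical flag, set when the client explicitly asked to close.
        private volatile boolean expectedClose;

        public void markExpectedClose() {
            expectedClose = true;
        }

        @Override
        public void channelInactive(ChannelHandlerContext ctx) throws Exception {
            if (!expectedClose) {
                // Hypothetical reporting hook; the real notification path to
                // the JM would still have to be designed.
                reportUnexpectedDisconnect(ctx.channel().remoteAddress());
            }
            super.channelInactive(ctx);
        }

        private void reportUnexpectedDisconnect(SocketAddress remote) {
            System.err.println("Unexpected channel close from " + remote);
        }
    }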

> Add heartbeat between netty server and client to detect long connection alive
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-16030
>                 URL: https://issues.apache.org/jira/browse/FLINK-16030
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.7.2, 1.8.3, 1.9.2, 1.10.0
>            Reporter: begginghard
>            Priority: Major
>
> As reported on [the user mailing 
> list|https://lists.apache.org/[email protected]:lte=1M:Encountered%20error%20while%20consuming%20partitions]
> Networks can fail in many ways, some of them pretty subtle (e.g. a high 
> packet-loss ratio).
> When the long-lived TCP connection between the netty client and server is 
> lost, the server fails to send its response to the client and then shuts 
> down the channel. The netty client, however, does not know that the 
> connection is gone, so it keeps waiting for up to two hours.
> There are two ways to detect whether the long-lived TCP connection between 
> netty client and server is still alive: TCP keepalive and an 
> application-level heartbeat.
>  
> TCP keepalive fires after 2 hours by default. When the long-lived TCP 
> connection dies, the netty client therefore only triggers an exception and 
> enters failover recovery after waiting those 2 hours.
> For faster detection, netty provides IdleStateHandler, which supports a 
> ping-pong mechanism: if the netty client sends n consecutive ping messages 
> and receives no pong in return, it triggers an exception (see the sketch 
> after this quote).
>  
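
A minimal sketch of the IdleStateHandler approach described above (the 
timeout values, the missed-pong threshold, and the PING payload are 
assumptions, not part of the proposal):

    import java.io.IOException;
    import java.util.concurrent.TimeUnit;

    import io.netty.buffer.Unpooled;
    import io.netty.channel.ChannelDuplexHandler;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelPipeline;
    import io.netty.handler.timeout.IdleState;
    import io.netty.handler.timeout.IdleStateEvent;
    import io.netty.handler.timeout.IdleStateHandler;
    import io.netty.util.CharsetUtil;

    public class ClientHeartbeatHandler extends ChannelDuplexHandler {

        private static final int MAX_MISSED_PONGS = 3; // assumed threshold

        private int missedPongs;

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
            missedPongs = 0; // any inbound message counts as liveness
            super.channelRead(ctx, msg);
        }

        @Override
        public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
            if (evt instanceof IdleStateEvent) {
                IdleState state = ((IdleStateEvent) evt).state();
                if (state == IdleState.WRITER_IDLE) {
                    // Nothing written for a while: send a ping (placeholder payload).
                    ctx.writeAndFlush(Unpooled.copiedBuffer("PING", CharsetUtil.US_ASCII));
                } else if (state == IdleState.READER_IDLE
                        && ++missedPongs >= MAX_MISSED_PONGS) {
                    // n idle periods in a row without a pong: fail the channel so
                    // the client enters failover instead of waiting two hours.
                    ctx.fireExceptionCaught(new IOException("heartbeat timed out"));
                }
            } else {
                super.userEventTriggered(ctx, evt);
            }
        }

        public static void install(ChannelPipeline pipeline) {
            // reader-idle 10s (expect pongs), writer-idle 5s (send pings);
            // both values are assumptions for the sketch.
            pipeline.addLast(new IdleStateHandler(10, 5, 0, TimeUnit.SECONDS));
            pipeline.addLast(new ClientHeartbeatHandler());
        }
    }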



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
