[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036178#comment-17036178
 ] 

Piotr Nowojski edited comment on FLINK-16030 at 2/13/20 12:29 PM:
------------------------------------------------------------------

Could someone also explain the scenario in which the lack of this heartbeat 
between task managers causes issues? We do have JM <-> TM heartbeats 
after all.

Is the setup that there is an idle connection between an upstream TM and a 
downstream TM, and the upstream TM fails? Silently? Shouldn't the Job Manager 
detect this and trigger failover of the remaining TMs?


was (Author: pnowojski):
Could someone also explain the scenario in which the lack of this heartbeat 
between task managers causes issues? 

Is the setup that there is an idle connection between an upstream TM and a 
downstream TM, and the upstream TM fails? Silently? Shouldn't the Job Manager 
detect this and trigger failover of the remaining TMs?

> Add heartbeat between netty server and client to detect long connection alive
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-16030
>                 URL: https://issues.apache.org/jira/browse/FLINK-16030
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.7.2, 1.8.3, 1.9.2, 1.10.0
>            Reporter: begginghard
>            Priority: Major
>
> As reported on [the user mailing 
> list|https://lists.apache.org/[email protected]:lte=1M:Encountered%20error%20while%20consuming%20partitions]
> The network can fail in many ways, some of them quite subtle (e.g. a high 
> packet loss ratio).  
> When the long-lived TCP connection between the Netty client and server is 
> lost, the server fails to send its response to the client and then shuts down 
> the channel. Meanwhile, the Netty client does not know that the connection 
> has been dropped, so it keeps waiting for two hours.
> There are two ways to detect whether the long-lived TCP connection between 
> the Netty client and server is still alive: TCP keepalive and a heartbeat.
>  
> TCP keepalive kicks in after 2 hours by default. When the long-lived TCP 
> connection dies, the Netty client keeps waiting for those 2 hours before it 
> raises an exception and enters failover recovery.
> For faster detection, Netty provides IdleStateHandler, which can be used to 
> drive a ping-pong mechanism. If the Netty client sends n consecutive ping 
> messages and receives no pong message in return, it raises an exception.
>  
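
The ping-pong rule described above (fail after n consecutive pings with no 
pong) can be sketched as a small counter. This is only an illustration under 
stated assumptions, not the change proposed for Flink: HeartbeatMonitor, 
Action, onIdle and onPong are hypothetical names, and the sketch assumes that 
some idle timer (in Netty, for example, the idle events produced by 
IdleStateHandler) calls onIdle() once per idle interval and that the read path 
calls onPong() whenever a pong arrives.

```java
/**
 * Hypothetical helper (not part of Flink or Netty): decides, on every idle
 * interval, whether to send another ping or to declare the connection dead.
 */
class HeartbeatMonitor {

    /** What the caller should do after an idle interval has elapsed. */
    enum Action { SEND_PING, FAIL }

    private final int maxUnansweredPings;
    private int unansweredPings;

    HeartbeatMonitor(int maxUnansweredPings) {
        if (maxUnansweredPings < 1) {
            throw new IllegalArgumentException("maxUnansweredPings must be >= 1");
        }
        this.maxUnansweredPings = maxUnansweredPings;
    }

    /**
     * Assumed to be called once per idle interval. Returns FAIL once n pings
     * in a row have gone unanswered; otherwise counts one more ping.
     */
    synchronized Action onIdle() {
        if (unansweredPings >= maxUnansweredPings) {
            return Action.FAIL;
        }
        unansweredPings++;
        return Action.SEND_PING;
    }

    /** Assumed to be called when a pong arrives; resets the counter. */
    synchronized void onPong() {
        unansweredPings = 0;
    }
}
```

With n = 3 and a 10 second idle interval (both values chosen only for the 
example), a silently dropped connection would be reported at the fourth idle 
interval, i.e. after roughly 40 seconds of silence instead of 2 hours.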



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
