[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036830#comment-17036830
 ] 

Zhijiang commented on FLINK-16030:
----------------------------------

After some offline discussion with [~pnowojski], we agreed that it would be 
reasonable to enhance the server side so that it also triggers a task failure 
once it detects any exception; the JM can then handle the whole job restart.

Reviewing the current code again: once the netty client detects any 
exception, it notifies the server side on a best-effort basis via 
`CancelPartition` and `ClosePartition` messages before closing the channel. 
It also fails the respective task via 
`RemoteInputChannel#onError`.

On the netty server side, however, we only release the view resources once an 
inactive channel is detected. If the server also triggered a task failure as 
the client side does, the JM could handle the situation properly. We need to 
be careful to avoid false failures, though: in the normal case, when the 
partition has been completely consumed by the downstream side, the inactive 
channel results from a regular channel close and should not trigger any failure.
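
To make the distinction above concrete, here is a minimal sketch in plain Java 
with no netty or Flink dependency. All names in it (`InactiveChannelPolicy`, 
`registerPartition`, `markFullyConsumed`, `onChannelInactive`) are hypothetical 
and only model the decision; they are not the actual server handler API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical model of the server-side decision: when the channel becomes
 * inactive, every view is released, but a failure is reported only for
 * partitions that the consumer had not finished reading.
 */
final class InactiveChannelPolicy {

    /** Per partition id: has the downstream side consumed it completely? */
    private final Map<String, Boolean> fullyConsumed = new LinkedHashMap<>();

    void registerPartition(String partitionId) {
        fullyConsumed.put(partitionId, false);
    }

    void markFullyConsumed(String partitionId) {
        // replace() only updates ids that were registered before
        fullyConsumed.replace(partitionId, true);
    }

    /**
     * Called when the channel goes inactive. Returns the ids of partitions
     * whose producer task should be failed; all views are released either way.
     */
    List<String> onChannelInactive() {
        List<String> toFail = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : fullyConsumed.entrySet()) {
            if (!entry.getValue()) {
                toFail.add(entry.getKey()); // closed before consumption finished
            }
        }
        fullyConsumed.clear(); // models releasing the view resources
        return toFail;
    }
}
```

A channel that closes after all registered partitions were marked as fully 
consumed yields an empty list, so a regular close would not fail any task.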

[~begginghard] Once you have thought this approach through, we can sync 
further or continue the discussion on the PR page.

> Add heartbeat between netty server and client to detect long connection alive
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-16030
>                 URL: https://issues.apache.org/jira/browse/FLINK-16030
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.7.2, 1.8.3, 1.9.2, 1.10.0
>            Reporter: begginghard
>            Priority: Major
>
> As reported on [the user mailing 
> list|https://lists.apache.org/list.html?u...@flink.apache.org:lte=1M:Encountered%20error%20while%20consuming%20partitions],
> the network can fail in many ways, some of them pretty subtle (e.g. a high 
> packet-loss ratio).  
> When the long-lived TCP connection between the netty client and server is 
> lost, the server fails to send the response to the client and then shuts down 
> the channel. Meanwhile, the netty client does not know that the connection 
> has been lost, so it keeps waiting for two hours.
> To detect whether the long-lived TCP connection between netty client and 
> server is still alive, there are two options: TCP keepalive and heartbeat.
>  
> The TCP keepalive time is 2 hours by default. When the long-lived TCP 
> connection is dead, you have to wait about 2 hours before the netty client 
> triggers an exception and enters failover recovery.
> If you want to detect this faster, netty provides IdleStateHandler, which can 
> be used to build a ping-pong mechanism: if the netty client sends n 
> consecutive ping messages and receives no pong message, it triggers an exception.
>  
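
The ping-pong detection described in the issue above can be sketched as a small 
counter in plain Java, without any netty dependency. The class and method names 
(`HeartbeatMonitor`, `onIdleTimeout`, `onPongReceived`) and the threshold 
`maxMissed` are hypothetical. In a real pipeline the idle callback would 
presumably be driven by the `IdleStateEvent` that netty's `IdleStateHandler` 
fires, but that wiring is not shown here.

```java
/**
 * Hypothetical heartbeat bookkeeping: counts the pings sent since the last
 * pong and reports the connection as dead once maxMissed consecutive pings
 * went unanswered.
 */
final class HeartbeatMonitor {

    private final int maxMissed;
    private int unanswered = 0;

    HeartbeatMonitor(int maxMissed) {
        if (maxMissed <= 0) {
            throw new IllegalArgumentException("maxMissed must be positive");
        }
        this.maxMissed = maxMissed;
    }

    /**
     * Called whenever the connection has been idle for one heartbeat interval.
     * Returns true if the connection should be treated as dead; otherwise the
     * caller is expected to send one more ping, which is recorded here.
     */
    boolean onIdleTimeout() {
        if (unanswered >= maxMissed) {
            return true; // maxMissed pings in a row received no pong
        }
        unanswered++; // one more ping is now outstanding
        return false;
    }

    /** Any pong proves the peer is alive and resets the counter. */
    void onPongReceived() {
        unanswered = 0;
    }
}
```

With `maxMissed = 3` and a 10 second idle interval (both values are only 
examples), a dead connection would be reported on the fourth idle timeout 
after the last pong, that is after roughly 40 seconds instead of 2 hours.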



--
This message was sent by Atlassian Jira
(v8.3.4#803005)