[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

Piotr Nowojski (Jira) Tue, 18 Feb 2020 00:30:13 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038880#comment-17038880
 ]


Piotr Nowojski commented on FLINK-16030:
----------------------------------------

{quote}
It is probably fair to say that in cases of "non-recoverable pipelined" 
partitions, the sender should handle the exception directly as well.
{quote}
I think this is important to keep in mind here. Downstream failures 
(a timeout detected on the upstream node) should in some cases (retry-able 
partitions) cause only the downstream node to fail over, but in others 
(pipelined partitions) cause a failover of both the upstream and downstream 
tasks.

> Add heartbeat between netty server and client to detect long connection alive
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-16030
>                 URL: https://issues.apache.org/jira/browse/FLINK-16030
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.7.2, 1.8.3, 1.9.2, 1.10.0
>            Reporter: begginghard
>            Assignee: begginghard
>            Priority: Major
>
> As reported on [the user mailing 
> list|https://lists.apache.org/[email protected]:lte=1M:Encountered%20error%20while%20consuming%20partitions]
> The network can fail in many ways, some of them quite subtle (e.g. a high 
> packet-loss ratio).  
> When the long-lived TCP connection between the netty client and server is 
> lost, the server fails to send its response to the client and then shuts 
> down the channel. Meanwhile, the netty client does not know that the 
> connection has been dropped, so it keeps waiting for two hours.
> To detect whether the long-lived TCP connection between the netty client and 
> server is still alive, there are two options: TCP keepalive and heartbeat.
>  
> The TCP keepalive timeout is 2 hours by default. When the long-lived TCP 
> connection dies, the netty client waits for those 2 hours before it raises 
> an exception and enters failover recovery.
> For faster detection, netty provides IdleStateHandler, which can be used to 
> implement a ping-pong mechanism. If the netty client sends n consecutive 
> ping messages and receives no pong message in return, it raises an 
> exception.
>  
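
The "n pings without a pong" rule from the description above can be sketched roughly as follows. This is plain Java and deliberately leaves netty out, so it only illustrates the counting logic, not the actual change proposed for Flink. HeartbeatMonitor, onPingSent and onPongReceived are hypothetical names, not Flink or netty APIs; in a real handler they would presumably be called when an idle event triggers a ping and when a pong message arrives.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of Flink or netty): tracks how many pings
 * have been sent since the last pong was received.
 */
final class HeartbeatMonitor {
    private final int maxUnansweredPings;
    private final AtomicInteger unansweredPings = new AtomicInteger();

    HeartbeatMonitor(int maxUnansweredPings) {
        if (maxUnansweredPings < 1) {
            throw new IllegalArgumentException("maxUnansweredPings must be >= 1");
        }
        this.maxUnansweredPings = maxUnansweredPings;
    }

    /**
     * Record that a ping was sent.
     *
     * @return true once maxUnansweredPings pings in a row got no pong,
     *         i.e. the caller should treat the connection as dead
     */
    boolean onPingSent() {
        return unansweredPings.incrementAndGet() >= maxUnansweredPings;
    }

    /** Record that a pong arrived; any pong resets the counter. */
    void onPongReceived() {
        unansweredPings.set(0);
    }

    int unansweredPings() {
        return unansweredPings.get();
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(3);

        // One healthy round trip: ping answered by a pong.
        monitor.onPingSent();
        monitor.onPongReceived();

        // The peer goes silent: three pings in a row stay unanswered.
        boolean dead = false;
        for (int i = 0; i < 3 && !dead; i++) {
            dead = monitor.onPingSent();
        }
        System.out.println("connection considered dead: " + dead);
    }
}
```

With a ping interval of t seconds and a threshold of n, a dead connection would be noticed after roughly n * t seconds instead of the 2 hour keepalive default. Whether the caller then fails only the downstream task or both tasks is the separate question raised in the comment above.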



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
