[ https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037105#comment-17037105 ]
Liu commented on FLINK-16030:
-----------------------------

Sorry for the late reply. As a quick fix, I send a ping message to the server and expect to receive a pong message on the client side. If the client does not receive a pong within some period, such as 3 seconds, it fails the job.

Thanks to everyone who has taken an interest in this bug. I look forward to a better solution.

> Add heartbeat between netty server and client to detect long connection alive
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-16030
>                 URL: https://issues.apache.org/jira/browse/FLINK-16030
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.7.2, 1.8.3, 1.9.2, 1.10.0
>            Reporter: begginghard
>            Assignee: begginghard
>            Priority: Major
>
> As reported on [the user mailing list|https://lists.apache.org/list.html?u...@flink.apache.org:lte=1M:Encountered%20error%20while%20consuming%20partitions]
> The network can fail in many ways, some of them subtle (e.g. a high packet
> loss ratio).
> When the long-lived TCP connection between the netty client and server is
> lost, the server fails to send its response to the client and then shuts down
> the channel. Meanwhile, the netty client does not know that the connection
> has been dropped, so it keeps waiting for up to two hours.
> There are two ways to detect whether the long-lived TCP connection between
> the netty client and server is still alive: TCP keepalive and a heartbeat.
>
> The TCP keepalive time is 2 hours by default. When the long-lived TCP
> connection dies, the netty client waits for those 2 hours before it triggers
> an exception and enters failover recovery.
> For faster detection, netty provides IdleStateHandler, which can be used to
> build a ping-pong mechanism. If the netty client sends n consecutive ping
> messages and receives no pong message in return, it triggers an exception.
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)