[ https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036196#comment-17036196 ]
begginghard commented on FLINK-16030:
-------------------------------------

[~pnowojski] I have reproduced the problem in the test environment. The problem occurs after I block data transfer from the netty server to the client with iptables, while keeping the client-to-server direction alive.

> Add heartbeat between netty server and client to detect long connection alive
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-16030
>                 URL: https://issues.apache.org/jira/browse/FLINK-16030
>             Project: Flink
>          Issue Type: Improvement
>      Components: Runtime / Network
>    Affects Versions: 1.7.2, 1.8.3, 1.9.2, 1.10.0
>            Reporter: begginghard
>            Priority: Major
>
> As reported on [the user mailing list|https://lists.apache.org/list.html?u...@flink.apache.org:lte=1M:Encountered%20error%20while%20consuming%20partitions]
>
> A network can fail in many ways, some of them pretty subtle (e.g. a high packet-loss ratio).
>
> When the long-lived TCP connection between the netty client and server is lost, the server fails to send its response to the client and then shuts down the channel. The netty client, however, does not know that the connection has been dropped, so it keeps waiting for two hours.
>
> To detect whether the long-lived TCP connection between the netty client and server is still alive, we have two options: TCP keepalive and an application-level heartbeat.
>
> The TCP keepalive interval is 2 hours by default. When the connection dies, the client still waits those 2 hours before it triggers an exception and enters failover recovery.
>
> For faster detection, netty provides IdleStateHandler, which supports a ping-pong mechanism: if the netty client sends n consecutive ping messages without receiving a single pong, it triggers an exception.
>
> -- This message was sent by Atlassian Jira (v8.3.4#803005)
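
The missed-pong counting described in the proposal can be sketched in plain Java. This is only an illustration of the counting logic, not Flink's or Netty's actual implementation; the class name `HeartbeatMonitor`, the method names, and the `maxMissedPings` parameter are all hypothetical. In a real Netty pipeline one would install an IdleStateHandler and drive this logic from its idle events.

```java
// Simplified sketch (hypothetical names, not Netty/Flink API) of the
// ping-pong liveness check: after n consecutive pings with no pong,
// the connection is assumed dead and an exception is raised.
public class HeartbeatMonitor {
    private final int maxMissedPings;   // n in the description above
    private int unansweredPings = 0;

    public HeartbeatMonitor(int maxMissedPings) {
        this.maxMissedPings = maxMissedPings;
    }

    /** Called each time the idle timer fires and a ping is sent to the server. */
    public void onPingSent() {
        unansweredPings++;
        if (unansweredPings >= maxMissedPings) {
            throw new IllegalStateException(
                "No pong received after " + maxMissedPings
                    + " pings; connection assumed dead");
        }
    }

    /** Called when a pong arrives from the server; resets the miss counter. */
    public void onPongReceived() {
        unansweredPings = 0;
    }
}
```

A pong at any point resets the counter, so only n consecutive unanswered pings trigger the failure, matching the behavior the issue asks for.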