[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038512#comment-17038512
 ] 

Zhijiang commented on FLINK-16030:
----------------------------------

Considering the direction of "richer exception handling" [~sewen]  mentioned, I 
think we already followed this way for some previous cases. E.g. 
`PartitionNotFoundException` is reported on downstream side while requesting 
upstream's partition failure, then the JM can check the upstream task's state 
to make a decision. 

The current exception detection and report mechanisms are a bit different in 
netty client and server sides.
 * In client side, any exceptions detected by netty handler would cause the 
respective tasks enter failed state. So JM is aware of the exceptions via task 
state reporting.  Besides that, the client would also report to netty server 
side via network in `CancelPartitionRequest` and `CloseRequest` messages. The 
netty server side would release view resources as a result, but not fail the 
respective upstream's tasks which might be canceled by JM if necessary.
 * In server side, any exceptions detected by netty handler would not cause the 
respective tasks fail, so it also does not report to JM ATM. But it would 
notify the netty client side via `ErrorResponse` message in some cases 
(`PartitionRequestQueue#exceptionCaught`). If the client handler receives the 
error message, then it would cause the downstream' task fail to report to JM. 
JM can cancel the upstream's task if necessary.

So for both client and server sides in network, we already have the exception 
detection mechanism, but missing the effective report mechanism in some cases. 
The previous proposal for adding ping message in network is also for the 
detection mechanism in essence. But if it relies on the task failure to realize 
the report mechanism, it would bring unnecessary job restart for fake ping 
timeout.

Considering this ticket case, I think we could still follow the "richer 
exception handling" direction to some extent. For confirmed exceptions, we can 
rely on the task failure to report JM as did before. For ambiguous exceptions 
we should have a mechanism to report to JM directly such as 
`PartitionNotFoundException` did, then the JM has the global information for 
final decision. E.g. the JM can inquire the other side state, or wait for a 
while because of the other side report delay, or even send RPC to ask for the 
other side if necessary before making decisions.

> Add heartbeat between netty server and client to detect long connection alive
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-16030
>                 URL: https://issues.apache.org/jira/browse/FLINK-16030
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.7.2, 1.8.3, 1.9.2, 1.10.0
>            Reporter: begginghard
>            Assignee: begginghard
>            Priority: Major
>
> As reported on [the user mailing 
> list|https://lists.apache.org/[email protected]:lte=1M:Encountered%20error%20while%20consuming%20partitions]
> Network can fail in many ways, sometimes pretty subtle (e.g. high ratio 
> packet loss).  
> When the long tcp connection between netty client and server is lost, the 
> server would failed to send response to the client, then shut down the 
> channel. At the same time, the netty client does not know that the connection 
> has been disconnected, so it has been waiting for two hours.
> To detect the long tcp connection alive on netty client and server, we should 
> have two ways: tcp keepalive and heartbeat.
>  
> The tcp keepalive is 2 hours by default. When the long tcp connection dead, 
> you continue to wait for 2 hours, the netty client will trigger exception and 
> enter failover recovery.
> If you want to detect quickly, netty provides IdleStateHandler which it use 
> ping-pang mechanism. If netty client sends continuously n ping message and 
> receives no one pang message, then trigger exception.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to