[
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038512#comment-17038512
]
Zhijiang commented on FLINK-16030:
----------------------------------
Considering the direction of "richer exception handling" [~sewen] mentioned, I
think we have already followed this approach in some previous cases. E.g.
`PartitionNotFoundException` is reported on the downstream side when a request
for the upstream's partition fails, and the JM can then check the upstream
task's state to make a decision.
The current exception detection and reporting mechanisms differ a bit between
the netty client and server sides.
* On the client side, any exception detected by the netty handler causes the
respective tasks to enter the failed state, so the JM becomes aware of the
exception via task state reporting. Besides that, the client also notifies the
netty server side over the network via `CancelPartitionRequest` and
`CloseRequest` messages. The netty server side releases view resources as a
result, but does not fail the respective upstream tasks, which may be canceled
by the JM if necessary.
* On the server side, any exception detected by the netty handler does not
cause the respective tasks to fail, so it is also not reported to the JM ATM.
But in some cases the server notifies the netty client side via an
`ErrorResponse` message (`PartitionRequestQueue#exceptionCaught`). If the
client handler receives the error message, it fails the downstream task, which
reports to the JM. The JM can then cancel the upstream task if necessary.
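The two report paths above can be sketched as a minimal plain-Java model (no Netty; the class names, fields, and methods here are illustrative stand-ins for Flink's actual handlers, not real Flink API):

```java
// Hypothetical model of the two reporting paths. In real Flink these are
// Netty handlers exchanging CancelPartitionRequest/CloseRequest and
// ErrorResponse messages; everything here is an illustrative stand-in.
class NettyLikeClient {
    boolean taskFailed = false;
    String lastJmReport = null;

    // An ErrorResponse from the server fails the downstream task; the JM
    // then learns about it via the normal task-state update.
    void onErrorResponse(Exception cause) {
        taskFailed = true;
        lastJmReport = "task FAILED: " + cause.getMessage();
    }
}

class NettyLikeServer {
    boolean viewReleased = false;
    boolean taskFailed = false; // never set here: server tasks are left to the JM
    final NettyLikeClient peer;

    NettyLikeServer(NettyLikeClient peer) { this.peer = peer; }

    // A client-side failure arrives as CancelPartitionRequest/CloseRequest:
    // the server only releases view resources, it does not fail its own task.
    void onCancelOrClose() { viewReleased = true; }

    // A server-side exception is forwarded to the client as an ErrorResponse;
    // canceling the upstream task stays a JM decision.
    void exceptionCaught(Exception cause) { peer.onErrorResponse(cause); }
}
```

The asymmetry is visible in the model: only the client path flips `taskFailed`, which is exactly why server-side problems reach the JM only indirectly.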
So for both the client and server sides in the network stack, we already have
the exception detection mechanism, but are missing an effective reporting
mechanism in some cases. The previous proposal of adding a ping message over
the network is, in essence, also about the detection mechanism. But if it
relies on task failure to realize the reporting mechanism, it would bring
unnecessary job restarts on spurious ping timeouts.
Considering this ticket's case, I think we can still follow the "richer
exception handling" direction to some extent. For confirmed exceptions, we can
rely on task failure to report to the JM as before. For ambiguous exceptions,
we should have a mechanism to report to the JM directly, as
`PartitionNotFoundException` does, so that the JM has the global information
for the final decision. E.g. the JM can inquire about the other side's state,
wait a while to account for the other side's report delay, or even send an RPC
to query the other side if necessary before making a decision.
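The confirmed-vs-ambiguous split could be encoded roughly as follows; this is only a sketch of the idea, and `Decision`, `ExceptionRouter`, and the local `PartitionNotFoundException` stand-in are hypothetical names, not existing Flink API:

```java
// Which report path an exception should take (hypothetical).
enum Decision { FAIL_TASK_AND_REPORT, REPORT_TO_JM_DIRECTLY }

// Local stand-in for Flink's real PartitionNotFoundException, used here
// only so the example is self-contained.
class PartitionNotFoundException extends Exception { }

// Hypothetical router: confirmed exceptions fail the task (the JM learns
// of them via the task-state update); ambiguous ones go to the JM directly
// so it can decide with global information (inquire the other side's state,
// wait out report delay, or send an RPC).
class ExceptionRouter {
    Decision route(Throwable t) {
        if (t instanceof PartitionNotFoundException) {
            return Decision.REPORT_TO_JM_DIRECTLY;
        }
        return Decision.FAIL_TASK_AND_REPORT;
    }
}
```

The point of the split is that only the confirmed branch triggers a restart; the ambiguous branch leaves the decision to the JM.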
> Add heartbeat between netty server and client to detect long connection alive
> -----------------------------------------------------------------------------
>
> Key: FLINK-16030
> URL: https://issues.apache.org/jira/browse/FLINK-16030
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Network
> Affects Versions: 1.7.2, 1.8.3, 1.9.2, 1.10.0
> Reporter: begginghard
> Assignee: begginghard
> Priority: Major
>
> As reported on [the user mailing
> list|https://lists.apache.org/[email protected]:lte=1M:Encountered%20error%20while%20consuming%20partitions]
> Network can fail in many ways, sometimes pretty subtly (e.g. a high ratio of
> packet loss).
> When the long tcp connection between the netty client and server is lost, the
> server fails to send its response to the client and then shuts down the
> channel. At the same time, the netty client does not know that the connection
> has been lost, so it keeps waiting, for up to two hours.
> To detect whether the long tcp connection between the netty client and server
> is alive, there are two ways: tcp keepalive and an application-level
> heartbeat.
>
> The tcp keepalive timeout is 2 hours by default. When the long tcp connection
> dies, the netty client keeps waiting for those 2 hours before it triggers an
> exception and enters failover recovery.
> If you want to detect the failure more quickly, netty provides
> IdleStateHandler, which uses a ping-pong mechanism: if the netty client sends
> n consecutive ping messages and receives no pong message, it triggers an
> exception.
>
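The "n pings, no pong" rule from the quoted description can be sketched without Netty as a simple counter. In a real implementation one would add Netty's `IdleStateHandler` to the client pipeline and write a ping from `userEventTriggered` on an `IdleStateEvent`; the `HeartbeatMonitor` class below is an illustrative stand-in that abstracts the transport away:

```java
// Minimal sketch of the heartbeat rule: declare the connection dead after
// maxMissedPongs consecutive pings with no pong in between. Plain Java,
// no Netty; class and method names are illustrative only.
class HeartbeatMonitor {
    private final int maxMissedPongs;
    private int missedPongs = 0;

    HeartbeatMonitor(int maxMissedPongs) {
        this.maxMissedPongs = maxMissedPongs;
    }

    /**
     * Called when the idle timer fires and a ping is written.
     * Returns true when the connection should be considered dead.
     */
    boolean onPingSent() {
        missedPongs++;
        return missedPongs >= maxMissedPongs;
    }

    /** Called when a pong arrives; any progress resets the counter. */
    void onPongReceived() {
        missedPongs = 0;
    }
}
```

Note that a fake timeout here (pings lost but the peer still alive) would trip the monitor, which is exactly the "unnecessary job restart" concern raised in the comment above if the reaction is a plain task failure.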
--
This message was sent by Atlassian Jira
(v8.3.4#803005)