[
https://issues.apache.org/jira/browse/KUDU-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271335#comment-16271335
]
Dan Burkert commented on KUDU-2192:
-----------------------------------
[~kwho] have you been able to write a unit test which shows this behavior? We
do have fault-injection unit tests in krpc, and as far as I know we've never
seen an issue with stuck RPCs in the wild. Without a concrete example of how
RPCs can get stuck, this isn't very actionable.
> KRPC should have a timer to close stuck connections
> ---------------------------------------------------
>
> Key: KUDU-2192
> URL: https://issues.apache.org/jira/browse/KUDU-2192
> Project: Kudu
> Issue Type: Improvement
> Components: rpc
> Reporter: Michael Ho
>
> If the remote host goes down or its network gets unplugged, all pending RPCs
> to that host will be stuck if there is no timeout specified. While those RPCs
> which have finished sending their payloads or those which haven't started
> sending payloads can be cancelled quickly, those in mid-transmission (i.e. an
> RPC at the front of the outbound queue with part of its payload sent already)
> cannot be cancelled until the payload has been completely sent. Therefore,
> it's beneficial to have a timeout to kill a connection if it's not making any
> progress for an extended period of time so the RPC will fail and get unstuck.
> The timeout may need to be conservatively large to avoid aggressive closing
> of connections due to transient network issues. One can consider augmenting
> the existing maintenance thread logic which checks for idle connections to
> check for this kind of timeout. Please feel free to propose other
> alternatives (e.g. TCP keepalive timeout) in this JIRA.
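
For illustration only, the stalled-connection check proposed in the description could look roughly like the C++ sketch below. It is not krpc code: ConnState, FindStalledConnections and stall_timeout are hypothetical names, and the sketch assumes each connection records the last time it moved any bytes. A periodic maintenance scan would then flag connections that still have outbound data pending but have made no progress for the configured period.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical per-connection bookkeeping (an assumption, not a krpc type).
struct ConnState {
  int id;
  std::size_t pending_outbound_bytes;  // bytes queued but not yet fully sent
  Clock::time_point last_progress;     // last time a read or write moved bytes
};

// Returns the ids of connections that have outbound data pending but have
// made no progress for at least `stall_timeout`. Connections with nothing
// pending are skipped here; they would be left to a separate idle check.
std::vector<int> FindStalledConnections(const std::vector<ConnState>& conns,
                                        Clock::time_point now,
                                        Clock::duration stall_timeout) {
  assert(stall_timeout > Clock::duration::zero());
  std::vector<int> stalled;
  for (const ConnState& c : conns) {
    if (c.pending_outbound_bytes == 0) continue;
    if (now - c.last_progress >= stall_timeout) stalled.push_back(c.id);
  }
  return stalled;
}
```

A caller would close each returned connection so that its pending RPCs fail instead of staying stuck. As the description notes, stall_timeout would probably need to be conservatively large so that transient network issues do not trigger it.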