[
https://issues.apache.org/jira/browse/KUDU-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334562#comment-16334562
]
Todd Lipcon commented on KUDU-2192:
-----------------------------------
I don't think we have any 30-minute timer in the KRPC code itself. My guess is
you're hitting a built-in Linux TCP timeout:
{code}
tcp_retries2 (integer; default: 15; since Linux 2.2)
The maximum number of times a TCP packet is retransmitted in
established state before giving up. The default value is 15,
which corresponds to a duration of approximately between 13 to
30 minutes, depending on the retransmission timeout. The
RFC 1122 specified minimum limit of 100 seconds is typically
deemed too short.
{code}
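As a rough check on where the "13 to 30 minutes" range comes from, the sketch below sums the exponentially backed-off retransmission intervals. It assumes the Linux defaults of a 200 ms minimum RTO and a 120 s cap, and it simplifies the real behaviour: in practice the RTO is derived from the measured round-trip time. The function name and the 3 s starting RTO are illustrative only.

```python
TCP_RTO_MIN = 0.2    # seconds; Linux default minimum retransmission timeout
TCP_RTO_MAX = 120.0  # seconds; Linux cap on the backed-off timeout


def retransmit_budget(retries, initial_rto=TCP_RTO_MIN):
    """Seconds spent waiting across the first send plus `retries` retransmissions.

    Each wait doubles the previous one until it reaches TCP_RTO_MAX.
    """
    return sum(min(initial_rto * 2 ** k, TCP_RTO_MAX) for k in range(retries + 1))


if __name__ == "__main__":
    best_case = retransmit_budget(15)         # RTO starts at the 200 ms floor
    slower_link = retransmit_budget(15, 3.0)  # hypothetical 3 s starting RTO
    print(f"{best_case:.1f} s (~{best_case / 60:.1f} min)")      # 924.6 s (~15.4 min)
    print(f"{slower_link:.1f} s (~{slower_link / 60:.1f} min)")  # 1389.0 s (~23.1 min)
```

With the smallest possible starting RTO the connection is abandoned after roughly 15 minutes, and a larger starting RTO pushes that toward the upper end of the documented range.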
This timeout could be shortened by setting SO_KEEPALIVE on the socket together
with a shorter keepalive time.
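A minimal sketch of what that could look like at the socket level, written in Python for brevity rather than as actual KRPC code. The helper name and the 60 s / 10 s / 5-probe values are placeholders, not proposed defaults, and the TCP_KEEP* options are set only where the platform exposes them (Linux does).

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive and return the approximate detection time in seconds."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the system-wide keepalive settings.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Gap between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes tolerated before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return idle_s + interval_s * probes


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(enable_keepalive(s), "seconds until a dead peer is detected")
    s.close()
```

Keepalive probes are sent only while a connection is otherwise idle. For a connection that still has unacknowledged data in flight, Linux also provides the TCP_USER_TIMEOUT socket option, which bounds how long sent data may remain unacknowledged.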
> KRPC should have a timer to close stuck connections
> ---------------------------------------------------
>
> Key: KUDU-2192
> URL: https://issues.apache.org/jira/browse/KUDU-2192
> Project: Kudu
> Issue Type: Improvement
> Components: rpc
> Reporter: Michael Ho
> Priority: Major
>
> If the remote host goes down or its network gets unplugged, all pending RPCs
> to that host will be stuck if there is no timeout specified. While those RPCs
> which have finished sending their payloads or those which haven't started
> sending payloads can be cancelled quickly, those in mid-transmission (i.e. an
> RPC at the front of the outbound queue with part of its payload sent already)
> cannot be cancelled until the payload has been completely sent. Therefore,
> it's beneficial to have a timeout to kill a connection if it's not making any
> progress for an extended period of time so the RPC will fail and get unstuck.
> The timeout may need to be conservatively large to avoid aggressive closing
> of connections due to transient network issues. One can consider augmenting
> the existing maintenance thread logic which checks for idle connections to
> also check for this kind of timeout. Please feel free to propose other
> alternatives (e.g. TCP keepalive timeout) in this JIRA.