[ 
https://issues.apache.org/jira/browse/KUDU-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334562#comment-16334562
 ] 

Todd Lipcon commented on KUDU-2192:
-----------------------------------

I don't think we have any 30-minute timer in the krpc code itself. My guess is 
you're hitting a built-in Linux TCP timeout:

{code}
       tcp_retries2 (integer; default: 15; since Linux 2.2)
              The maximum number of times a TCP  packet  is  retransmitted  in
              established  state  before  giving up.  The default value is 15,
              which corresponds to a duration of approximately between  13  to
              30  minutes,  depending  on  the  retransmission  timeout.   The
              RFC 1122 specified minimum limit of  100  seconds  is  typically
              deemed too short.

{code}
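
To see roughly where a figure in that range comes from, here is a back-of-the-envelope sketch of the retransmission backoff. It assumes the retransmission timeout starts at 200 ms and doubles on every retry up to a 120 s cap (the usual Linux TCP_RTO_MIN and TCP_RTO_MAX values). The real starting RTO depends on the measured round-trip time, so treat the result as a lower bound rather than an exact number. The function name and parameters are made up for illustration.

```python
def retries2_lower_bound(retries=15, rto_min=0.2, rto_max=120.0):
    """Estimate seconds until TCP gives up on an established connection.

    Assumption: the RTO starts at rto_min and doubles after each
    unanswered transmission, capped at rto_max. There are retries + 1
    waiting periods: one after the original transmission and one after
    each retransmission.
    """
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


if __name__ == "__main__":
    seconds = retries2_lower_bound()
    print(f"{seconds:.1f} s (~{seconds / 60:.1f} min)")
```

With the default of 15 retries this comes to about 924.6 seconds, a bit over 15 minutes. A connection whose starting RTO is larger takes correspondingly longer to give up, which is how you end up closer to the 30-minute end.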

This could be shortened by enabling SO_KEEPALIVE and configuring a shorter 
keepalive time.
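
As a minimal sketch of what that could look like on a plain Linux socket (this is not krpc code; the helper names and the 60/10/3 values are placeholders chosen for illustration):

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=3):
    """Turn on TCP keepalive with shorter-than-default timings (Linux).

    idle_s:     seconds of idleness before the first probe is sent
    interval_s: seconds between unanswered probes
    probes:     unanswered probes before the connection is dropped
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


def approx_detection_time(idle_s=60, interval_s=10, probes=3):
    """Rough time in seconds for keepalive to declare an idle peer dead."""
    return idle_s + interval_s * probes


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("keepalive on:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    print("approx detection time:", approx_detection_time(), "s")
    s.close()
```

With these placeholder values an idle connection to a dead peer would be torn down after roughly 90 seconds instead of relying on the system-wide keepalive defaults. The right values depend on how tolerant you want to be of transient network problems.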

> KRPC should have a timer to close stuck connections
> ---------------------------------------------------
>
>                 Key: KUDU-2192
>                 URL: https://issues.apache.org/jira/browse/KUDU-2192
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>            Reporter: Michael Ho
>            Priority: Major
>
> If the remote host goes down or its network gets unplugged, all pending RPCs 
> to that host will be stuck if there is no timeout specified. While those RPCs 
> which have finished sending their payloads or those which haven't started 
> sending payloads can be cancelled quickly, those in mid-transmission (i.e. an 
> RPC at the front of the outbound queue with part of its payload sent already) 
> cannot be cancelled until the payload has been completely sent. Therefore, 
> it's beneficial to have a timeout to kill a connection if it's not making any 
> progress for an extended period of time so the RPC will fail and get unstuck. 
> The timeout may need to be conservatively large to avoid aggressive closing 
> of connections due to transient network issues. One can consider augmenting 
> the existing maintenance thread logic which checks for idle connections to 
> check for this kind of timeout. Please feel free to propose other 
> alternatives (e.g. TCP keepalive timeout) in this JIRA.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)