[jira] [Comment Edited] (KUDU-2192) KRPC should have a timer to close stuck connections

Michael Ho (JIRA) Fri, 19 Jan 2018 18:54:35 -0800

    [ 
https://issues.apache.org/jira/browse/KUDU-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333149#comment-16333149
 ]


Michael Ho edited comment on KUDU-2192 at 1/20/18 2:53 AM:
-----------------------------------------------------------

[~mmokhtar], thanks for confirming.

Off the top of my head, it's unclear where the 30 minutes value comes from. 
Doesn't seem to match the default negotiation timeout unless we changed it to a 
high value.
{noformat}
DEFINE_int64(rpc_negotiation_timeout_ms, 3000,
             "Timeout for negotiating an RPC connection.");
{noformat}
My guess is that it came from some callers of {{Socket::BlockingRecv()}} (most 
likely negotiation related), but nothing stands out as 30 minutes to me from a 
brief glance at the code. This still needs some digging.

[~mmokhtar], if possible, would you mind trying another experiment in which we 
insert an iptables rule only after some row batches have been sent over the 
network, so that the connections are blocked only after they have been 
established? What we need to verify in that case is whether some queries 
may still be stuck after cancellation is requested. Normally, the RPCs of the 
cancelled queries should be cancelled. The only corner case described in this 
JIRA is when the RPC payload may have been partially sent, in which case, 
cancellation will not be honored until the entire payload has been sent. This 
is unlikely given that the kernel would have buffered the entire payload unless 
it's too large to be absorbed. That said, it doesn't hurt to verify and include 
this as part of fault injection testing. We may catch something every now and 
then.
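
To illustrate the buffering point above, here is a minimal sketch, not KRPC code. It assumes a local {{socketpair()}} whose peer never reads as a stand-in for a connection blocked by iptables, and the helper name {{BytesAbsorbedByKernel}} is hypothetical. It reports how many bytes a non-blocking {{send()}} loop can hand to the kernel before it would block.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Kudu): tries to push `len` bytes into a
// socket whose peer never reads, and returns how many bytes the kernel
// accepted before the next send() would have blocked.
size_t BytesAbsorbedByKernel(size_t len) {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return 0;

  // Make the sending side non-blocking so a full buffer shows up as an error
  // (EAGAIN/EWOULDBLOCK) instead of a hang.
  int flags = fcntl(fds[0], F_GETFL, 0);
  fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

  std::vector<char> payload(len, 'x');
  size_t sent = 0;
  while (sent < len) {
    ssize_t n = send(fds[0], payload.data() + sent, len - sent, 0);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted, retry
      break;                         // buffer full: the payload is now "partially sent"
    }
    sent += static_cast<size_t>(n);
  }

  close(fds[0]);
  close(fds[1]);
  return sent;
}
```

With default socket buffer sizes, a payload of a few KB should be absorbed completely, while one of tens of MB should stop partway. The second case is the mid-transmission state in which cancellation is not honored.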


> KRPC should have a timer to close stuck connections
> ---------------------------------------------------
>
>                 Key: KUDU-2192
>                 URL: https://issues.apache.org/jira/browse/KUDU-2192
>             Project: Kudu
>          Issue Type: Improvement
>          Components: rpc
>            Reporter: Michael Ho
>            Priority: Major
>
> If the remote host goes down or its network gets unplugged, all pending RPCs 
> to that host will be stuck if there is no timeout specified. While those RPCs 
> which have finished sending their payloads or those which haven't started 
> sending payloads can be cancelled quickly, those in mid-transmission (i.e. an 
> RPC at the front of the outbound queue with part of its payload sent already) 
> cannot be cancelled until the payload has been completely sent. Therefore, 
> it's beneficial to have a timeout to kill a connection if it's not making any 
> progress for an extended period of time so the RPC will fail and get unstuck. 
> The timeout may need to be conservatively large to avoid aggressive closing 
> of connections due to transient network issues. One can consider augmenting 
> the existing maintenance thread logic which checks for idle connections to 
> also check for this kind of timeout. Please feel free to propose other 
> alternatives (e.g. TCP keepalive timeout) in this JIRA.
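
One possible shape for the check described above is sketched below. This is only an illustration under stated assumptions: {{ConnProgress}}, {{Enqueue}}, {{OnBytesSent}} and {{IsStuck}} are hypothetical names, not existing KRPC classes or functions. The idea is that each connection records when it last made progress on its outbound queue, and a periodic maintenance pass treats a connection as stuck when it has pending bytes and has made no progress for at least the configured timeout.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

using Clock = std::chrono::steady_clock;

// Hypothetical per-connection bookkeeping; not KRPC's actual Connection class.
struct ConnProgress {
  size_t pending_outbound_bytes = 0;  // queued but not yet written to the socket
  Clock::time_point last_progress{};  // last time the outbound queue moved

  // Called when an RPC payload is queued for sending.
  void Enqueue(size_t n, Clock::time_point now) {
    // Start the clock when the queue goes from empty to non-empty.
    if (pending_outbound_bytes == 0) last_progress = now;
    pending_outbound_bytes += n;
  }

  // Called when the socket accepted `n` more bytes.
  void OnBytesSent(size_t n, Clock::time_point now) {
    pending_outbound_bytes -= std::min(n, pending_outbound_bytes);
    last_progress = now;
  }
};

// A connection is considered stuck if it has data waiting to be sent and has
// made no progress for at least `stuck_timeout`.
bool IsStuck(const ConnProgress& c, Clock::time_point now,
             std::chrono::milliseconds stuck_timeout) {
  return c.pending_outbound_bytes > 0 && (now - c.last_progress) >= stuck_timeout;
}
```

A maintenance pass could call {{IsStuck()}} on each connection and close the ones for which it returns true, so that their pending RPCs fail instead of hanging. A connection with an empty outbound queue is never reported as stuck here, which leaves purely idle connections to the existing idle-connection check.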



