[
https://issues.apache.org/jira/browse/KUDU-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333149#comment-16333149
]
Michael Ho edited comment on KUDU-2192 at 1/20/18 2:53 AM:
-----------------------------------------------------------
[~mmokhtar], thanks for confirming.
Off the top of my head, it's unclear where the 30-minute value comes from. It
doesn't seem to match the default negotiation timeout (3 seconds) unless we
changed it to a high value.
{noformat}
DEFINE_int64(rpc_negotiation_timeout_ms, 3000,
"Timeout for negotiating an RPC connection.");
{noformat}
My guess is that it comes from some caller of {{Socket::BlockingRecv()}} (most
likely negotiation related), but nothing stood out as 30 minutes from a brief
glance at the code. It still needs some digging.
[~mmokhtar], if possible, would you mind trying another experiment in which we
insert the iptables rule only after some row batches have been sent over the
network, so that traffic is blocked only after the connections have been
established? What we need to verify in that case is whether some queries may
still be stuck after cancellation is requested. Normally, the RPCs of cancelled
queries should be cancelled. The only corner case described in this JIRA is
when the RPC payload has been partially sent, in which case cancellation will
not be honored until the entire payload has been sent. This is unlikely, given
that the kernel would have buffered the entire payload unless it's too large to
be absorbed. That said, it doesn't hurt to verify this and include it as part
of fault injection testing. We may catch something every now and then.
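To illustrate the partially-sent corner case, here is a minimal sketch. It
assumes a Unix-domain socket pair as a stand-in for a TCP connection whose peer
has stopped reading; the helper name {{BytesAbsorbedWithoutReader}} is
hypothetical and is not part of KRPC, and the actual buffer sizes depend on the
kernel configuration.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstddef>
#include <vector>

// Hypothetical helper (not KRPC code): hands `payload_bytes` to the kernel
// without blocking while the peer never reads, and returns how many bytes the
// kernel absorbed. Returns -1 if the socket pair cannot be created.
ssize_t BytesAbsorbedWithoutReader(size_t payload_bytes) {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;

  std::vector<char> payload(payload_bytes, 'x');
  size_t sent = 0;
  while (sent < payload.size()) {
    ssize_t n = send(fds[0], payload.data() + sent, payload.size() - sent,
                     MSG_DONTWAIT);
    if (n < 0) {
      if (errno == EINTR) continue;
      // EAGAIN/EWOULDBLOCK: the socket buffers are full. The remainder of the
      // payload is still in user space, i.e. the send is "mid-transmission".
      break;
    }
    sent += static_cast<size_t>(n);
  }

  close(fds[0]);
  close(fds[1]);
  return static_cast<ssize_t>(sent);
}
```

With default buffer sizes, a payload of a few KB should be absorbed in full,
whereas a payload of tens of MB should come back only partially sent. The
latter is the state in which cancellation would have to wait.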
> KRPC should have a timer to close stuck connections
> ---------------------------------------------------
>
> Key: KUDU-2192
> URL: https://issues.apache.org/jira/browse/KUDU-2192
> Project: Kudu
> Issue Type: Improvement
> Components: rpc
> Reporter: Michael Ho
> Priority: Major
>
> If the remote host goes down or its network gets unplugged, all pending RPCs
> to that host will be stuck if there is no timeout specified. While those RPCs
> which have finished sending their payloads or those which haven't started
> sending payloads can be cancelled quickly, those in mid-transmission (i.e. an
> RPC at the front of the outbound queue with part of its payload sent already)
> cannot be cancelled until the payload has been completely sent. Therefore,
> it's beneficial to have a timeout to kill a connection if it's not making any
> progress for an extended period of time so the RPC will fail and get unstuck.
> The timeout may need to be conservatively large to avoid aggressive closing
> of connections due to transient network issues. One can consider augmenting
> the existing maintenance thread logic which checks for idle connections to
> check for this kind of timeout. Please feel free to propose other
> alternatives (e.g. TCP keepalive timeout) in this JIRA.
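
The no-progress timeout proposed in the description could be sketched roughly
as follows. This is an illustration only: {{ConnState}}, {{RecordProgress}} and
{{CloseStuckConnections}} are hypothetical names, not Kudu's actual classes or
functions, and the scan is assumed to be driven by a periodic maintenance
thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for a KRPC connection; not Kudu's actual class.
struct ConnState {
  bool has_pending_outbound = false;  // an RPC is queued or partially sent
  Clock::time_point last_progress{};  // last time any bytes moved on the socket
  bool closed = false;
};

// Assumed to be called whenever the socket sends or receives bytes.
void RecordProgress(ConnState* conn, Clock::time_point now) {
  conn->last_progress = now;
}

// One pass of the maintenance scan: closes every open connection that has
// outbound work pending but has made no progress for at least `stuck_timeout`.
// Returns the number of connections closed in this pass.
size_t CloseStuckConnections(std::vector<ConnState>* conns,
                             Clock::time_point now,
                             Clock::duration stuck_timeout) {
  size_t num_closed = 0;
  for (ConnState& conn : *conns) {
    if (conn.closed || !conn.has_pending_outbound) continue;
    if (now - conn.last_progress >= stuck_timeout) {
      // A real implementation would shut down the socket here and fail the
      // connection's outstanding RPCs so that callers get unstuck.
      conn.closed = true;
      ++num_closed;
    }
  }
  return num_closed;
}
```

Connections with nothing pending are skipped on purpose, since those are
already covered by the idle-connection check mentioned in the description.
Taking {{now}} as a parameter keeps the scan easy to exercise in fault
injection tests without real sleeps.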