[ 
https://issues.apache.org/jira/browse/IMPALA-6159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756632#comment-16756632
 ] 

Todd Lipcon commented on IMPALA-6159:
-------------------------------------

Ran into a similar issue today on a cluster where a couple nodes had been 
hard-rebooted (power cycled rather than graceful restart). The Impala cluster 
here was idle for a couple days, but then once I started running queries 
against it, they would fail with TransmitData RPC errors trying to talk to the 
nodes that had been power-cycled. After some investigation we diagnosed the 
issue as the following:

- prior to the power cycle, every impalad had a connection to the target node
- when the power was cycled, the target node didn't send any TCP RST, because 
the process (and the kernel) never got a chance to shut down cleanly
- when the machine came back up, the other hosts still believed they were 
connected to it, but the sockets were no longer open on the cycled machine
- because impalad was idle with no RPCs, this state persisted indefinitely
- when I run a query, eventually one node needs to send some data to the 
cycled node. The sender thinks an RPC connection is open, so it sends the 
packet on that existing connection. It immediately gets an RST, which fails 
the query (because there's no retry, as observed in this JIRA). If there were 
a retry, the KRPC subsystem would happily re-establish a new connection and 
proceed with the query.
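
To make the failure mode and the missing retry concrete, here is a minimal 
simulation in Python. It is only a sketch, not Impala or KRPC code: FakeConn, 
Receiver, Sender and transmit() are hypothetical names invented for 
illustration, and the "reconnect" is just the creation of a fresh fake 
connection. Each payload carries a sequence number so that a retried send 
cannot be applied twice on the receiving side.

```python
class FakeConn:
    """Stand-in for a cached TCP connection (hypothetical, for illustration).

    stale=True mimics a peer that was power-cycled: the first write on the
    old connection is answered with an RST.
    """

    def __init__(self, stale=False):
        self.stale = stale

    def send(self, receiver, seq, payload):
        if self.stale:
            raise ConnectionResetError("Connection reset by peer")
        receiver.deliver(seq, payload)


class Receiver:
    """Accepts payloads and drops any sequence number it has already seen."""

    def __init__(self):
        self.seen = set()
        self.rows = []

    def deliver(self, seq, payload):
        if seq in self.seen:
            return  # duplicate caused by a retry; ignore it
        self.seen.add(seq)
        self.rows.append(payload)


class Sender:
    """Sends payloads, retrying on a fresh connection after a reset."""

    def __init__(self, receiver, conn):
        self.receiver = receiver
        self.conn = conn
        self.next_seq = 0

    def transmit(self, payload, max_attempts=2):
        seq = self.next_seq
        self.next_seq += 1
        for attempt in range(max_attempts):
            try:
                self.conn.send(self.receiver, seq, payload)
                return attempt + 1  # number of attempts that were needed
            except ConnectionResetError:
                if attempt + 1 == max_attempts:
                    raise  # out of retries: surface the error to the caller
                self.conn = FakeConn()  # simulate re-establishing the connection
```

With a stale cached connection, the first attempt fails with the reset and the 
second succeeds on the new connection, so the caller never sees the error. 
Without the retry loop, the same reset would fail the whole query.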

Unfortunately, each time I run a query, only one node gets as far as sending 
data to the cycled node and "realizes" it's been cycled. So, with 100 nodes in 
the cluster, we need to run 100 queries and let them fail before we'll have 
gotten everyone to realize there's been a power-cycled node. Pretty bad stuff.

As for solutions:
- TransmitData should probably retry, and use sequence numbers to ensure we 
don't end up with a duplicate in odd failure scenarios.
- We should enable SO_KEEPALIVE and probably TCP_USER_TIMEOUT on the TCP 
streams, so that when a node is cycled, the other nodes figure it out within 
some bounded amount of time. We should probably also set the keepalive idle 
time with some kind of jitter so that we don't have thundering herds of 
keepalive packets at regular intervals across the cluster (since most of the 
connections likely go idle at the same time as each other).
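
As a rough illustration of the keepalive suggestion, the following Python 
sketch enables SO_KEEPALIVE on a socket and randomizes the idle time. It is 
not the KRPC implementation: enable_keepalive() and all of its default values 
(60 second idle time, 25% jitter, 10 second probe interval, 3 probes) are 
assumptions chosen for the example. TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT 
and TCP_USER_TIMEOUT are Linux socket options, so they are only set when the 
platform exposes them.

```python
import random
import socket


def enable_keepalive(sock, idle_secs=60, jitter_frac=0.25,
                     interval_secs=10, probes=3, rng=random):
    """Enable TCP keepalive with a jittered idle time; return the idle time used.

    All defaults are illustrative assumptions, not recommended settings.
    """
    # Spread the idle time over [idle*(1-jitter), idle*(1+jitter)] so that
    # connections which went idle together do not all probe at the same moment.
    jitter = rng.uniform(-jitter_frac, jitter_frac)
    idle = max(1, round(idle_secs * (1 + jitter)))

    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Linux-specific tuning knobs; skipped where the platform lacks them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_secs)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    if hasattr(socket, "TCP_USER_TIMEOUT"):
        # Milliseconds that written data may stay unacknowledged before the
        # kernel gives up on the connection.
        timeout_ms = (idle + interval_secs * probes) * 1000
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)
    return idle
```

With these example values, a dead peer would be noticed roughly 
idle + interval * probes seconds after the connection goes quiet, and the 
jitter spreads the first probe of each connection over a 30 second window.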

> DataStreamSender should transparently handle some connection reset by peer
> --------------------------------------------------------------------------
>
>                 Key: IMPALA-6159
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6159
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Distributed Exec
>            Reporter: Michael Ho
>            Priority: Major
>
> A client-to-server KRPC connection can become stale if the socket was closed 
> on the server side for various reasons, such as idle connection removal or a 
> remote Impalad restart. Currently, the KRPC code will invoke the callback of 
> all RPCs using that stale connection with the failed status (e.g. "Connection 
> reset by peer"). DataStreamSender should pattern match against certain error 
> strings (as they are mostly output from strerror()) and retry the RPC 
> transparently. This may also be useful for KUDU-2192, which tracks the 
> effort to detect stuck connections and close them; in that case, we may also 
> want to transparently retry the RPC.
> FWIW, KUDU-279 is tracking the effort to have a cleaner protocol for 
> connection teardown due to idle client connection removal on the server side. 
> However, Impala still needs to handle other reasons for a stale connection.


