[
https://issues.apache.org/jira/browse/IMPALA-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sahil Takiar resolved IMPALA-9137.
----------------------------------
Fix Version/s: Impala 3.4.0
Resolution: Fixed
> Blacklist node if a DataStreamService RPC to the node fails
> -----------------------------------------------------------
>
> Key: IMPALA-9137
> URL: https://issues.apache.org/jira/browse/IMPALA-9137
> Project: IMPALA
> Issue Type: Sub-task
> Components: Backend
> Reporter: Sahil Takiar
> Assignee: Sahil Takiar
> Priority: Major
> Fix For: Impala 3.4.0
>
>
> If a query fails because an RPC to a specific node failed, the query error
> message will be similar to one of the following:
> * {{ERROR: TransmitData() to 10.65.30.141:27000 failed: Network error: recv
> got EOF from 10.65.30.141:27000 (error 108)}}
> * {{ERROR: TransmitData() to 10.65.29.251:27000 failed: Network error: recv
> error from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
> * {{ERROR: TransmitData() to 10.65.26.254:27000 failed: Network error: Client
> connection negotiation failed: client connection to 10.65.26.254:27000:
> connect: Connection refused (error 111)}}
> * {{ERROR: EndDataStream() to 127.0.0.1:27002 failed: Network error: recv
> error from 0.0.0.0:0: Transport endpoint is not connected (error 107)}}
> RPCs are already retried, so it is likely that something is wrong with the
> target node. Perhaps it crashed or is so overloaded that it can't process RPC
> requests. In any case, the Impala Coordinator should blacklist the target of
> the failed RPC so that future queries don't fail with the same error.
> If the node crashed, the statestore will eventually remove the failed node
> from the cluster as well. However, the statestore can take a while to detect
> a failed node because it has a long timeout. The issue is that queries can
> still fail within the timeout window.
> Blacklisting is necessary for transparent query retries: if a node does crash,
> the statestore takes too long to remove it from the cluster, so any attempt
> at retrying the query will just fail.
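
The behaviour described above can be sketched roughly as follows. This is only
an illustration of the idea, not Impala's actual implementation: the class
NodeBlacklist, its method names, and the 60-second default are all hypothetical.
The sketch assumes a blacklist entry expires after a fixed timeout, so that a
node which was only briefly overloaded becomes schedulable again.

```python
import time
from typing import Callable, Dict, Iterable, List


class NodeBlacklist:
    """Hypothetical coordinator-side blacklist (illustrative only)."""

    def __init__(self, timeout_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # timeout_s is an arbitrary value chosen for this sketch.
        self._timeout_s = timeout_s
        self._clock = clock
        # Maps "host:port" to the time at which the entry expires.
        self._expiry: Dict[str, float] = {}

    def record_rpc_failure(self, address: str) -> None:
        # Called once an RPC (e.g. TransmitData()) to `address` has failed
        # even after its retries; (re)starts the blacklist window.
        self._expiry[address] = self._clock() + self._timeout_s

    def is_blacklisted(self, address: str) -> bool:
        expiry = self._expiry.get(address)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            # Window elapsed: give the node another chance.
            del self._expiry[address]
            return False
        return True

    def schedulable(self, nodes: Iterable[str]) -> List[str]:
        # Nodes that new (or retried) queries may still be scheduled on.
        return [n for n in nodes if not self.is_blacklisted(n)]
```

With something like this in place, a retried query would be scheduled only on
the nodes returned by schedulable(), without waiting for the statestore to
notice the failure.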