Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the 
node fails
......................................................................


Patch Set 12:

(7 comments)

I think there is a question here of how aggressive we want to be about 
blacklisting.

Impala has two mechanisms for updating the cluster membership. The first is the 
statestore, which is fairly conservative (by default it takes 10 failed 
heartbeats in a row for a node to be removed by the statestore), is considered 
the ground truth, and is communicated to all coordinators in the cluster. Once 
a node is removed by the statestore, it has to explicitly re-register itself to 
get re-included in the membership.

Then there's blacklisting, which takes only a single failed RPC to get a node 
blacklisted. However, blacklisting is a local decision by a coordinator and is 
never communicated to other machines, and a node that is blacklisted a single 
time will only stay on the blacklist for 12s by default before getting put back 
into the membership (nodes that are repeatedly blacklisted will stay on it 
longer).

So, I don't think we need to be overly afraid of blacklisting things 
aggressively: if the node isn't actually unhealthy, blacklisting it once isn't 
going to have that big of an impact on the cluster.

That's particularly the case since this work is aimed at facilitating 
transparent query retry. We definitely want to avoid scenarios where a 
TransmitData RPC fails, we don't blacklist the node because the failure doesn't 
match our whitelist of posix codes, but we go ahead and retry the query anyway, 
and then it fails again because we schedule it on the same bad node again.

Of course, for this patch and the follow-up patch, we can mark the status as 
retryable only if we do in fact decide to blacklist the node, by doing the 
check for posix codes in KrpcDataStreamSender, but there's still the problem 
that this patch will only blacklist one node per query.

To be clear - I think for the first iteration of all of this it's probably fine 
to be fairly conservative about what we blacklist, and if sometimes queries 
fail when we could have made them succeed by being more aggressive, that's okay 
(so long as they would've failed even without the blacklisting work, such that 
it's not a regression). I just wanted to make two points: 1) blacklisting was 
designed in a way where it's not that big of a deal, and we don't need to worry 
too much about screwing up a cluster by blacklisting a node unnecessarily, and 
2) we probably really want to avoid the situation where we retry a query and it 
fails again for the same reason.

http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h
File be/src/runtime/coordinator.h:

http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h@152
PS12, Line 152:   void UpdateBlacklistWithAuxErrorInfo(const 
ReportExecStatusRequestPB& request);
This can be private


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h@263
PS12, Line 263:   boost::unordered_map<TNetworkAddress, BackendState*> 
addr_to_backend_state_;
I think we might already have this mapping in 
QuerySchedule::per_backend_exec_params()?


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@862
PS12, Line 862:         LOG(WARNING) << "Query failed due to a failed RPC to an 
unknown target address "
Doesn't seem like this case should be possible, since 'addr_to_backend_state_' 
was built using the schedule for this query, and fragments can't possibly try 
to send RPCs to backends that the query wasn't scheduled on.


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@879
PS12, Line 879:       static const set<int32_t> blacklistable_rpc_error_codes = 
{
Do you have examples of posix codes a NetworkError could generate that we 
wouldn't want to blacklist?

We also may want to do this check on the other side and only send the 
AuxErrorInfoPB if the posix code matches (see my other comments)


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@882
PS12, Line 882: ECONNREFUSED
I think you can specify these symbolically instead of having to use the actual 
numeric values if you "#include <cerrno>".


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@892
PS12, Line 892: break;
Was it your intention to not continue through the loop if the first 
RPCErrorInfoPB that gets to this point doesn't have a matching posix_code? 
Maybe move this break into the 'if' above?


http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/runtime-state.h@439
PS12, Line 439:   struct AuxErrorInfo {
Why not just use AuxErrorInfoPB directly?



--
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 12
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Fri, 13 Dec 2019 18:56:59 +0000
Gerrit-HasComments: Yes