Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )
Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails ...................................................................... Patch Set 12: (7 comments) I think there a question here of how aggressive we want to be about blacklisting. Impala has two mechanisms for updating the cluster membership: the statestore, which is fairly conservative (by default it takes 10 failed heartbeats in a row for a node to be removed by the statestore), is considered the ground truth, is communicated to all coordinators in the cluster. Once a node is removed by the statestore, it has to explicitly re-register itself to get re-included in the membership. Then there's blacklisting, which only takes a single failed rpc to get a node blacklisted, but blacklisting is a local decision by a coordinator and never communicated to other machines, and a node that is blacklisted a single time will only stay on the blacklist for 12s by default before getting put back into the membership (nodes that are repeatedly blacklisted will stay on for longer) So, I don't think that we need to be overly afraid about blacklisting things aggressively, since if the node actually isn't unhealthy blacklisting it once isn't going to have that big of an impact on the cluster. That's particularly the case since this work is aimed at facilitating transparent query retry. We definitely want to avoid scenarios where a TransmitData rpc fails, we don't blacklist the node because the failure doesn't match our whitelist of posix codes but we go ahead and retry the query anyways and then it fails again because we schedule it on the same bad node again. Of course, for this patch and the followup patch, we can just only mark the status as retryable if we do in fact decide to blacklist the node by doing the check for posix codes in KrpcDataStreamSender, but there's still the problem that this patch will only blacklist one node per query. To be clear - I think for the first iteration of all of this its probably fine to be fairly conservative about what we blacklist, and if sometimes queries fail when we could have made them succeed by being more aggressive that's okay (so long as they would've failed even without the blacklisting work, such that its not a regression). Just wanted to make the points that 1) blacklisting was designed in a way where its not that big of a deal and we don't need to worry too much about screwing up a cluster by blacklisting a node unnecessarily and 2) we probably really want to avoid the situation where we retry a query and it fails again for the same reason http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h@152 PS12, Line 152: void UpdateBlacklistWithAuxErrorInfo(const ReportExecStatusRequestPB& request); This can be private http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h@263 PS12, Line 263: boost::unordered_map<TNetworkAddress, BackendState*> addr_to_backend_state_; I think we might already have this mapping in QuerySchedule::per_backend_exec_params()? http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@862 PS12, Line 862: LOG(WARNING) << "Query failed due to a failed RPC to an unknown target address " Doesn't seem like this case should be possible, since 'addr_to_backend_state_' was built using the schedule for this query, and fragments can't possibly try to send rpcs to backends that the query wasn't scheduled on http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@879 PS12, Line 879: static const set<int32_t> blacklistable_rpc_error_codes = { Do you examples of posix codes a NetworkError could generate that we wouldn't want to blacklist? We also may want to do this check on the other side and only send the AuxErrorInfoPB if the posix code matches (see my other comments) http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@882 PS12, Line 882: ECONNREFUSED I think you can specify these directly instead of having to use the actual number value if you "#include <cerrno>" http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@892 PS12, Line 892: break; Was it your intention to not continue through the loop if the first RPCErrorInfoPB that gets to this point doesn't have a matching posix_code? Maybe move this break into the 'if' above? http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/runtime-state.h File be/src/runtime/runtime-state.h: http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/runtime-state.h@439 PS12, Line 439: struct AuxErrorInfo { Why not just use AuxErrorInfoPB directly? -- To view, visit http://gerrit.cloudera.org:8080/14677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c Gerrit-Change-Number: 14677 Gerrit-PatchSet: 12 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Comment-Date: Fri, 13 Dec 2019 18:56:59 +0000 Gerrit-HasComments: Yes
