Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )
Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails ...................................................................... Patch Set 12: (7 comments) http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h@152 PS12, Line 152: void UpdateBlacklistWithAuxErrorInfo(const ReportExecStatusRequestPB& request); > This can be private Done http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.h@263 PS12, Line 263: boost::unordered_map<TNetworkAddress, BackendState*> addr_to_backend_state_; > I think we might already have this mapping in QuerySchedule::per_backend_ex Yeah, but that mapping uses Thrift address rather than KRPC addresses. Updated the docs to make this more clear. http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@862 PS12, Line 862: LOG(WARNING) << "Query failed due to a failed RPC to an unknown target address " > Doesn't seem like this case should be possible, since 'addr_to_backend_stat I added a DCHECK(false) and changed the log level to ERROR. http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@879 PS12, Line 879: static const set<int32_t> blacklistable_rpc_error_codes = { > Do you examples of posix codes a NetworkError could generate that we wouldn Discussed offline. There are a bunch of error codes (https://www-numi.fnal.gov/offline_software/srt_public_context/WebDocs/Errors/unix_system_errors.html) that don't seem useful to blacklist. I'm not sure what set of error codes KRPC is even possible of setting for network errors. Regardless, its probably safer to selectively add error codes to this list as we see fit. http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@882 PS12, Line 882: ECONNREFUSED > I think you can specify these directly instead of having to use the actual Done http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/coordinator.cc@892 PS12, Line 892: break; > Was it your intention to not continue through the loop if the first RPCErro Done http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/runtime-state.h File be/src/runtime/runtime-state.h: http://gerrit.cloudera.org:8080/#/c/14677/12/be/src/runtime/runtime-state.h@439 PS12, Line 439: struct AuxErrorInfo { > Why not just use AuxErrorInfoPB directly? Done -- To view, visit http://gerrit.cloudera.org:8080/14677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c Gerrit-Change-Number: 14677 Gerrit-PatchSet: 12 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Comment-Date: Thu, 19 Dec 2019 03:51:47 +0000 Gerrit-HasComments: Yes
