Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/14677 )
Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the node fails ...................................................................... Patch Set 10: (12 comments) http://gerrit.cloudera.org:8080/#/c/14677/10//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/14677/10//COMMIT_MSG@18 PS10, Line 18: the target node. For coordinator C and executors S and T, this means that if S fails to send data to T, it will report back to C, and C will blacklist T, right? In that case, is it guaranteed that S will also blacklist T? Does this open up the possibility that a failure in S will blacklist other healthy nodes? Are there errors in the S->T RPC that we should ignore for the purpose of blacklisting? http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.h@252 PS10, Line 252: /// TNetworkAddress. mention ownership, nullptr (see above) http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@209 PS10, Line 209: if (UNLIKELY(!addr_to_backend_state_.insert(addr_to_backend_state).second)) { you can use emplace() to construct the element in place. insert()/emplace() return an pair<iterator,bool>. It's not clear to me how this gets evaluated in the conditional, I think it would be better to assign them explicitly. http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@815 PS10, Line 815: for (auto status_iter = request.instance_exec_status().begin(); Can you use a range-based loop? http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@826 PS10, Line 826: addr_to_backend_state_.at(NetworkAddressPBToTNetworkAddress(dest_node)); This will throw an exception if the element does not exist (which is almost certainly not what we want). Instead you could just []-access the element and check for nullptr, or use find() != end(), and issue a warning in either case. http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@829 PS10, Line 829: blacklisted_node = true; Could just break; the loop here. I can't see where blacklisted_node is used otherwise, too. http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h File be/src/runtime/runtime-state.h: http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@322 PS10, Line 322: AuxErrorInfoPB* aux_error_info() { Does this ever need to be non-const or can we return a const pointer? http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@444 PS10, Line 444: /// query_status_ != Status::OK(). Mention who owns this and who cleans it up? We should not rely on the rest of the code to call set_allocated_* eventually but instead we should make it a unique_ptr if we want to preserve the lazy allocation semantics. http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc File be/src/runtime/runtime-state.cc: http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc@317 PS10, Line 317: NetworkAddressPB* network_addr = new NetworkAddressPB(); Can you remove the unguarded allocations? You can either replace them with unique_ptr, or assemble the objects top down using mutable_rpc_error_info() etc. Alternatively for this one and then one in L320 you can move the new into the set_allocate_* call (but then you might as well use mutable_*. The one in 316 needs to go because the ownership leaves the scope of this function. http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/util/network-util.h File be/src/util/network-util.h: http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/util/network-util.h@72 PS10, Line 72: TNetworkAddress NetworkAddressPBToTNetworkAddress(const NetworkAddressPB& address); I think FromNetworkAddressPB() would be descriptive enough since the calling code will show the target type (similar to MakeNetworkAddress above). http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto File common/protobuf/control_service.proto: http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto@137 PS10, Line 137: RPC to another node failed I think here it would be good to mention that this is between two executors, not the coordinator and an executor http://gerrit.cloudera.org:8080/#/c/14677/10/tests/custom_cluster/test_blacklist.py File tests/custom_cluster/test_blacklist.py: http://gerrit.cloudera.org:8080/#/c/14677/10/tests/custom_cluster/test_blacklist.py@121 PS10, Line 121: """Verifies that an RPC failure causes the target node to be blacklisted. The Can you add a test that checks that a failure between the two non-coordinator executors in the cluster gets one of them blacklisted? -- To view, visit http://gerrit.cloudera.org:8080/14677 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c Gerrit-Change-Number: 14677 Gerrit-PatchSet: 10 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Comment-Date: Wed, 11 Dec 2019 18:33:50 +0000 Gerrit-HasComments: Yes
