Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14677 )

Change subject: IMPALA-9137: Blacklist node if a DataStreamService RPC to the 
node fails
......................................................................


Patch Set 10:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/14677/10//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14677/10//COMMIT_MSG@18
PS10, Line 18: the target node.
For coordinator C and executors S and T, this means that if S fails to send 
data to T, it will report back to C, and C will blacklist T, right? In that 
case, is it guaranteed that S will also blacklist T? Does this open up the 
possibility that a failure in S will blacklist other healthy nodes? Are there 
errors in the S->T RPC that we should ignore for the purpose of blacklisting?


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.h
File be/src/runtime/coordinator.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.h@252
PS10, Line 252:   /// TNetworkAddress.
mention ownership, nullptr (see above)


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@209
PS10, Line 209:     if 
(UNLIKELY(!addr_to_backend_state_.insert(addr_to_backend_state).second)) {
you can use emplace() to construct the element in place.

insert()/emplace() return an pair<iterator,bool>. It's not clear to me how this 
gets evaluated in the conditional, I think it would be better to assign them 
explicitly.


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@815
PS10, Line 815:       for (auto status_iter = 
request.instance_exec_status().begin();
Can you use a range-based loop?


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@826
PS10, Line 826:               
addr_to_backend_state_.at(NetworkAddressPBToTNetworkAddress(dest_node));
This will throw an exception if the element does not exist (which is almost 
certainly not what we want). Instead you could just []-access the element and 
check for nullptr, or use find() != end(), and issue a warning in either case.


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/coordinator.cc@829
PS10, Line 829:           blacklisted_node = true;
Could just break; the loop here. I can't see where blacklisted_node is used 
otherwise, too.


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h
File be/src/runtime/runtime-state.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@322
PS10, Line 322:   AuxErrorInfoPB* aux_error_info() {
Does this ever need to be non-const or can we return a const pointer?


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.h@444
PS10, Line 444:   /// query_status_ != Status::OK().
Mention who owns this and who cleans it up? We should not rely on the rest of 
the code to call set_allocated_* eventually but instead we should make it a 
unique_ptr if we want to preserve the lazy allocation semantics.


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc
File be/src/runtime/runtime-state.cc:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/runtime/runtime-state.cc@317
PS10, Line 317:   NetworkAddressPB* network_addr = new NetworkAddressPB();
Can you remove the unguarded allocations? You can either replace them with 
unique_ptr, or assemble the objects top down using mutable_rpc_error_info() 
etc. Alternatively for this one and then one in L320 you can move the new into 
the set_allocate_* call (but then you might as well use mutable_*.

The one in 316 needs to go because the ownership leaves the scope of this 
function.


http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/util/network-util.h
File be/src/util/network-util.h:

http://gerrit.cloudera.org:8080/#/c/14677/10/be/src/util/network-util.h@72
PS10, Line 72: TNetworkAddress NetworkAddressPBToTNetworkAddress(const 
NetworkAddressPB& address);
I think FromNetworkAddressPB() would be descriptive enough since the calling 
code will show the target type (similar to MakeNetworkAddress above).


http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/14677/10/common/protobuf/control_service.proto@137
PS10, Line 137: RPC to another node failed
I think here it would be good to mention that this is between two executors, 
not the coordinator and an executor


http://gerrit.cloudera.org:8080/#/c/14677/10/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/14677/10/tests/custom_cluster/test_blacklist.py@121
PS10, Line 121:     """Verifies that an RPC failure causes the target node to 
be blacklisted. The
Can you add a test that checks that a failure between the two non-coordinator 
executors in the cluster gets one of them blacklisted?



--
To view, visit http://gerrit.cloudera.org:8080/14677
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I733cca13847fde43c8ea2ae574d3ae04bd06419c
Gerrit-Change-Number: 14677
Gerrit-PatchSet: 10
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Wed, 11 Dec 2019 18:33:50 +0000
Gerrit-HasComments: Yes

Reply via email to