Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 3:

(19 comments)

Looking good.

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/admission-controller.cc
File be/src/scheduling/admission-controller.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/admission-controller.cc@737
PS3, Line 737: std::s
nit: std:: not needed


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@272
PS3, Line 272: already be on the blacklist or probation.
This may be a case worth double checking. If a node is restarted and 
re-register with the Statestore, is it guaranteed that the "delete" update will 
happen before the "create" update ? Can updates arrive out of order ?

If it's somehow possible that the node being added is still on blacklist. It 
may be safer to call FindAndRemove() unconditionally and log it if we happen to 
find it on the blacklist in the "create" path.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@273
PS3, Line 273: DCHECK
DCHECK_EQ


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@350
PS3, Line 350: return;
Given the for-loop above, can an executor belong to more than one executor 
groups ?  If so, why is it okay to return early if it's not found in only one 
of the groups ?

Also, may help to comment on the intention for doing this check. I suspect it's 
to skip the rather heavy weight copying of the membership state below.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@355
PS3, Line 355: recovering_membership_.get() != nullptr
This is repeated at multiple places in this function. May make sense to factor 
it out to a local variable for clarity.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h
File be/src/scheduling/executor-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@32
PS3, Line 32: Remove()
FindAndRemove()


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@38
PS3, Line 38: When a blacklisted executor has passed this timeout
            : /// and Maintenance() is called, the executor is put on 
'probation'
May help to also document what if an executor on blacklist is removed during 
cluster membership updates.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@42
PS3, Line 42: THere
nit: typo


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@79
PS3, Line 79: unblacklisted
May be clearer to call it "probation_list" or something so it can relates to 
the state in which the backends on this list are in.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@124
PS3, Line 124: pzrobation
typo


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@29
PS3, Line 29: 0 disables blacklisting
This seems a bit weird. Setting it to 0 means the blacklist timeout == 
statestore_max_missed_heartbeats * statestore_heartbeat_frequency_ms, right ?

Would it make more sense to set this to 120 or something by default so it means 
(statestore_max_missed_heartbeats * statestore_heartbeat_frequency_ms) * (this 
flag / 100.0) ? So setting this flag to 0 means the timeout is 0 which makes 
sense to mean blacklisting is disabled.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@39
PS3, Line 39:  ExecutorBlacklist::Blacklist(const TBackendDescriptor& be_desc) {
Should this check if blacklisting is disabled at the beginning ?


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@54
PS3, Line 54: UnixMillis()
Why not MonotonicMillis() ? This field shouldn't need to be shared across 
multiple hosts, right ? MonotonicMillis() makes sure the clock will not go 
backward in case the TSC on different cores are slightly out of sync.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@83
PS3, Line 83:   auto eq = [&be_desc](const Entry& existing) {
            :     // The IP addresses must already match, so it is sufficient 
to check the port.
            :     DCHECK_EQ(existing.be_desc.ip_address, be_desc.ip_address);
            :     return existing.be_desc.address.port == be_desc.address.port;
            :   };
This seems to be used at more than one functions. May be worth factoring it out 
as a separate function.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@99
PS3, Line 99:  return remove_it->state;
Is remove_it still valid after calling erase above ?


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@110
PS3, Line 110: unblacklisted
put on probation


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@138
PS3, Line 138: take it off probation.
remove it from the blacklist.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@184
PS3, Line 184: unblacklisted
"put on probation"  ?


http://gerrit.cloudera.org:8080/#/c/13868/3/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/13868/3/tests/custom_cluster/test_blacklist.py@62
PS3, Line 62:     assert re.search("Blacklisted Executors: (.*)", profile) is 
None
Does it make sense to also add a test case when the node is re-added back to 
the cluster ?

Also, does it make sense to test for transient failure (e.g. with debug action) 
so we can randomly fail the ExecFInstanceRPC() and exercise the path of 
probation and eventual removal from blacklist.



--
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 3
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Comment-Date: Wed, 24 Jul 2019 23:33:22 +0000
Gerrit-HasComments: Yes

Reply via email to