Thomas Tauber-Marshall has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13868 )

Change subject: IMPALA-8339: Add local executor blacklist to coordinators
......................................................................


Patch Set 4:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/admission-controller.cc
File be/src/scheduling/admission-controller.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/admission-controller.cc@737
PS3, Line 737: string
> nit: std:: not needed
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc
File be/src/scheduling/cluster-membership-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@272
PS3, Line 272: already be on the blacklist or probation.
> This may be a case worth double checking. If a node is restarted and re-reg
Yes, this DCHECK is correct - if a node is restarted, when it re-registers with 
the statestore the old subscriber is always unregistered first, see: 
https://github.com/apache/impala/blob/master/be/src/statestore/statestore.cc#L623

This is also required by the loop above, which will fail if AddExecutor() is 
called with an existing executor.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@273
PS3, Line 273: DCHECK
> DCHECK_EQ
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@350
PS3, Line 350:  exists
> Given the for-loop above, can an executor belong to more than one executor
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@355
PS3, Line 355: exists = true;
> This is repeated at multiple places in this function. May make sense to fac
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h
File be/src/scheduling/executor-blacklist.h:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@32
PS3, Line 32:
> FindAndRemove()
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@38
PS3, Line 38: When a blacklisted executor has passed this timeout
            : /// and Maintenance() is called, the executor is put on 
'probation'
> May help to also document what if an executor on blacklist is removed durin
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@42
PS3, Line 42: There
> nit: typo
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@79
PS3, Line 79:
> May be clearer to call it "probation_list" or something so it can relates t
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@124
PS3, Line 124:
> typo
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc
File be/src/scheduling/executor-blacklist.cc:

http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@29
PS3, Line 29: 0 disables blacklisting
> This seems a bit weird. Setting it to 0 means the blacklist timeout == stat
My problem with setting it to 120 is that it makes it possible for users to set 
it to < 100, which doesn't seem like it would ever be the right thing to do 
(eg. for executors that go down, it would make it very likely they would be 
incorrectly unblacklisted).

Maybe it would be better to just add another flag, blacklisting_enabled or 
similar. I was trying to minimize the number of flags, but I guess it doesn't 
really matter. I don't think we'll be formally documenting these or expecting 
customers to commonly set them, they're just intended as emergency safety 
valves.

Or we could do something like disable blacklisting if this is negative.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@39
PS3, Line 39:  ExecutorBlacklist::Blacklist(const TBackendDescriptor& be_desc) {
> Should this check if blacklisting is disabled at the beginning ?
I put the check in ClusterMembershipMgr::Blacklist to avoid copying the 
Snapshot if blacklisting is disabled.

I'll at least add a DCHECK here.


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@54
PS3, Line 54: nicMillis()
> Why not MonotonicMillis() ? This field shouldn't need to be shared across m
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@83
PS3, Line 83:   if (remove_it == be_descs.end()) {
            :     // Executor wasn't on the blacklist.
            :     return NOT_BLACKLISTED;
            :   }
            :   St
> This seems to be used at more than one functions. May be worth factoring it
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@99
PS3, Line 99:    for (auto entry_it : e
> Is remove_it still valid after calling erase above ?
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@110
PS3, Line 110:
> put on probation
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@138
PS3, Line 138:
> remove it from the blacklist.
I think this comment is correct? This branch is handling things that were on 
probation


http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@184
PS3, Line 184:
> "put on probation"  ?
Done


http://gerrit.cloudera.org:8080/#/c/13868/3/tests/custom_cluster/test_blacklist.py
File tests/custom_cluster/test_blacklist.py:

http://gerrit.cloudera.org:8080/#/c/13868/3/tests/custom_cluster/test_blacklist.py@62
PS3, Line 62:         (killed_impalad.hostname, 
killed_impalad.service.be_port), result.runtime_profile
> Does it make sense to also add a test case when the node is re-added back t
Added a test for restarting an impalad.

Agreed that we should have a test for transient failure. Its trickier since 
such a debug action doesn't exist yet. I was planning on doing that in a follow 
up patch, but I can include it in this patch if you prefer



--
To view, visit http://gerrit.cloudera.org:8080/13868
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74
Gerrit-Change-Number: 13868
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com>
Gerrit-Comment-Date: Thu, 25 Jul 2019 22:35:33 +0000
Gerrit-HasComments: Yes

Reply via email to