Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/13868 )
Change subject: IMPALA-8339: Add local executor blacklist to coordinators ...................................................................... Patch Set 4: (19 comments) http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/admission-controller.cc File be/src/scheduling/admission-controller.cc: http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/admission-controller.cc@737 PS3, Line 737: string > nit: std:: not needed Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc File be/src/scheduling/cluster-membership-mgr.cc: http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@272 PS3, Line 272: already be on the blacklist or probation. > This may be a case worth double checking. If a node is restarted and re-reg Yes, this DCHECK is correct - if a node is restarted, when it re-registers with the statestore the old subscriber is always unregistered first, see: https://github.com/apache/impala/blob/master/be/src/statestore/statestore.cc#L623 This is also required by the loop above, which will fail if AddExecutor() is called with an existing executor. http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@273 PS3, Line 273: DCHECK > DCHECK_EQ Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@350 PS3, Line 350: exists > Given the for-loop above, can an executor belong to more than one executor Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/cluster-membership-mgr.cc@355 PS3, Line 355: exists = true; > This is repeated at multiple places in this function. May make sense to fac Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h File be/src/scheduling/executor-blacklist.h: http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@32 PS3, Line 32: > FindAndRemove() Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@38 PS3, Line 38: When a blacklisted executor has passed this timeout : /// and Maintenance() is called, the executor is put on 'probation' > May help to also document what if an executor on blacklist is removed durin Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@42 PS3, Line 42: There > nit: typo Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@79 PS3, Line 79: > May be clearer to call it "probation_list" or something so it can relates t Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.h@124 PS3, Line 124: > typo Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc File be/src/scheduling/executor-blacklist.cc: http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@29 PS3, Line 29: 0 disables blacklisting > This seems a bit weird. Setting it to 0 means the blacklist timeout == stat My problem with setting it to 120 is that it makes it possible for users to set it to < 100, which doesn't seem like it would ever be the right thing to do (eg. for executors that go down, it would make it very likely they would be incorrectly unblacklisted). Maybe it would be better to just add another flag, blacklisting_enabled or similar. I was trying to minimize the number of flags, but I guess it doesn't really matter. I don't think we'll be formally documenting these or expecting customers to commonly set them, they're just intended as emergency safety valves. Or we could do something like disable blacklisting if this is negative. http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@39 PS3, Line 39: ExecutorBlacklist::Blacklist(const TBackendDescriptor& be_desc) { > Should this check if blacklisting is disabled at the beginning ? I put the check in ClusterMembershipMgr::Blacklist to avoid copying the Snapshot if blacklisting is disabled. I'll at least add a DCHECK here. http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@54 PS3, Line 54: nicMillis() > Why not MonotonicMillis() ? This field shouldn't need to be shared across m Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@83 PS3, Line 83: if (remove_it == be_descs.end()) { : // Executor wasn't on the blacklist. : return NOT_BLACKLISTED; : } : St > This seems to be used at more than one functions. May be worth factoring it Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@99 PS3, Line 99: for (auto entry_it : e > Is remove_it still valid after calling erase above ? Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@110 PS3, Line 110: > put on probation Done http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@138 PS3, Line 138: > remove it from the blacklist. I think this comment is correct? This branch is handling things that were on probation http://gerrit.cloudera.org:8080/#/c/13868/3/be/src/scheduling/executor-blacklist.cc@184 PS3, Line 184: > "put on probation" ? Done http://gerrit.cloudera.org:8080/#/c/13868/3/tests/custom_cluster/test_blacklist.py File tests/custom_cluster/test_blacklist.py: http://gerrit.cloudera.org:8080/#/c/13868/3/tests/custom_cluster/test_blacklist.py@62 PS3, Line 62: (killed_impalad.hostname, killed_impalad.service.be_port), result.runtime_profile > Does it make sense to also add a test case when the node is re-added back t Added a test for restarting an impalad. Agreed that we should have a test for transient failure. Its trickier since such a debug action doesn't exist yet. I was planning on doing that in a follow up patch, but I can include it in this patch if you prefer -- To view, visit http://gerrit.cloudera.org:8080/13868 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iacb6e73b84042c33cd475b82470a975d04ee9b74 Gerrit-Change-Number: 13868 Gerrit-PatchSet: 4 Gerrit-Owner: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Thomas Tauber-Marshall <tmarsh...@cloudera.com> Gerrit-Comment-Date: Thu, 25 Jul 2019 22:35:33 +0000 Gerrit-HasComments: Yes