Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13550 )

Change subject: IMPALA-8484: Run queries on disjoint executor groups
......................................................................


Patch Set 9:

(22 comments)

I got through the core of the admission controller changes. Pushing out a batch 
of comments

http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/runtime/exec-env.h
File be/src/runtime/exec-env.h:

http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/runtime/exec-env.h@268
PS9, Line 268: admit_num_limit_
Maybe admit_num_queries_limit? Probably a matter of taste but it would be more 
explicit what the number is of.


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/runtime/exec-env.cc
File be/src/runtime/exec-env.cc:

http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/runtime/exec-env.cc@289
PS9, Line 289: 8
Consider making it a constant up the top of the file.


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.h
File be/src/scheduling/admission-controller.h:

http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.h@158
PS9, Line 158: but
             : /// multiple groups can run queries of the same pool
I found this a little tricky to parse. Here's one alternative phrasing.

  A resource pool can have multiple executor groups associated with it. Each 
executor group belongs to a single resource pool and will only serve requests 
from that pool. I.e. the relationship is 1 resource pool : many executor groups.


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.h@283
PS9, Line 283:   struct AdmissionRequest {
Can you mention the lifetime/ownership of the objects referred to here. Assume 
this are allow owned directly or indirectly by CLietnRequestState?


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.h@592
PS9, Line 592:     const ExecutorGroup* executor_group;
Can this be NULL? Maybe should be a reference if non-nullable. Or if nullable 
that is worth documenting in a comment. Maybe this will become clear later on.


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.h@595
PS9, Line 595:   typedef std::vector<GroupSchedule> GroupSchedules;
This is mostly a matter of taste, but I have a bit of a bias towards just using 
the full type everywhere so there's less indirection.


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.h@628
PS9, Line 628:     /// not be valid once the request has been admitted, 
cancelled, or rejected.
nit: I found it a little tricky to see the delimitation of the groups of 
members. I found the below style of delimination used in some ExecNodes to be 
useful in some cases.

  /////////////////////////////////////////
  /// BEGIN: Members that must be Reset()

  ...
  /// END: Members that must be Reset()
  /////////////////////////////////////////


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.h@700
PS9, Line 700:   /// 'membership_snapshot' has changed.
Document in what cases it might fail?


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.h@707
PS9, Line 707:   /// false and sets queue_node->admitted_schedule if the query 
can be admitted. Returns
I found the output boolean a little confusing. I don't know if switching the 
polarity helps (i.e. false means reject, true means admit or queue). I thought 
about suggesting using an enum for the output, but that may be overcomplicating 
things.


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.h@752
PS9, Line 752: unavailable_reason
'unavailable_reason' for consistency.


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.cc
File be/src/scheduling/admission-controller.cc:

http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.cc@509
PS9, Line 509:     if (num_admitted + 1 > admit_num_limit) {
Maybe: num_admitted >= admin_num_limit


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.cc@523
PS9, Line 523:   //  (b) The executor group is already at the maximum number of 
requests.
This should be "One of the executors is already at its maximum number of 
requests", I think


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.cc@561
PS9, Line 561: orth it?
Probably not?


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.cc@738
PS9, Line 738:     return Status("AC: Local backend has not yet started");
Is this reachable? I spent a couple of minutes trying to understand how we'd 
get into this state but couldn't figure it out - I'm assuming it would have to 
be if the server had started but not received the initial cluster membership? 
Since it looks like local_be_desc is set in UpdateMembership(). Although I'd 
think in that case that we'd want to queue the query instead of failing it.


Can we test this code path somehow if it's reachable?


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.cc@747
PS9, Line 747:       
request_pool_service_->ResolveRequestPool(request.request.query_ctx, 
&pool_name));
I think the comment in scheduler.cc was useful. I didn't figure out the point 
of this until I looked at the old comment:

  // Re-resolve the pool name to propagate any resolution errors now that this 
request
  // is known to require a valid pool.


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.cc@752
PS9, Line 752:   
RETURN_IF_ERROR(request_pool_service_->GetPoolConfig(pool_name, &pool_cfg));
I think we coudl factor out the ResolveRequestPool()/GetPoolConfig() logic into 
a cohesive helper function to reduce the noise in this function


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.cc@1096
PS9, Line 1096:     // TODO: Add schedule c'tor for AdmissionRequest?
Not a bad idea, but might not be worth adding the circular dependency between 
scheduler/admission controller?


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.cc@1162
PS9, Line 1162:     if (RejectForSchedule(
The behaviour is documented in the method comment "If the query is unable to 
run on
  /// any of the groups irrespective of their current workload, it is 
rejected." but it might be worth re-iterating here to explain the early exit, 
e.g.

  // Query is rejected if the rejection check fails on *any* group.


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.cc@1235
PS9, Line 1235:     if (membership_snapshot->executor_groups.empty()) continue;
Add a brief comment to explain what case this handles?


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.cc@1255
PS9, Line 1255:       while (max_to_dequeue > 0 && !queue.empty()) {
Thanks for the effort in making this loop easy to understand.


http://gerrit.cloudera.org:8080/#/c/13550/9/be/src/scheduling/admission-controller.cc@1704
PS9, Line 1704: std::
nit: no std::


http://gerrit.cloudera.org:8080/#/c/13550/9/www/backends.tmpl
File www/backends.tmpl:

http://gerrit.cloudera.org:8080/#/c/13550/9/www/backends.tmpl@34
PS9, Line 34:       <th>Query Limit for Admission</th>
Maybe "Num. Queries". I think it could be misread otherwise.



--
To view, visit http://gerrit.cloudera.org:8080/13550
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8a1d0900f2a82bd2fc0a906cc094e442cffa189b
Gerrit-Change-Number: 13550
Gerrit-PatchSet: 9
Gerrit-Owner: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <asher...@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Jul 2019 00:12:53 +0000
Gerrit-HasComments: Yes

Reply via email to