Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/10060 )
Change subject: IMPALA-5216: Make admission control queuing async ...................................................................... Patch Set 7: (27 comments) I need to dig into the details some more but some initial comments/questions. http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/scheduling/admission-controller.h File be/src/scheduling/admission-controller.h: http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/scheduling/admission-controller.h@44 PS7, Line 44: ClientRequestState::admit_outcome_ if this enum is meant to be consumed publicly, then it's best to not refer to this private variable when explaining it. let's try to explain it in terms of the public portion of the AdmissionController abstraction. http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/scheduling/admission-controller.h@201 PS7, Line 201: admit_status admit_outcome? http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/scheduling/admission-controller.h@204 PS7, Line 204: admit_status admit_outcome http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/client-request-state.h File be/src/service/client-request-state.h: http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/client-request-state.h@72 PS7, Line 72: /// Must *not* be called with lock_ held. given that the coordinator object is (unfortunately) current exposed by this class, let's document when, and in which cases, that gets created. http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/client-request-state.h@114 PS7, Line 114: non-error states (RUNNING_STATE and FINISHED_STATE) is PENDING_STATE one of those? http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/client-request-state.h@152 PS7, Line 152: coord_ let's try not to talk about private members when documenting the public interface (so that we can try to make sense of the abstractions without knowing all the implementation details). http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/client-request-state.h@172 PS7, Line 172: Coordinator* coord() const { : return coord_exec_started_.Load() ? coord_.get() : nullptr; : } it's confusing that coord() is not actually an accessor given it's name. If we really need this then we should name it GetCoord() or something. But, it seems weird that we need this. We really need to cleanup the CRS states, but in the mean time, why can't we just defer setting coord_ until after coord->Exec() returns (and maybe hold the lock while setting coord_ if that makes the synchronization simpler)? http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/client-request-state.h@323 PS7, Line 323: Exec() has been successfully called on it why do we bother creating the coordinator object ahead of when we call Exec()? i.e. what use is it before Exec() is called? http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-hs2-server.cc File be/src/service/impala-hs2-server.cc: http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-hs2-server.cc@448 PS7, Line 448: DCHECK(status.ok() || !request_state->uses_admission_control()); why is that the case, and why does this code care? http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-hs2-server.cc@464 PS7, Line 464: if (!status.ok()) goto return_error; if we hit an error here, how does the WaitAsync thread finish? http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-http-handler.cc File be/src/service/impala-http-handler.cc: http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-http-handler.cc@769 PS7, Line 769: == nullptr not your change, and let's not fix it here, but given that the coord_ ptr is set outside of holding the lock, these checks aren't really technically safe. This is another example of why we should tighten up the specification of the CRS states. http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-http-handler.cc@769 PS7, Line 769: uses_admission_control( should that check for PENDING_STATE instead, to be more specific? http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-http-handler.cc@778 PS7, Line 778: request_state->coord() != nullptr this would also be more about the states, but okay to leave it for now. http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-http-handler.cc@779 PS7, Line 779: request_state->coord()->GetTExecSummary(&summary); I guess PrintExecSummary() is okay with an empty summary? have you tested that? I think previously the summary would at least be initialized (though given that the CRS lock isn't held across setting the coord_ ptr and calling Exec(), there was a small window previously that could also lead to this empty case). http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-server.h File be/src/service/impala-server.h: http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-server.h@432 PS7, Line 432: It represents the set of queries that : /// have either started execution or are queued and can be cancelled. if this is accurate, maybe say instead: It represents the set of queries that can be cancelled: either the query is queued or has started executing. http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-server.h@434 PS7, Line 434: submitted_queries alternatively, if we clean up the state machine and how cancel works, it seems like we might be able to get rid of this altogether? not for this commit though. http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-server.h@499 PS7, Line 499: changes when the query : /// is set in-flight. preexisting, but I'm not really sure what this is saying. http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-server.cc@a910 PS7, Line 910: is it the right thing to delay this? by delaying it it means that that the cluster membership topic update won't cancel queries that are queued to run against these hosts. I guess I can see arguments either way, but it seems safer to keep the behavior of cancelling them proactively, no? http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-server.cc@663 PS7, Line 663: return Status::OK(); What does this look like in the various places that get the summary? Should it instead return a status message saying the query is current queued in admission control? http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/service/impala-server.cc@875 PS7, Line 875: if (FLAGS_stress_metadata_loading_pause_injection_ms > 0) { : SleepForMs(FLAGS_stress_metadata_loading_pause_injection_ms); : } this should probably use your new debug action thingy, but okay to do that in a follow on commit. http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/util/debug-util.h File be/src/util/debug-util.h: http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/util/debug-util.h@112 PS7, Line 112: "sleep_option" and then you could change: "sleep_option" input variable matches to just: 'sleep_option' matches http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/util/debug-util.h@112 PS7, Line 112: sleep_option:sleep_time_ms since we usually use single quotes to document function args, this might be clearer as: ... <sleep_option>:<sleep_time_ms> where <sleep_option> is a string and <sleep_time_ms> is an integer ... http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/util/debug-util.h@114 PS7, Line 114: 'sleep_time_ms then also make this: <sleep_time_ms> http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/util/debug-util.h@115 PS7, Line 115: string const string& http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/util/promise.h File be/src/util/promise.h: http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/util/promise.h@73 PS7, Line 73: ( nit: missing space http://gerrit.cloudera.org:8080/#/c/10060/7/be/src/util/promise.h@77 PS7, Line 77: return val_; given the subtlty of the lock as commented above, maybe combine these into a SetWork<ALLOW_MULTIPLE> where you just put a DCHECK(ALLOW_MULTIPLE) in the if-branch. I don't feel too strongly about it though. http://gerrit.cloudera.org:8080/#/c/10060/7/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/10060/7/common/thrift/ImpalaService.thrift@90 PS7, Line 90: "sleep_option:sleep_time_ms", to keep with the a consistent syntax as #1, I think this would be: "<sleep option>:<sleep time ms>" -- To view, visit http://gerrit.cloudera.org:8080/10060 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I989cf5b259afb8f5bc5c35590c94961c81ce88bf Gerrit-Change-Number: 10060 Gerrit-PatchSet: 7 Gerrit-Owner: Bikramjeet Vig <[email protected]> Gerrit-Reviewer: Bikramjeet Vig <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Wed, 09 May 2018 23:34:00 +0000 Gerrit-HasComments: Yes
