Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/17332 )
Change subject: IMPALA-9155: Add recovery mechanism to admission service ...................................................................... Patch Set 1: (10 comments) Some more comments http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG@9 PS1, Line 9: Major changes: I wrote my own version of the commit message to see that I understand the change. Is this correct? Feel free to use this All heartbeats from the coordinator already contain a unique id for the coordinator. The admissiond uses this unique id to track the coordinators that have sent their full admission state. When the admissiond starts up (in particular when it restarts) it will not not service any requests from coordinators until it has received the full admission state of that coordinator. The trigger for coordinators to resend their full admission state is a new flag set in replies to the AdmissionHeartbeat messages that every coordinator periodically sends to the admissiond. If the new flag is set in the reply that the coordinator receives then the full admission state is sent to the admissiond. Note that when the admissiond restarts it will service requests from any coordinator that has sent its full admission state. This means that over-admission is possible during the period when not every coordinator has sent its full admission state. http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-client.cc File be/src/scheduling/admission-control-client.cc: http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-client.cc@32 PS1, Line 32: "Re-submit for admission due to a possible admission service restart"; Is there a reason it is only a "possible" restart? http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc File be/src/scheduling/admission-control-service.cc: http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@304 PS1, Line 304: LOG(WARNING) << "Received heartbeat from unrecognized coord_id=" << req->coord_id(); Maybe WARNING is too high, we do expect to see this after a restart, maybe just make the message say that? http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@332 PS1, Line 332: LOG(INFO) << "Query id already exists. Can be lingering from old coord_id=" Is it possible that the admissionstate we have just received is more up-to-date than that in admission_state_map_ ? http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@449 PS1, Line 449: LOG(INFO) << "Adding to list of known coordinators, coord_id=" << coord_id; The size of known_coord_ids_ is interesting, maybe add it to the message. http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.h File be/src/scheduling/remote-admission-control-client.h: http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.h@54 PS1, Line 54: /// TODO: add info on what this does? here or in the class comments Yes it seems like these methods don't have descriptions http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc File be/src/scheduling/remote-admission-control-client.cc: http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc@131 PS1, Line 131: retry_admission = true; I think this was already set to true? http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc@158 PS1, Line 158: //is network error or admisiond did not recognize this coord. Nit: "admissiond" http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py File tests/custom_cluster/test_admission_controller.py: http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1604 PS1, Line 1604: def __compare_admission_json(self, json1, json2): This is a clever way to test this change. http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1606 PS1, Line 1606: # that are updated immediately and admission changes are compared. "and admission changes" is slightly unclear, maybe "is changed by admissiond" -- To view, visit http://gerrit.cloudera.org:8080/17332 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd Gerrit-Change-Number: 17332 Gerrit-PatchSet: 1 Gerrit-Owner: Bikramjeet Vig <[email protected]> Gerrit-Reviewer: Andrew Sherman <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Comment-Date: Fri, 28 May 2021 19:46:41 +0000 Gerrit-HasComments: Yes
