Bikramjeet Vig has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17332 )

Change subject: IMPALA-9976 IMPALA-10866: Add recovery mechanism to admission 
service and fix consistency between coord failure detection and registration
......................................................................


Patch Set 2:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG@9
PS1, Line 9:
> I wrote my own version of the commit message to see that I understand the c
Pretty much. The only thing I would add is that if a backend is marked as down 
by the statestore, it will also have to register(send full admission state) 
again with the admissiond to be able to be serviced


http://gerrit.cloudera.org:8080/#/c/17332/1//COMMIT_MSG@12
PS1, Line 12: - Leverages the admission heartbeat mechanism to signal the
> Nit" should this be "No RPCs are serviced from a coordinator until it has s
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-client.cc
File be/src/scheduling/admission-control-client.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-client.cc@32
PS1, Line 32:     "Re-submit for admission due to a possible admission service 
restart or network "
> Is there a reason it is only a "possible" restart?
as this can also be due to a generic network error that prevents the Admit RPC 
to go through.
Added this to the error msg. Open to suggestion for a better message text


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc
File be/src/scheduling/admission-control-service.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@304
PS1, Line 304:     LOG(INFO) << "Received heartbeat from unrecognized 
coord_id=" << req->coord_id();
> Maybe WARNING is too high, we do expect to see this after a restart, maybe
Just a log line at the INFO level should suffice in that case


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@332
PS1, Line 332:   bool registered_on_ac_service =
> Is it possible that the admissionstate we have just received is more up-to-
I have made both the RebuildAdmissionState and  
CancelQueriesOnFailedCoordinators atomic operations so there should be no 
inconsistency anymore. Also if the coord is already registered it would return 
an OK status. In that case as well there should be no inconsistency as all RPCs 
after the first successful call to RebuildAdmissionState would be serviced and 
ideally any subsequent calls for RebuildAdmissionState should just be previous 
retries.


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-control-service.cc@449
PS1, Line 449:     lock_guard<mutex> l(admission_state->lock);
> The size of known_coord_ids_ is interesting, maybe add it to the message.
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-controller.cc
File be/src/scheduling/admission-controller.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-controller.cc@2103
PS1, Line 2103:   DCHECK(
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/admission-controller.cc@2104
PS1, Line 2104:       num_backends_to_release_.find(state->query_id()) == 
num_backends_to_release_.end());
> line too long (92 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.h
File be/src/scheduling/remote-admission-control-client.h:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.h@54
PS1, Line 54:   /// TODO: add info on what this does? here or in the class 
comments
> Yes it seems like these methods don't have descriptions
will add these in the next update


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc
File be/src/scheduling/remote-admission-control-client.cc:

http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc@131
PS1, Line 131:         // retry_admission can be false if 
AttemptAdmissionAndWait succeeded but a
> I think this was already set to true?
This might not be set to true if the RPC succeeded but the impala-server 
initiated a ResetPendingAdmit. This can happen in the following case:
- the RPCs (both admit and getQueryStatus) succeeded on a previous instance of 
the admissiond
- The coordinator re-registered with an ongoing (old or a new restarted) 
instance of the admissiond.


http://gerrit.cloudera.org:8080/#/c/17332/1/be/src/scheduling/remote-admission-control-client.cc@158
PS1, Line 158:   while (admit_rpc_status.IsNetworkError()
> Nit: "admissiond"
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/common/protobuf/admission_control_service.proto
File common/protobuf/admission_control_service.proto:

http://gerrit.cloudera.org:8080/#/c/17332/1/common/protobuf/admission_control_service.proto@280
PS1, Line 280:   /// quickly, eg. if the query is rejected. We should evaluate 
the benefits of saving a
> Move new comment to before the TODO to clarify that it is not part of the T
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/common/protobuf/admission_control_service.proto@301
PS1, Line 301:   /// admissiond started. TODO: we can save an rpc if we combine 
the release of the final
> Move new comment to before the TODO to clarify that it is not part of the T
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py
File tests/custom_cluster/test_admission_controller.py:

http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1578
PS1, Line 1578: :
> flake8: E231 missing whitespace after ':'
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1585
PS1, Line 1585:
> flake8: E226 missing whitespace around arithmetic operator
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1589
PS1, Line 1589:
> flake8: E226 missing whitespace around arithmetic operator
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1606
PS1, Line 1606:     # that are updated immediately are compared.
> "and admission changes" is slightly unclear, maybe "is changed by admission
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1614
PS1, Line 1614:
> flake8: E501 line too long (99 > 90 characters)
Done


http://gerrit.cloudera.org:8080/#/c/17332/1/tests/custom_cluster/test_admission_controller.py@1626
PS1, Line 1626:
> flake8: E501 line too long (97 > 90 characters)
Done



--
To view, visit http://gerrit.cloudera.org:8080/17332
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8ad3ef9b9e2496c484833d6326ce914c851e02fd
Gerrit-Change-Number: 17332
Gerrit-PatchSet: 2
Gerrit-Owner: Bikramjeet Vig <[email protected]>
Gerrit-Reviewer: Andrew Sherman <[email protected]>
Gerrit-Reviewer: Bikramjeet Vig <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Comment-Date: Tue, 31 Aug 2021 21:48:27 +0000
Gerrit-HasComments: Yes

Reply via email to