Wenzhe Zhou has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16369 )

Change subject: IMPALA-9636: Don't run retried query on the blacklisted nodes
......................................................................


Patch Set 5:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/coordinator.h
File be/src/runtime/coordinator.h:

http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/coordinator.h@615
PS4, Line 615: std::unordered_set<NetworkAddr
> Use of & would be better.
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/coordinator.h@620
PS4, Line 620: t
> Same comment.
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/coordinator.cc
File be/src/runtime/coordinator.cc:

http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/coordinator.cc@1043
PS4, Line 1043:
> Use of & here is better.
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/coordinator.cc@1114
PS4, Line 1114: R
> Same comment. Use reference.
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/query-driver.cc
File be/src/runtime/query-driver.cc:

http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/query-driver.cc@128
PS4, Line 128:
> nit: Set
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/query-driver.cc@241
PS4, Line 241: trin
> Maybe use the word 'Transfer'.
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/query-driver.cc@241
PS4, Line 241:
> query to be retried?
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/admission-controller.cc
File be/src/scheduling/admission-controller.cc:

http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/admission-controller.cc@1572
PS4, Line 1572: iltered_executor_group != executor_group)
> If all nodes are blacklisted, then there is no way to run the query again.
We should not handle the case with number of executor as 0 here since a query 
could be executed if it could be executed on coordinator only.
If there is no available executor and it could not be executed on 
coordinator,,the function schedule() called in line 1570 return error. In the 
case, we should also release the temporary executor group. Fixed it by removing 
RETURN_IF_ERROR when calling schedule().


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/admission-controller.cc@1582
PS4, Line 1582:
> Not sure why the temporary executor group is destroyed here. Are we suppose
It's not used after making schedule so release here.


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/executor-group.h
File be/src/scheduling/executor-group.h:

http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/executor-group.h@54
PS4, Line 54: except
> nit: maybe reword as 'excluding'
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/executor-group.cc
File be/src/scheduling/executor-group.cc:

http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/executor-group.cc@42
PS4, Line 42: find_blacklisted_executor
> nit: reword as blacklisted_executor_found
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/executor-group.cc@46
PS4, Line 46: k;
            :     }
> Use std::ordered_set::contains() or std::ordered_set::count() probably woul
Code was changed in patch 5 so we will not search in blacklisted_backend_ids.


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/executor-group.cc@60
PS4, Line 60:
> same comment as above.
code was changed in patch 5.


http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/service/client-request-state.h
File be/src/service/client-request-state.h:

http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/service/client-request-state.h@396
PS4, Line 396:
> Use reference &.
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/tests/custom_cluster/test_query_retries.py
File tests/custom_cluster/test_query_retries.py:

http://gerrit.cloudera.org:8080/#/c/16369/4/tests/custom_cluster/test_query_retries.py@37
PS4, Line 37: the tests simulate rpc errors at
> nit: reword as "to simulate rpc errors in tests".
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/tests/custom_cluster/test_query_retries.py@291
PS4, Line 291: 3-rd
> nit: 3rd.
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/tests/custom_cluster/test_query_retries.py@294
PS4, Line 294:  
> nit: assigned to any fragment instances
fixed


http://gerrit.cloudera.org:8080/#/c/16369/4/tests/custom_cluster/test_query_retries.py@302
PS4, Line 302: Add admission delay
> I wonder why a delay is added here.
For this test, we have to simulate blacklist timeout. By adding delay here with 
debug_action, the 3rd node will be removed from blacklist after timeout and 
added back to executor group before we get the snapshot of cluster membership.


http://gerrit.cloudera.org:8080/#/c/16369/4/tests/custom_cluster/test_query_retries.py@332
PS4, Line 332:
> Maybe add two more test cases:
We cannot simulate all nodes are blacklisted due to RPC failure since 
coordinator itself is an executor. The case for two nodes are blacklisted does 
not add additional code coverage. The case for empty executor group is already 
covered by scheduler unit-test.



--
To view, visit http://gerrit.cloudera.org:8080/16369
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I00bc1b5026efbd0670ffbe57bcebc457d34cb105
Gerrit-Change-Number: 16369
Gerrit-PatchSet: 5
Gerrit-Owner: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Comment-Date: Thu, 03 Sep 2020 18:42:14 +0000
Gerrit-HasComments: Yes

Reply via email to