Wenzhe Zhou has posted comments on this change. ( http://gerrit.cloudera.org:8080/16369 )
Change subject: IMPALA-9636: Don't run retried query on the blacklisted nodes ...................................................................... Patch Set 5: (19 comments) http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/coordinator.h File be/src/runtime/coordinator.h: http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/coordinator.h@615 PS4, Line 615: std::unordered_set<NetworkAddr > Use of & would be better. fixed http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/coordinator.h@620 PS4, Line 620: t > Same comment. fixed http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/coordinator.cc File be/src/runtime/coordinator.cc: http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/coordinator.cc@1043 PS4, Line 1043: > Use of & here is better. fixed http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/coordinator.cc@1114 PS4, Line 1114: R > Same comment. Use reference. fixed http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/query-driver.cc@128 PS4, Line 128: > nit: Set fixed http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/query-driver.cc@241 PS4, Line 241: trin > Maybe use the word 'Transfer'. fixed http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/runtime/query-driver.cc@241 PS4, Line 241: > query to be retried? fixed http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/admission-controller.cc File be/src/scheduling/admission-controller.cc: http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/admission-controller.cc@1572 PS4, Line 1572: iltered_executor_group != executor_group) > If all nodes are blacklisted, then there is no way to run the query again. We should not handle the case with number of executor as 0 here since a query could be executed if it could be executed on coordinator only. If there is no available executor and it could not be executed on coordinator,,the function schedule() called in line 1570 return error. In the case, we should also release the temporary executor group. Fixed it by removing RETURN_IF_ERROR when calling schedule(). http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/admission-controller.cc@1582 PS4, Line 1582: > Not sure why the temporary executor group is destroyed here. Are we suppose It's not used after making schedule so release here. http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/executor-group.h File be/src/scheduling/executor-group.h: http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/executor-group.h@54 PS4, Line 54: except > nit: maybe reword as 'excluding' fixed http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/executor-group.cc File be/src/scheduling/executor-group.cc: http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/executor-group.cc@42 PS4, Line 42: find_blacklisted_executor > nit: reword as blacklisted_executor_found fixed http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/executor-group.cc@46 PS4, Line 46: k; : } > Use std::ordered_set::contains() or std::ordered_set::count() probably woul Code was changed in patch 5 so we will not search in blacklisted_backend_ids. http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/executor-group.cc@60 PS4, Line 60: > same comment as above. code was changed in patch 5. http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/service/client-request-state.h File be/src/service/client-request-state.h: http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/service/client-request-state.h@396 PS4, Line 396: > Use reference &. fixed http://gerrit.cloudera.org:8080/#/c/16369/4/tests/custom_cluster/test_query_retries.py File tests/custom_cluster/test_query_retries.py: http://gerrit.cloudera.org:8080/#/c/16369/4/tests/custom_cluster/test_query_retries.py@37 PS4, Line 37: the tests simulate rpc errors at > nit: reword as "to simulate rpc errors in tests". fixed http://gerrit.cloudera.org:8080/#/c/16369/4/tests/custom_cluster/test_query_retries.py@291 PS4, Line 291: 3-rd > nit: 3rd. fixed http://gerrit.cloudera.org:8080/#/c/16369/4/tests/custom_cluster/test_query_retries.py@294 PS4, Line 294: > nit: assigned to any fragment instances fixed http://gerrit.cloudera.org:8080/#/c/16369/4/tests/custom_cluster/test_query_retries.py@302 PS4, Line 302: Add admission delay > I wonder why a delay is added here. For this test, we have to simulate blacklist timeout. By adding delay here with debug_action, the 3rd node will be removed from blacklist after timeout and added back to executor group before we get the snapshot of cluster membership. http://gerrit.cloudera.org:8080/#/c/16369/4/tests/custom_cluster/test_query_retries.py@332 PS4, Line 332: > Maybe add two more test cases: We cannot simulate all nodes are blacklisted due to RPC failure since coordinator itself is an executor. The case for two nodes are blacklisted does not add additional code coverage. The case for empty executor group is already covered by scheduler unit-test. -- To view, visit http://gerrit.cloudera.org:8080/16369 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I00bc1b5026efbd0670ffbe57bcebc457d34cb105 Gerrit-Change-Number: 16369 Gerrit-PatchSet: 5 Gerrit-Owner: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Comment-Date: Thu, 03 Sep 2020 18:42:14 +0000 Gerrit-HasComments: Yes
