Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16369 )
Change subject: IMPALA-9636: Don't run retried query on the blacklisted nodes ...................................................................... Patch Set 4: (5 comments) http://gerrit.cloudera.org:8080/#/c/16369/8/be/src/runtime/query-driver.cc File be/src/runtime/query-driver.cc: http://gerrit.cloudera.org:8080/#/c/16369/8/be/src/runtime/query-driver.cc@128 PS8, Line 128: // set blacklisted_backend_ids for client_request_state of the original query. : if (blacklisted_backend_ids != nullptr && !blacklisted_backend_ids->empty()) { : client_request_state_->SetBlacklistedBackendIds(*blacklisted_backend_ids); : } : : // Triggering a retry from the INITIALIZED phase is possible: the : / instead of passing blacklisted_executor_addresses into this method, it seems cleaner to just call 'parent_request_state->SetBlacklistedExecutorAddresses()' inside coordinator.cc - in fact, you can replace SetBlacklistedExecutorAddresses with AddBlacklistedExecutorAddresses that way you don't have to built up the unordered_set of addresses. http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/admission-controller.cc File be/src/scheduling/admission-controller.cc: http://gerrit.cloudera.org:8080/#/c/16369/4/be/src/scheduling/admission-controller.cc@1572 PS4, Line 1572: chedule_executor_group != executor_group) > Done right now, if a query is retried and all executors are blacklisted, the query will get queued by the admission controller and will eventually fail with the error: "Admission for query exceeded timeout 60000ms in pool default-pool. Queued reason: Waiting for executors to start. Only DDL queries and queries scheduled only on the coordinator (either NUM_NODES set to 1 or when small query optimization is triggered) can currently run. Additional Details: Not Applicable" if the blacklisting here removes all executors then the query will fail with "Cannot schedule query: no registered backends available", correct? is it possible to keep this consistent with the other blacklisting logic and print out "Admission for query exceeded timeout 60000ms in pool default-pool ...". http://gerrit.cloudera.org:8080/#/c/16369/8/be/src/scheduling/admission-controller.cc File be/src/scheduling/admission-controller.cc: http://gerrit.cloudera.org:8080/#/c/16369/8/be/src/scheduling/admission-controller.cc@1120 PS8, Line 1120: if (!request.blacklisted_backend_ids.empty()) { : // This debug_action is to allow delay to be injected for the admission of a retried : // query, which is triggered by blacklisted node, before getting membership snapshot. : DebugActionNoFail(FLAGS_debug_actions, "ADMISSION_DELAY_FOR_RETRIED_QUERY"); : } can you use CRS_BEFORE_ADMISSION instead? http://gerrit.cloudera.org:8080/#/c/16369/8/be/src/scheduling/scheduler.cc File be/src/scheduling/scheduler.cc: http://gerrit.cloudera.org:8080/#/c/16369/8/be/src/scheduling/scheduler.cc@163 PS8, Line 163: DCHECK(executor_config.group.NumExecutors() > 0 || exec_at_coord); : : const TPlanNode& node = stat not sure I fully understand why the DCHECK needs to be removed, could you expand on this a bit please? http://gerrit.cloudera.org:8080/#/c/16369/8/tests/custom_cluster/test_query_retries.py File tests/custom_cluster/test_query_retries.py: http://gerrit.cloudera.org:8080/#/c/16369/8/tests/custom_cluster/test_query_retries.py@287 PS8, Line 287: statestored_args="--statestore_heartbeat nice! I think this actually solves IMPALA-9227 as well. -- To view, visit http://gerrit.cloudera.org:8080/16369 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I00bc1b5026efbd0670ffbe57bcebc457d34cb105 Gerrit-Change-Number: 16369 Gerrit-PatchSet: 4 Gerrit-Owner: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Comment-Date: Tue, 08 Sep 2020 17:56:21 +0000 Gerrit-HasComments: Yes
