Sahil Takiar has posted comments on this change. ( http://gerrit.cloudera.org:8080/14104 )
Change subject: IMPALA-8803: Coordinator should release admitted memory per-backend ...................................................................... Patch Set 7: (15 comments) http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-resource-state.cc File be/src/runtime/coordinator-backend-resource-state.cc: http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-resource-state.cc@81 PS7, Line 81: num_pending_releasable_or_released_ > Would it be simpler to track the negation (i.e. num_in_use_) instead ? We o Done http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-resource-state.cc@95 PS7, Line 95: is_coordinator_the_last_unreleased_backend > If I understand it correctly, there is no adverse effect for this heuristic Not sure I follow. Can a query run without a coordinator fragment? If there is no coordinator fragment, then this heuristic will always be false since there is no BackendState where 'is_coord_backend' is true. Although this won't affect the other heuristics. Worse case, that just means releasing admission control resources is delayed until the query is closed, which is what the current code does anyway. http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-resource-state.cc@106 PS7, Line 106: } > DCHECK_GE(num_pending_, 0); Done http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-resource-state.cc@117 PS7, Line 117: DCHECK( > DCHECK_NE Done http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-state.h File be/src/runtime/coordinator-backend-state.h: http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-state.h@457 PS7, Line 457: PENDING state > or RELESABLE ? I've revised the method description to make things a bit clearer. http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-state.h@460 PS7, Line 460: /// released. > A no-op if the BackendResourceState is closed already. Done http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-state.h@465 PS7, Line 465: /// BackendStates have been released. > This may be called even after CloseAndGetUnreleasedBackends(). Yes, revised the method comment. http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-state.h@509 PS7, Line 509: BackendResourceState is closed, > May help to clarify what may and may not happen after transitioning to the I've updated the method comments above, the class comment also has some notes about this. Let me know if you think those are sufficient, otherwise I can add some more. http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.h File be/src/scheduling/admission-controller.h: http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.h@851 PS7, Line 851: specificied > typo Done http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.h@852 PS7, Line 852: UpdateStatsForHost( > nit: This name is a bit too similar to UpdateHostStats() and they seem to s Done http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.cc File be/src/scheduling/admission-controller.cc: http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.cc@408 PS7, Line 408: PrintId(schedule.query_id())); > Should we add a continue; in this case ? Done http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.cc@406 PS7, Line 406: DCHECK(false) << strings::Substitute( : "Error: Cannot find host address $0 for query $1.", PrintThrift(host_addr), : PrintId(schedule.query_id())); > Shouldn't this be LOG(ERROR) or something in addition to DCHECK(false); ? Done http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.cc@1531 PS7, Line 1531: } > DCHECK_EQ(num_release_backends_.find(schedule->query_id()), num_release_bac Done http://gerrit.cloudera.org:8080/#/c/14104/7/tests/custom_cluster/test_executor_groups.py File tests/custom_cluster/test_executor_groups.py: http://gerrit.cloudera.org:8080/#/c/14104/7/tests/custom_cluster/test_executor_groups.py@236 PS7, Line 236: 3 > Why this change ? This test started failing because they assume a max number of queries can be run within an executor group and enforce this by setting -max_concurrent_queries. The issue is that they only set -max_concurrent_queries for executors, and set a much higher value for coordinators. That causes problems with this patch because backends might be released independently (e.g. executor backends might be released first, and then coordinator backends). This breaks the assertions in https://github.com/apache/impala/blob/master/tests/custom_cluster/test_auto_scaling.py#L173). Specifically, what happens is that because there 5 streams, queries start getting queued up. Then, executor Backends start getting released (even though the query itself is still running and the Coordinator Backend has not been released yet). Then a query gets de-queued and submitted because there are enough executor slots (since they have been released) and enough coordinator slots (since the value is set to 16). http://gerrit.cloudera.org:8080/#/c/14104/7/tests/custom_cluster/test_result_spooling.py File tests/custom_cluster/test_result_spooling.py: http://gerrit.cloudera.org:8080/#/c/14104/7/tests/custom_cluster/test_result_spooling.py@62 PS7, Line 62: self.wait_for_state(handle, self.client.QUERY_STATES['FINISHED'], timeout) > Is there a potential race here ? If I understand it correctly, we may reach Yeah there is, I added a sleep here. Once IMPALA-8825 is merged we can make this less racy because it adds metrics to count the number of spooled rows (if we had this we could loop until number of spooled rows = 2000). -- To view, visit http://gerrit.cloudera.org:8080/14104 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I88bb11e0ede7574568020e0277dd8ac8d2586dc9 Gerrit-Change-Number: 14104 Gerrit-PatchSet: 7 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Bikramjeet Vig <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Sahil Takiar <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Thu, 05 Sep 2019 20:28:27 +0000 Gerrit-HasComments: Yes
