Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14104 )

Change subject: IMPALA-8803: Coordinator should release admitted memory 
per-backend
......................................................................


Patch Set 7:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-resource-state.cc
File be/src/runtime/coordinator-backend-resource-state.cc:

http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-resource-state.cc@81
PS7, Line 81: num_pending_releasable_or_released_
> Would it be simpler to track the negation (i.e. num_in_use_) instead ? We o
Done


http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-resource-state.cc@95
PS7, Line 95: is_coordinator_the_last_unreleased_backend
> If I understand it correctly, there is no adverse effect for this heuristic
Not sure I follow. Can a query run without a coordinator fragment?

If there is no coordinator fragment, then this heuristic will always be false 
since there is no BackendState where 'is_coord_backend' is true. Although this 
won't affect the other heuristics. Worse case, that just means releasing 
admission control resources is delayed until the query is closed, which is what 
the current code does anyway.


http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-resource-state.cc@106
PS7, Line 106:       }
> DCHECK_GE(num_pending_, 0);
Done


http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-resource-state.cc@117
PS7, Line 117: DCHECK(
> DCHECK_NE
Done


http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-state.h@457
PS7, Line 457: PENDING state
> or RELESABLE ?
I've revised the method description to make things a bit clearer.


http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-state.h@460
PS7, Line 460:   /// released.
> A no-op if the BackendResourceState is closed already.
Done


http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-state.h@465
PS7, Line 465:   /// BackendStates have been released.
> This may be called even after CloseAndGetUnreleasedBackends().
Yes, revised the method comment.


http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/runtime/coordinator-backend-state.h@509
PS7, Line 509: BackendResourceState is closed,
> May help to clarify what may and may not happen after transitioning to the
I've updated the method comments above, the class comment also has some notes 
about this. Let me know if you think those are sufficient, otherwise I can add 
some more.


http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.h
File be/src/scheduling/admission-controller.h:

http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.h@851
PS7, Line 851: specificied
> typo
Done


http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.h@852
PS7, Line 852: UpdateStatsForHost(
> nit: This name is a bit too similar to UpdateHostStats() and they seem to s
Done


http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.cc
File be/src/scheduling/admission-controller.cc:

http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.cc@408
PS7, Line 408:           PrintId(schedule.query_id()));
> Should we add a continue; in this case ?
Done


http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.cc@406
PS7, Line 406:       DCHECK(false) << strings::Substitute(
             :           "Error: Cannot find host address $0 for query $1.", 
PrintThrift(host_addr),
             :           PrintId(schedule.query_id()));
> Shouldn't this be LOG(ERROR) or something in addition to DCHECK(false); ?
Done


http://gerrit.cloudera.org:8080/#/c/14104/7/be/src/scheduling/admission-controller.cc@1531
PS7, Line 1531:   }
> DCHECK_EQ(num_release_backends_.find(schedule->query_id()), num_release_bac
Done


http://gerrit.cloudera.org:8080/#/c/14104/7/tests/custom_cluster/test_executor_groups.py
File tests/custom_cluster/test_executor_groups.py:

http://gerrit.cloudera.org:8080/#/c/14104/7/tests/custom_cluster/test_executor_groups.py@236
PS7, Line 236: 3
> Why this change ?
This test started failing because they assume a max number of queries can be 
run within an executor group and enforce this by setting 
-max_concurrent_queries. The issue is that they only set 
-max_concurrent_queries for executors, and set a much higher value for 
coordinators. That causes problems with this patch because backends might be 
released independently (e.g. executor backends might be released first, and 
then coordinator backends). This breaks the assertions in 
https://github.com/apache/impala/blob/master/tests/custom_cluster/test_auto_scaling.py#L173).

Specifically, what happens is that because there 5 streams, queries start 
getting queued up. Then, executor Backends start getting released (even though 
the query itself is still running and the Coordinator Backend has not been 
released yet). Then a query gets de-queued and submitted because there are 
enough executor slots (since they have been released) and enough coordinator 
slots (since the value is set to 16).


http://gerrit.cloudera.org:8080/#/c/14104/7/tests/custom_cluster/test_result_spooling.py
File tests/custom_cluster/test_result_spooling.py:

http://gerrit.cloudera.org:8080/#/c/14104/7/tests/custom_cluster/test_result_spooling.py@62
PS7, Line 62: self.wait_for_state(handle, self.client.QUERY_STATES['FINISHED'], 
timeout)
> Is there a potential race here ? If I understand it correctly, we may reach
Yeah there is, I added a sleep here. Once IMPALA-8825 is merged we can make 
this less racy because it adds metrics to count the number of spooled rows (if 
we had this we could loop until number of spooled rows = 2000).



--
To view, visit http://gerrit.cloudera.org:8080/14104
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I88bb11e0ede7574568020e0277dd8ac8d2586dc9
Gerrit-Change-Number: 14104
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Bikramjeet Vig <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sahil Takiar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Thu, 05 Sep 2019 20:28:27 +0000
Gerrit-HasComments: Yes

Reply via email to