Riza Suminto has posted comments on this change. ( http://gerrit.cloudera.org:8080/21257 )
Change subject: IMPALA-12980: Translate CpuAsk into admission control slots
......................................................................


Patch Set 13:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/21257/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21257/11//COMMIT_MSG@19
PS11, Line 19: rather
             : than sum of it (48)
> So if the configuration is correct, then cpuask will be always <= the number
> of cores?
Yes, CpuAsk should be <= the total cores of the selected executor group set. The exception is when CpuAsk is larger than even the biggest executor group set. In that case, we assign the query to the largest executor group set anyway, since the biggest one is treated as the "catch-all" group.

> I think that what needs a bit of explanation is "12 cores oversubscribed by
> 4x".
Let's assume that an executor host has exactly 48 CPU cores. What I want to say with the example is that if an executor is assigned 48 fragment instances that can all run in parallel without blocking each other, then 48 slots, rather than 12, should be assigned on that executor node.

Note that admission control slots inherently control how many queries can run concurrently in a cluster. If this query is given 12 slots, then 4 copies of the same query can run concurrently. But if the query is given 48 slots, only 1 can run at a time.


http://gerrit.cloudera.org:8080/#/c/21257/11/tests/custom_cluster/test_executor_groups.py
File tests/custom_cluster/test_executor_groups.py:

http://gerrit.cloudera.org:8080/#/c/21257/11/tests/custom_cluster/test_executor_groups.py@1245
PS11, Line 1245: # CoreCount={total=16 trace=F15:3+F01:1+F14:3+F03:1+F13:3+F05:1+F12:3+F07:1},
Right, tpcds_cpu_cost/tpcds-q01.test is confusing here because the test has stats injection. I will add a new Planner test without stats injection.

> Does the cost not matter in this case, so we give a parallel fragment a full
> slot even if we estimate it to process just a couple of rows?
The role of processing cost stops after the Planner has selected the parallelism of each fragment. Naively, the planner/scheduler could just total the number of fragment instances assigned to each node as the slot requirement, but that would be wasteful if only a subset of the fragments can run in parallel while the others are blocked waiting. Therefore, the planner/scheduler selects only the largest subset of fragments that do not block each other, and sums the instances of that subset as the slot requirement (see the sketch below).

> Another thing I don't get is that if F15 (a builder for a broadcast join) is
> included, then why F02 is not included, which has the source scan node of F15
> and should run in parallel? (+it has a much higher cost than F15)
In the new planner test that I will submit next, we will see that the parallelism of F02 (3) is lower than the parallelism of its child fragments F15 + F01 (3 + 1). F02 and F01 are blocking fragments because they have an AGGREGATE node in them. They also cannot begin work until F15, the join builder, is complete. The relevant part of the plan is quoted after the sketch below.
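To make the slot arithmetic in the comments above concrete, here is a rough Python sketch (an illustration only, not the actual planner/scheduler code; the fragment names and counts come from the CoreCount trace and the hypothetical 48-core example above):

# Illustration only (not Impala's actual planner/scheduler code): summing the
# largest non-blocking subset of fragment instances gives the per-host slot
# count, and that slot count caps how many queries can run concurrently.

# Per-host instance counts of the fragments that can all run in parallel
# without blocking each other, taken from the CoreCount trace quoted above.
non_blocking_subset = {
    "F15": 3, "F01": 1, "F14": 3, "F03": 1,
    "F13": 3, "F05": 1, "F12": 3, "F07": 1,
}
slots_per_host = sum(non_blocking_subset.values())
print(slots_per_host)  # 16, matching CoreCount={total=16 ...}

# Concurrency effect, assuming a 48-core executor host exposing 48 admission
# control slots (the hypothetical example above):
host_slots = 48
print(host_slots // 12)  # 4 -> a query asking 12 slots admits 4 copies at once
print(host_slots // 48)  # 1 -> a query asking 48 slots admits only 1 at a time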
F02:PLAN FRAGMENT [HASH(sr_customer_sk,sr_store_sk)] hosts=3 instances=3 (adjusted from 384)
Per-Instance Resources: mem-estimate=10.49MB mem-reservation=1.94MB thread-reservation=1
max-parallelism=3 segment-costs=[327730, 110283] cpu-comparison-result=4 [max(3 (self) vs 4 (sum children))]
17:AGGREGATE [FINALIZE]
|  output: sum:merge(SR_RETURN_AMT)
|  group by: sr_customer_sk, sr_store_sk
|  mem-estimate=10.00MB mem-reservation=1.94MB spill-buffer=64.00KB thread-reservation=0
|  tuple-ids=2 row-size=24B cardinality=53.52K cost=315877
|  in pipelines: 17(GETNEXT), 00(OPEN)
|
16:EXCHANGE [HASH(sr_customer_sk,sr_store_sk)]
|  mem-estimate=502.09KB mem-reservation=0B thread-reservation=0
|  tuple-ids=2 row-size=24B cardinality=53.52K cost=11853
|  in pipelines: 00(GETNEXT)
|
F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=3 (adjusted from 384)
Per-Host Shared Resources: mem-estimate=4.00MB mem-reservation=4.00MB thread-reservation=0 runtime-filters-memory=4.00MB
Per-Instance Resources: mem-estimate=26.33MB mem-reservation=2.12MB thread-reservation=1
max-parallelism=3 segment-costs=[351629, 110283] cpu-comparison-result=4 [max(3 (self) vs 4 (sum children))]
03:AGGREGATE [STREAMING]
|  output: sum(SR_RETURN_AMT)
|  group by: sr_customer_sk, sr_store_sk
|  mem-estimate=10.00MB mem-reservation=2.00MB spill-buffer=64.00KB thread-reservation=0
|  tuple-ids=2 row-size=24B cardinality=53.52K cost=315877
|  in pipelines: 00(GETNEXT)
|
02:HASH JOIN [INNER JOIN, BROADCAST]
|  hash-table-id=04
|  hash predicates: sr_returned_date_sk = d_date_sk
|  fk/pk conjuncts: sr_returned_date_sk = d_date_sk
|  mem-estimate=0B mem-reservation=0B spill-buffer=64.00KB thread-reservation=0
|  tuple-ids=0,1 row-size=24B cardinality=53.52K cost=23423
|  in pipelines: 00(GETNEXT), 01(OPEN)
|
|--F15:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
|  |  Per-Instance Resources: mem-estimate=2.95MB mem-reservation=2.94MB thread-reservation=1 runtime-filters-memory=1.00MB
|  |  max-parallelism=3 segment-costs=[520]
|  JOIN BUILD
...
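To read the cpu-comparison-result annotation on F02 above, here is a minimal sketch of the comparison (again an illustration only, not the planner's Java code; the child parallelism values follow the F15 + F01 discussion above):

# Illustration only: F02's cpu-comparison-result=4 comes from comparing its own
# parallelism against the summed parallelism of the child side.
f02_self_parallelism = 3                    # F02's own max-parallelism
child_parallelism = {"F15": 3, "F01": 1}    # child fragments per the discussion above

cpu_comparison_result = max(f02_self_parallelism, sum(child_parallelism.values()))
print(cpu_comparison_result)  # 4 -> the child side dominates, which is why F02's
                              # own instances do not appear in the CoreCount trace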
--
To view, visit http://gerrit.cloudera.org:8080/21257
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I338ca96555bfe8d07afce0320b3688a0861663f2
Gerrit-Change-Number: 21257
Gerrit-PatchSet: 13
Gerrit-Owner: Riza Suminto <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Comment-Date: Tue, 16 Apr 2024 15:45:07 +0000
Gerrit-HasComments: Yes