chenyulin0719 commented on code in PR #457:
URL: https://github.com/apache/yunikorn-site/pull/457#discussion_r1685744191


##########
docs/user_guide/preemption.md:
##########
@@ -228,4 +228,53 @@ In this example, two imbalances are observed:
 | `rt.ten-a.queue-2` | 0                          | 0                         |
 | `rt.ten-b`         | 15                         | 10                        |
 | `rt.ten-b.queue-3` | 15                         | 10                        |
-| `rt.sys`           | 0                          | 10                        |
\ No newline at end of file
+| `rt.sys`           | 0                          | 10                        |
+
+### Redistribution of Quota and Preemption Storm
+
+#### Redistribution of Quota
+
+Setting guaranteed resources on a queue at a higher level in the queue hierarchy helps to redistribute the quota among different groups. Especially when workloads of the same priority run in different groups, YuniKorn, unlike the default scheduler, preempts workloads of the same priority to free up resources for pending workloads that are entitled to them under the queues' guaranteed quota. This kind of queue setup is sometimes needed in a real production cluster for redistribution.
+
+For example: `root.region[1-N].country[1-N].state[1-N]`
+
+![preemption_quota_redistribution](../assets/preemption_quota_redistribution.png)
+
+This queue setup has N regions under `root`, and each region has N countries (each country with N states). If administrators want to redistribute workloads of the same priority among different regions, it is better to define a guaranteed quota for each region so that preemption redistributes the running workloads according to the guaranteed quota each region is supposed to get. That way each region uses, to the maximum possible extent, the share of the overall cluster resources it is entitled to.
+
+#### Preemption Storm
+
+With a setup like the above, there is a side effect: an increased chance of a preemption storm, or loop, happening within a specific region between different state queues (siblings belonging to the same parent).
+
+ReplicaSets are a good example to look at for looping and circular preemption. Each time a pod from a ReplicaSet is removed, the ReplicaSet controller creates a new pod to make sure the set stays complete. That auto-recreation can trigger loops as described below.
+
+![preemption_storm](../assets/preemption_storm.png)
+
+State of the queues:
+
+#### `Region1`
+
+* Guaranteed: vcores = 10
+* Usage: vcores = 8
+* Under guaranteed: usage < guaranteed, starving
+
+#### `State1`
+
+* Guaranteed: nil
+* Usage: vcores = 4
+* Pending: vcores = 5
+* Inherits "under guaranteed" behaviour from `Region1`, eligible to trigger 
preemption

Review Comment:
   How about describing the ReplicaSet requirement in the queue state list? For example:
   
   #### `State1`
   
   * Guaranteed: nil
   * **A ReplicaSet is submitted to queue and requesting 9 replicas, with each 
replica requiring `{vcore: 1}`.**
   * Usage: vcores = 4
   * Pending: vcores = 5
   * Inherits "under guaranteed" behaviour from `Region1`, eligible to trigger 
preemption
   
   
   Then we could remove the description below:  "4 pods of each vcores:1 are 
running and multiple pods of each vcores:1 of the replica set are pending."
   



##########
docs/user_guide/preemption.md:
##########
@@ -228,4 +228,53 @@ In this example, two imbalances are observed:
 | `rt.ten-a.queue-2` | 0                          | 0                         |
 | `rt.ten-b`         | 15                         | 10                        |
 | `rt.ten-b.queue-3` | 15                         | 10                        |
-| `rt.sys`           | 0                          | 10                        |
\ No newline at end of file
+| `rt.sys`           | 0                          | 10                        |
+
+### Redistribution of Quota and Preemption Storm
+
+#### Redistribution of Quota
+
+Setting guaranteed resources on a queue at a higher level in the queue hierarchy helps to redistribute the quota among different groups. Especially when workloads of the same priority run in different groups, YuniKorn, unlike the default scheduler, preempts workloads of the same priority to free up resources for pending workloads that are entitled to them under the queues' guaranteed quota. This kind of queue setup is sometimes needed in a real production cluster for redistribution.
+
+For example: `root.region[1-N].country[1-N].state[1-N]`
+
+![preemption_quota_redistribution](../assets/preemption_quota_redistribution.png)
+
+This queue setup has N regions under `root`, and each region has N countries (each country with N states). If administrators want to redistribute workloads of the same priority among different regions, it is better to define a guaranteed quota for each region so that preemption redistributes the running workloads according to the guaranteed quota each region is supposed to get. That way each region uses, to the maximum possible extent, the share of the overall cluster resources it is entitled to.
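+
+As a minimal sketch of such a setup (names and values are illustrative, assuming the standard YuniKorn `queues.yaml` partition layout), a guaranteed quota defined at the region level could look like:
+
+```yaml
+partitions:
+  - name: default
+    queues:
+      - name: root
+        queues:
+          - name: region1
+            resources:
+              guaranteed:
+                vcore: 10        # region-level guarantee drives redistribution
+            queues:
+              - name: country1
+                queues:
+                  - name: state1 # leaf queues inherit "under guaranteed"
+                  - name: state2 # unless they define their own guarantee
+```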
+
+#### Preemption Storm
+
+With a setup like the above, there is a side effect: an increased chance of a preemption storm, or loop, happening within a specific region between different state queues (siblings belonging to the same parent).
+
+ReplicaSets are a good example to look at for looping and circular preemption. Each time a pod from a ReplicaSet is removed, the ReplicaSet controller creates a new pod to make sure the set stays complete. That auto-recreation can trigger loops as described below.
+
+![preemption_storm](../assets/preemption_storm.png)
+
+State of the queues:
+
+#### `Region1`
+
+* Guaranteed: vcores = 10
+* Usage: vcores = 8
+* Under guaranteed: usage < guaranteed, starving
+
+#### `State1`
+
+* Guaranteed: nil
+* Usage: vcores = 4
+* Pending: vcores = 5
+* Inherits "under guaranteed" behaviour from `Region1`, eligible to trigger 
preemption
+
+#### `State2`
+
+* Guaranteed: nil
+* Usage: vcores = 4
+* Pending: vcores = 5
+* Inherits "under guaranteed" behaviour from `Region1`, eligible to trigger 
preemption
+
+Replica set `State1 Repl` runs in queue `State1`, and replica set `State2 Repl` runs in queue `State2`. Both queues belong to the same parent queue, `Country1` (they are siblings). All pods run with the same priority and preemption settings, and there is no space left on the cluster. `State1` has no guaranteed quota; 4 pods of `{vcore: 1}` each are running and multiple pods of `{vcore: 1}` each from its replica set are pending. `State2` is in the same situation: no guaranteed quota, 4 running pods of `{vcore: 1}` each, and multiple pending replica set pods of `{vcore: 1}` each. The usage of both the region queue, `Region1`, and the country queue, `Country1`, is therefore `{vcore: 8}`. Since `Region1` has a guaranteed quota of `{vcore: 10}` and a usage of `{vcore: 8}`, it is below its guaranteed quota and starving for resources. All queues below it (direct or indirect) inherit the "under guaranteed" behaviour from `Region1` unless a state (leaf) queue defines its own guaranteed quota, so both state queues are starving. Now either one of these state queues can trigger preemption.
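+
+The resulting loop can be sketched as a toy simulation (hypothetical illustration, not YuniKorn code): each starving state queue preempts a 1-vcore pod from its sibling, and the sibling's ReplicaSet controller immediately recreates that pod as pending, which makes the sibling preempt back.
+
+```python
+# Toy simulation of the preemption storm described above (hypothetical,
+# not YuniKorn code). Two sibling state queues sit under a region that is
+# under its guarantee; each queue runs a ReplicaSet that recreates any
+# preempted pod, so every preemption re-adds a pending request.
+
+class Queue:
+    def __init__(self, name, running, pending):
+        self.name = name
+        self.running = running  # vcores in use (1 vcore per pod)
+        self.pending = pending  # vcores requested but not scheduled
+
+def preempt(starved, victim):
+    """The starved queue preempts one 1-vcore pod from its sibling."""
+    victim.running -= 1
+    victim.pending += 1   # ReplicaSet controller recreates the pod -> pending
+    starved.running += 1
+    starved.pending -= 1
+
+s1 = Queue("State1", running=4, pending=5)
+s2 = Queue("State2", running=4, pending=5)
+
+# Both queues inherit "under guaranteed" from Region1 (guaranteed 10, usage 8),
+# so each keeps triggering preemption against the other.
+for step in range(4):
+    starved, victim = (s1, s2) if step % 2 == 0 else (s2, s1)
+    preempt(starved, victim)
+    print(step, s1.running, s1.pending, s2.running, s2.pending)
+
+# After every two steps the queues are back where they started: total usage
+# stays at 8 vcores, pending stays at 10, and no workload makes progress.
+```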

Review Comment:
   For "vcores:1" and "vcores:4",
   should we use backquotes here? ex:  `{vcore: 1}` and `{vcore: 4}`?  
   (Align with the resource description format in the other page: 
https://yunikorn.apache.org/docs/user_guide/use_cases#testing-3)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
