wilfred-s commented on code in PR #457: URL: https://github.com/apache/yunikorn-site/pull/457#discussion_r1682486677
########## docs/user_guide/preemption.md ##########

@@ -228,4 +228,31 @@ In this example, two imbalances are observed:
 | `rt.ten-a.queue-2` | 0 | 0 |
 | `rt.ten-b` | 15 | 10 |
 | `rt.ten-b.queue-3` | 15 | 10 |
-| `rt.sys` | 0 | 10 |
\ No newline at end of file
+| `rt.sys` | 0 | 10 |
+
+### Redistribution of Quota and Preemption Storm
+
+#### Redistribution of Quota
+
+Setting up guaranteed resources for the queue present at a higher level in the whole queue hierarchy helps to re-distribute the quota among different groups especially when workloads of the same priority run in different groups. Unlike the default scheduler, Yunikorn preempts even the workloads of the same priority to free up resources for pending workloads who deserve to get the resources as per guaranteed quota. At times, one needs this kind of queue set up in a real production cluster for redistribution.
+
+For example, root.region[1-N].country[1-N].state[1-N]
+
+
+
+This queue set up has N regions under “root”, each region has N countries. If administrators want to redistribute the workloads of the same priority among different regions, then it is better to define the guaranteed quota for each region so that preemption helps to reach the situation of running the workloads by redistribution based on the guaranteed quota each region is supposed to get. That way each region uses the resources it deserves to get at the maximum possible level from the overall cluster resources.
+
+#### Preemption Storm
+
+With setup like above, there is a side effect of increasing the possibilities of preemption storm or loop happening within the specific region between different state queues (siblings belonging to same parent).
+
+ReplicaSets are a good example to look at for looping and circular preemption. Each time a pod from a replica set is removed the ReplicaSet controller will create a new pod to make sure the set is complete. That auto-recreation could trigger loops as described below.
+
+
+
+Replica set <i>State1 Repl</i> runs in queue <i>State1</i>. Replica set <i>State2 Repl</i> runs in the queue <i>State2</i>. Both queues belong to the same parent queue (they are siblings), <i>Country1</i>. The pods all run with the same settings for priority and preemption. There is no space left on the cluster. <i>State1</i> has no guaranteed quota, 4 pods of each vcores:1 are running and multiple pods of each vcores:1 of the replica set are pending. <i>State2</i> has no guaranteed quota, 4 pods of each vcores:1 are running and multiple pods of each vcores:1 of the replica set are pending. Both region, <i>region1</i> and country, <i>country1</i> queue usage is vcores:4. Since <i>region1</i> has a guaranteed quota of vcores:10 and usage of vcores:8 lower than its guaranteed quota leading to starvation of resources. All the queues (including both direct or indirect) below the parent queue are starving as it inherits the “under guaranteed” behavior from above said parent queue, <i>region1</i> calculation unless each state (leaf) queue has its own guaranteed quota. Now, either one of these state queues can trigger preemption.
+
+Let's say, <i>state1</i> triggers preemption to meet resource requirements for pending pods.
+To make room for a <i>State1 Repl</i> pod, a pod from the <i>State2 Repl</i> set is preempted. Now, the pending <i>State1 Repl</i> pod moves from pending to running. Now, the next scheduling cycle comes. Let's say, <i>State2</i> triggers preemption to meet resource requirements for its pending pods. In addition to already existing pending pods, pod preempted (killed) in earlier scheduling cycles would have been recreated automatically by this time as it is a replica set. To make room for a <i>State2 Repl</i> pod, a pod from the <i>State1 Repl</i> set is preempted. Now, the pending <i>State2 Repl</i> pod moves from pending to running and preempted (killed) pod belonging to <i>State1 Repl</i> set would be recreated again. Now, the next scheduling cycle comes. Again, the whole loop repeats killing each other from the siblings without going anywhere leading to a preemption storm causing instability of the queues.
+
+Defining guaranteed resources at queues at lower level or at end leaf queues can avoid the preemption storm or loop happening in the cluster. Administrators should be aware of the side effects of setting up guaranteed resources at any specific location in the whole queue hierarchy to reap the best possible outcomes of the preemption process.

Review Comment (on "Setting up guaranteed resources for the queue present at a higher level ..."):
- drop 'whole' from "the whole queue"
- start new sentence at "especially"
- Yunikorn with a capital K
- drop 'even the' from "even the workloads"
- add "the queues' " to "as per the queues' guaranteed quota"
- set up is one word "setup"

Review Comment (on "This queue set up has N regions under “root” ..."):
- set up is one word "setup"

Review Comment (on "Let's say, <i>state1</i> triggers preemption ... causing instability of the queues."):
- Add an extension on this: it could even happen for a child queue below country 2 that gets caught in the preemption storm.

Review Comment (on "With setup like above, there is a side effect ..."):
- add "a" to "With a setup"
- replace "possibilities" with "chance"
- add "a" to "a preemption storm or loop"

Review Comment (on "Replica set <i>State1 Repl</i> runs in queue <i>State1</i> ..."):
- use backquotes for names like `State1 Repl` instead of `<i>State1 repl</i>` to stick with the md flow and not make it HTML
- Describe the state of each queue as a bullet list

Review Comment (on "Defining guaranteed resources at queues at lower level ..."):
- add "from" to "or loop from happening"
- drop "whole" from "the whole queue"

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
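For the `root.region[1-N].country[1-N].state[1-N]` hierarchy the new doc section describes, a region-level guarantee could be expressed in a YuniKorn `queues.yaml` roughly as follows. This is a sketch, not part of the PR: the partition/queue nesting follows the scheduler's config layout, but the queue names (`region1`, `country1`, `state1`, `state2`) are taken from the storm example and the `vcore` figure is illustrative (check your shim's vcore units before copying values).

```yaml
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: region1
            # Guarantee set only at the region level, as the doc suggests:
            # preemption then rebalances same-priority workloads between
            # regions. Value is illustrative (the example uses vcores:10).
            resources:
              guaranteed:
                vcore: 10
            queues:
              - name: country1
                queues:
                  - name: state1   # leaf queues with no guarantee of their
                  - name: state2   # own inherit the region's starvation state
```

Note how `state1` and `state2` carry no `guaranteed` block of their own; that is exactly the configuration the Preemption Storm section warns can make siblings preempt each other in a loop.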
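The oscillation between `State1` and `State2` that the new doc text walks through can be sketched as a toy simulation. This is plain Python, not YuniKorn code; the queue names and pod counts come from the example (4 running, several pending per queue), and the strict alternation of preemptor/victim is a simplifying assumption:

```python
# Toy model of the sibling-queue preemption loop: two leaf queues on a
# full cluster, each running a ReplicaSet. Each scheduling cycle the
# starving queue preempts one pod from its sibling, and the ReplicaSet
# controller immediately recreates the victim, so nothing converges.

def simulate(cycles):
    running = {"state1": 4, "state2": 4}  # pods currently running per queue
    pending = {"state1": 2, "state2": 2}  # pods waiting per queue
    kills = 0
    for cycle in range(cycles):
        # Siblings alternate as preemptor (asker) and victim.
        asker, victim = ("state1", "state2") if cycle % 2 == 0 else ("state2", "state1")
        if pending[asker] > 0 and running[victim] > 0:
            running[victim] -= 1   # victim pod preempted (killed)
            kills += 1
            running[asker] += 1    # asker's pending pod starts in the freed slot
            pending[asker] -= 1
            pending[victim] += 1   # ReplicaSet controller recreates the killed pod
    return running, pending, kills

running, pending, kills = simulate(10)
# After an even number of cycles every queue is back exactly where it
# started, yet a pod was killed in every cycle: a preemption storm.
```

Guaranteed quotas on the leaf queues break the loop in this model: a queue at or above its own guarantee is no longer "starving", so neither side keeps qualifying as a preemptor.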
