This is an automated email from the ASF dual-hosted git repository.
mani pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-site.git
The following commit(s) were added to refs/heads/master by this push:
new 7d583cfcde [YUNIKORN-3134] Publish design doc (#538)
7d583cfcde is described below
commit 7d583cfcded320a6a8edf3f284c5e04e53338f50
Author: Manikandan R <[email protected]>
AuthorDate: Thu Jan 22 13:39:20 2026 +0530
[YUNIKORN-3134] Publish design doc (#538)
Closes: #538
Signed-off-by: Manikandan R <[email protected]>
---
...tion_successful_best_effort_preemption_case.png | Bin 0 -> 39146 bytes
...quota_preemption_successful_preemption_case.png | Bin 0 -> 45757 bytes
...ota_preemption_unsuccessful_preemption_case.png | Bin 0 -> 33860 bytes
docs/design/quota_preemptor.md | 223 +++++++++++++++++++++
sidebars.js | 1 +
5 files changed, 224 insertions(+)
diff --git a/docs/assets/quota_preemption_successful_best_effort_preemption_case.png b/docs/assets/quota_preemption_successful_best_effort_preemption_case.png
new file mode 100644
index 0000000000..0f813b863e
Binary files /dev/null and b/docs/assets/quota_preemption_successful_best_effort_preemption_case.png differ
diff --git a/docs/assets/quota_preemption_successful_preemption_case.png b/docs/assets/quota_preemption_successful_preemption_case.png
new file mode 100644
index 0000000000..8efda70774
Binary files /dev/null and b/docs/assets/quota_preemption_successful_preemption_case.png differ
diff --git a/docs/assets/quota_preemption_unsuccessful_preemption_case.png b/docs/assets/quota_preemption_unsuccessful_preemption_case.png
new file mode 100644
index 0000000000..1eacf6bb6c
Binary files /dev/null and b/docs/assets/quota_preemption_unsuccessful_preemption_case.png differ
diff --git a/docs/design/quota_preemptor.md b/docs/design/quota_preemptor.md
new file mode 100644
index 0000000000..2edc8e6fe4
--- /dev/null
+++ b/docs/design/quota_preemptor.md
@@ -0,0 +1,223 @@
+---
+id: quota_change_enforcement_through_preemption
+title: Quota Enforcement through Preemption
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Motivation
+
+Queue quota can be increased or decreased based on need. In the case of an increase, the scheduler works as normal and there is no change in behavior: as usual, it keeps permitting both new and existing apps to run based on the newly configured quota. In the case of a decrease, the scheduler stops accepting new workloads and waits for the running applications' natural exit so that the newly configured quota eventually comes into effect. This behavior of natural exit is acceptable as long as all the run [...]
+
+## Goals
+
+* Implement preemption only for quota decrease.
+* Honour queue guaranteed resources.
+* Inter-queue preemption configuration should not be used in conjunction with this feature.
+* Queue priority related configurations should not be used in conjunction with this feature.
+
+## Non-Goals
+
+* Intra-Queue Preemption
+* Cross Node Preemption
+
+## Quota Preemption Configuration
+
+The Quota Enforcement through Preemption feature can be turned on or off globally by setting the appropriate property at the partition level. It is configurable as follows:
+
+```yaml
+partitions:
+ - name: <name of the partition>
+ preemption:
+ quotapreemptionenabled: <boolean value>
+ queues:
+ - name: <name of the queue>
+ resources:
+ max: <maximum resources allowed for this queue>
+```
+
+The default is false (disabled). It means the whole feature is turned OFF globally and preemption won't be triggered when lowering the quota, which is exactly the situation that exists today. Keeping the default as false therefore retains the existing behavior as is.
+
+Setting it to true turns the feature ON globally, and preemption is triggered whenever any queue quota decreases.
+
+## Quota Preemption Delay
+
+Quota Preemption Delay is the time duration after which preemption is triggered for quota changes. Triggering preemption immediately would otherwise have a profound impact, especially on queues where long-lived applications run. It applies only to quota decreases and has nothing to do with increases. The unit is seconds.
+
+It is configurable as follows:
+
+```yaml
+partitions:
+ - name: <name of the partition>
+ queues:
+ - name: <name of the queue>
+ resources:
+ max: <maximum resources allowed for this queue>
+ quota.preemption.delay: <quota preemption delay in seconds>
+```
+
+It can be any value between 0 and maxint.
+
+The default is 0. It means preemption won't be triggered when lowering the quota, which is exactly the situation that exists today: in the case of a quota decrease, the new quota is applied and brought into effect only for new requests during the next scheduling cycle, while existing applications continue to run as is. Keeping the default as 0 therefore retains the existing behavior as is.
+
+Setting a value between 60 seconds and 5 hours is preferable for most cases.
+
+Examples:
+
+1\) To trigger preemption after 2 hours
+
+```yaml
+partitions:
+ - name: default
+ queues:
+ - name: queueA
+ resources:
+ max:
+ {memory: 10G}
+ quota.preemption.delay: 7200
+```
+
+2\) To trigger preemption almost immediately, after just 5 seconds
+
+```yaml
+partitions:
+ - name: default
+ queues:
+ - name: queueA
+ resources:
+ max:
+ {memory: 10G}
+ quota.preemption.delay: 5
+```
+
+The delay timer starts once the config map has been saved. The config map can be modified for many reasons, but the timer starts only when either the queue max resources or the quota preemption delay property changes. Once the timer has started, either of these two properties may change again. In case of any such change, the timer resets and starts again from the beginning, not on top of the current timer value, to avoid unnecessary confusion. Manipulation based on the current timer [...]
+
+In addition, changes could be made to more than one queue at the same time but with different delay values. In case of the same delay for a leaf queue and a parent queue in the queue hierarchy, the leaf queue could be prioritized over the parent, as it might help the [queue selection process](#queue-selection-and-ordering) for the parent as described later. There is no need to go into further detail here, as these cases will be sorted out appropriately during the implementation.
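+
+To make the reset-from-zero semantics concrete, here is a minimal Go sketch of the timer behavior, assuming a hypothetical `quotaTimer` type (names and structure are illustrative, not the scheduler's actual code):
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// quotaTimer restarts the full delay whenever the queue's max resources or
+// the quota preemption delay property changes; any time remaining on a
+// running timer is discarded, never carried over.
+type quotaTimer struct {
+	delay time.Duration
+	timer *time.Timer
+}
+
+// onConfigSaved is called after the config map has been saved.
+func (q *quotaTimer) onConfigSaved(maxChanged, delayChanged bool) {
+	if !maxChanged && !delayChanged {
+		return // unrelated config map edits do not touch the timer
+	}
+	if q.timer != nil {
+		q.timer.Stop() // discard the current countdown
+	}
+	// Restart from the beginning, not on top of the current timer value.
+	q.timer = time.AfterFunc(q.delay, func() {
+		fmt.Println("quota preemption delay elapsed, trigger preemption")
+	})
+}
+
+func main() {
+	q := &quotaTimer{delay: 2 * time.Second}
+	q.onConfigSaved(true, false) // max resources changed: timer starts
+	q.onConfigSaved(false, true) // delay changed: restart from zero
+	time.Sleep(3 * time.Second)  // the trigger fires 2s after the restart
+}
+```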
+
+### Impact of Restart
+
+How does Quota Preemption Delay work when a quota change is followed by a restart? As explained earlier, the clock kicks in once the config map has been saved, say at T1. YuniKorn restarts at T2. When the service starts again after T2, the Quota Preemption Delay starts again from the beginning, which postpones the activity by the lost time (T2-T1) compared to the earlier schedule.
+
+## Meaningful Quota Decrease
+
+When a queue has both max resources and guaranteed resources set, a decreased quota must always remain greater than the guaranteed resources. Otherwise, it would violate the [goal](#goals) of taking guaranteed resources into account and making sure usage doesn't fall below guaranteed resources as part of the preemption process. Preemption should not make usage fall below guaranteed resources at any cost. Config validation is required to ensure the max resource is always greater than the guaranteed resources.
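+
+A minimal Go sketch of such a validation check, assuming a simple map-based resource type (the real config validation in yunikorn-core differs):
+
+```go
+package main
+
+import "fmt"
+
+// validateQuotaDecrease rejects a new max that would dip below the queue's
+// guaranteed resources for any resource type defined in the guarantee.
+func validateQuotaDecrease(newMax, guaranteed map[string]int64) error {
+	for typ, g := range guaranteed {
+		if m, ok := newMax[typ]; ok && m < g {
+			return fmt.Errorf("max %d for %q is below guaranteed %d", m, typ, g)
+		}
+	}
+	return nil
+}
+
+func main() {
+	err := validateQuotaDecrease(
+		map[string]int64{"memory": 40}, // proposed new max: 40G
+		map[string]int64{"memory": 50}, // guaranteed: 50G
+	)
+	fmt.Println(err) // max 40 for "memory" is below guaranteed 50
+}
+```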
+
+## Preemptable Resource Calculation
+
+Whenever any queue quota changes, the first step is to look at the usage. If the usage is below or equal to the new quota, there is no need to trigger preemption. Otherwise, preemption is triggered, and we need to calculate the resources that must be preempted in order to bring the new quota into effect. So, the "preemptable resource" is the amount of resources that needs to be preempted in order to bring the new quota into effect immediately.
+
+The "preemptable resource" is derived by subtracting the current usage from the newly defined "max resources" quota using the subtraction method
+
+`resources.subOnlyExisting(existing, newer)`
+
+since the resource types might differ. The possible scenarios include lowering all resource types, lowering specific resource types, complete elimination of specific resource types from the earlier configuration, and brand new resource types not present in the earlier configuration. Using the above resource subtraction method guarantees a reliable delta calculation.
+
+The result could contain all positive values, all negative values, or a combination of both. Only the resource types whose result is negative need to be preempted. Nothing needs to be done for the other resource types, as their usage is already equal to or below the corresponding resource type in the "max resources" value. So, the "preemptable resource" contains only the resource types which need to be preempted.
+
+The table below covers the various possible scenarios for the queue `root.a.b` with examples:
+
+| Queue | Current Max | New Max | Usage | Preemptable Resources |
+|--|--|--|--|--|
+| `root.a.b` | \{M: 100G\} | \{M: 50G\} | \{M: 80G\} | \{M: 30G\} |
+| | Nil | \{M: 50G\} | \{M: 80G\} | \{M: 30G\} |
+| | \{M: 100G, CPU: 100\} | \{M: 50G, CPU: 50\} | \{M: 80G, CPU: 80\} | \{M: 30G, CPU: 30\} |
+| | \{M: 100G, CPU: 100\} | \{M: 200G, CPU: 50\} | \{M: 100G, CPU: 80\} | \{CPU: 30\} |
+| | \{M: 100G\} | \{M: 100G, CPU: 100\} | \{M: 50G, CPU: 500\} | \{CPU: 400\} |
+| | \{M: 100G, CPU: 100\} | \{M: 50G\} | \{M: 80G, CPU: 100\} | \{M: 30G\} |
+| | \{M: 100G\} | \{CPU: 100\} | \{M: 100G, CPU: 500\} | \{CPU: 400\} |
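+
+To make the delta rule concrete, here is a minimal Go sketch, assuming a simple map-based resource type; the real `resources.subOnlyExisting` helper in yunikorn-core may differ in signature and semantics:
+
+```go
+package main
+
+import "fmt"
+
+// Resource is a simplified stand-in for the scheduler's resource type.
+type Resource map[string]int64
+
+// subOnlyExisting subtracts usage from the new max for every resource type
+// present in the new max; types absent from the new max are ignored.
+func subOnlyExisting(newMax, usage Resource) Resource {
+	delta := Resource{}
+	for typ, m := range newMax {
+		delta[typ] = m - usage[typ]
+	}
+	return delta
+}
+
+// preemptable keeps only the types whose delta went negative and flips the
+// sign, yielding the amount that has to be preempted per resource type.
+func preemptable(delta Resource) Resource {
+	out := Resource{}
+	for typ, v := range delta {
+		if v < 0 {
+			out[typ] = -v
+		}
+	}
+	return out
+}
+
+func main() {
+	// Last row of the table: new max {CPU: 100}, usage {M: 100G, CPU: 500}.
+	newMax := Resource{"CPU": 100}
+	usage := Resource{"M": 100, "CPU": 500}
+	fmt.Println(preemptable(subOnlyExisting(newMax, usage))) // map[CPU:400]
+}
+```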
+
+## Lowering Leaf Queue Quota
+
+Whenever a leaf queue quota changes, the preemptable resource described above is used as the target. Once we know the target, the next step is to go over the tasks running in the queue being worked on to choose the potential victims.
+
+### How are victims selected and processed?
+
+Collect all running tasks based on the below criteria:
+
+1. Ignore daemon set pods.
+2. Choose pods which have "at least one" match with the preemptable resources.
+
+Once the set of tasks has been collected, it has to be sorted based on the below preference rules:
+
+1. Prefer non-originator pods over originator pods.
+2. In case of a tie in rule \#1, prefer pods based on priority: lower priority pods are picked first, followed by higher priority pods.
+3. In case of a tie in rule \#2, prefer pods for which the "allowPreemption" flag has been set to true over other pods.
+4. In case of a tie in rule \#3, prefer "young" pods based on age.
+
+
+Once the tasks are sorted, they can be traversed one by one and preempted to free up resources until the target is reached.
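+
+A minimal Go sketch of this ordering, using hypothetical field names (the scheduler's actual task representation differs):
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+	"time"
+)
+
+// victim is an illustrative view of a running task.
+type victim struct {
+	name            string
+	originator      bool // whether the pod is the application's originator
+	priority        int32
+	allowPreemption bool
+	created         time.Time
+}
+
+// sortVictims applies the four preference rules in order: non-originators
+// first, then lower priority first, then allowPreemption=true first, then
+// younger pods first.
+func sortVictims(v []victim) {
+	sort.SliceStable(v, func(i, j int) bool {
+		a, b := v[i], v[j]
+		if a.originator != b.originator {
+			return !a.originator // rule 1
+		}
+		if a.priority != b.priority {
+			return a.priority < b.priority // rule 2
+		}
+		if a.allowPreemption != b.allowPreemption {
+			return a.allowPreemption // rule 3
+		}
+		return a.created.After(b.created) // rule 4: younger first
+	})
+}
+
+func main() {
+	now := time.Now()
+	pods := []victim{
+		{"driver", true, 10, false, now.Add(-time.Hour)},
+		{"worker-1", false, 10, true, now},
+	}
+	sortVictims(pods)
+	fmt.Println(pods[0].name) // worker-1: non-originator preferred
+}
+```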
+
+## Lowering Parent Queue Quota
+
+Whenever a leaf queue quota changes, we know the specific queue to work on. In the case of lowering a parent queue quota, however, we need to traverse all the way down to collect the leaf queues, because that is where the actual tasks need to be preempted. But how do we select the queue(s) among the different queue paths located under the parent queue whose quota has been decreased?
+
+### Queue selection and ordering
+
+Child queues with nil usage and queues for which [preemption is already in progress](#quota-preemption-delay) are eliminated at the very beginning, so they do not participate in the steps described below. Once the elimination is done, the above calculated preemptable resource can be distributed fairly among the remaining child queues as described below.
+
+Only usage above the guaranteed resources is allowed to be preempted, to ensure that this preemption doesn't end up taking usage below the guaranteed resources. So, the usage above guaranteed resources is calculated and carved out for each child queue, termed its "releasable resources". It is simply the configured guaranteed resources subtracted from the usage. Guaranteed resources might have been configured only for a few types but not for all. So, all different combinations are [...]
+
+Once the "releasable resources" is derived, the "total releasable resources" is calculated by adding up all child queues' "releasable resources". After that, the "releasable resources %" for each child queue is derived by dividing its "releasable resources" by the "total releasable resources". The "releasable resources %" might have different values for each resource type, as it all depends on the guaranteed resources configuration and the usage. Now, applying the "releasable resources %" to the above calculated preemptable resource gives each child queue its share of the preemption target.
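+
+A minimal Go sketch of this proportional split for a single resource type, with hypothetical names and a simplified queue shape:
+
+```go
+package main
+
+import "fmt"
+
+// child holds one child queue's usage and guarantee for a single resource type.
+type child struct {
+	name       string
+	usage      int64
+	guaranteed int64
+}
+
+// releasable is the usage above the guarantee; preemption must never take a
+// queue below its guarantee, so anything at or under it releases nothing.
+func releasable(c child) int64 {
+	if r := c.usage - c.guaranteed; r > 0 {
+		return r
+	}
+	return 0
+}
+
+// distribute splits the parent's preemptable target across children in
+// proportion to each child's share of the total releasable resources.
+func distribute(target int64, children []child) map[string]int64 {
+	var total int64
+	for _, c := range children {
+		total += releasable(c)
+	}
+	out := map[string]int64{}
+	if total == 0 {
+		return out // nothing can be released without breaching guarantees
+	}
+	for _, c := range children {
+		if r := releasable(c); r > 0 {
+			out[c.name] = target * r / total
+		}
+	}
+	return out
+}
+
+func main() {
+	children := []child{{"root.p.a", 80, 50}, {"root.p.b", 40, 20}, {"root.p.c", 10, 10}}
+	// Parent needs 30G back; a releases 30, b releases 20, c releases 0,
+	// so a is asked for 18G and b for 12G.
+	fmt.Println(distribute(30, children))
+}
+```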
+
+A side effect of the above shift is that honoring guaranteed resources on one specific queue increases or decreases the preemption rate on its siblings, which is something the admin should take into account while doing quota readjustments. Later, we could introduce a configuration to decide the percentage of distribution among queues, or a queue based individual preemption property to cap the maximum preemptable resource.
+
+Since we know the preemptable resource target for each queue located under the current parent queue being worked on, the next step is to repeat the same process at every level until we reach the leaf queues branched from different queues at different levels. As always, a recursive approach is followed here too. After reaching the leaf queues, the next step is to follow the steps described in the [victims selection process](#how-are-victims-selected-and-processed).
+
+Running the queue selection process before the [victims selection process](#how-are-victims-selected-and-processed) has the disadvantage of not sorting by pod significance across all queues. However, given the advantages described above, queue selection ensures fairness in the preemption process triggered by lowering a parent queue quota.
+
+## “Preemption does not help” situations
+
+The resources used by each victim may or may not differ. Selecting victims from the potential victims involves the necessary checks, one of which is to ensure usage doesn't go below the guaranteed resources. So, there could be cases where there are no viable options to proceed further. For example: preemptable resources are \{memory: 40 GB\}, guaranteed resources are \{memory: 50 GB\}, usage is \{memory: 100 GB\}, and there are 5 victims, each using \{memory: 20 GB\}. Preempting 2 victims is not a pr [...]
+
+Unlike intra-queue preemption, where an ask (the preemptor) goes through the preemption process every scheduling cycle at an interval of \`X\` seconds, so that the probability of preemption yielding a positive outcome after multiple attempts is higher, this preemption process has no such inherent retry capability to increase the probability of successful preemption. Even if it were available, it would not help much, as the running victims are long lived workloads and not going to change immedi [...]
+
+However, there could be situations where the above explained "best effort" cannot be applied to get the best out of the process, as it depends on the number of potential victims available at any given moment. If there are not enough victims available, maybe only 1 victim, preemption does not help at all, as attempting it would make usage go below the guaranteed resources.
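+
+A minimal Go sketch of this guarantee-bounded, best effort loop, using the example numbers above (hypothetical helper, not scheduler code):
+
+```go
+package main
+
+import "fmt"
+
+// bestEffort preempts victims one by one but never lets usage fall below the
+// guarantee; it may stop short of the target, which is the "best effort"
+// outcome described above. Returns the amount actually freed.
+func bestEffort(target, usage, guaranteed int64, victims []int64) (freed int64) {
+	for _, v := range victims {
+		if freed >= target {
+			break // target reached
+		}
+		if usage-freed-v < guaranteed {
+			break // preempting this victim would breach the guarantee
+		}
+		freed += v
+	}
+	return freed
+}
+
+func main() {
+	// target 40G, usage 100G, guaranteed 50G, five victims of 20G each:
+	// two victims are preempted (40G freed); a third would breach 50G.
+	fmt.Println(bestEffort(40, 100, 50, []int64{20, 20, 20, 20, 20})) // 40
+}
+```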
+
+Administrators should be informed about all such situations through well defined mechanisms, to avoid surprises.
+
+## Relevance of existing Intra Queue Preemption Configurations and Queue Setups
+
+In short, all existing intra-queue preemption configurations have no relevance from this design's point of view and should not be used in conjunction with the newly added configurations.
+
+### Preemption Fence & Preemption Delay
+
+Existing preemption related properties like preemption.policy and preemption.delay should not be considered or used in conjunction with this [preemption](#motivation), as its objective is completely different. In addition, the reason for not taking preemption.policy (fence) into account, especially for the child queues underneath the current parent queue being worked on, is to avoid biased and incorrect decisions. A fence is a unidirectional way of traversing down t [...]
+
+Queue=\>properties=\>preemption.delay should not be confused with the above discussed quota change preemption delay, [Queue=\>resources=\>quota.preemption.delay](#quota-preemption-delay); they are treated differently, as the former was introduced to compensate for the scheduling cycle interval and is used only in the scheduling cycle core path.
+
+### Priority Fence & Offset
+
+priority.policy (fence) and priority.offset make an impact when working across different queues. These two properties are used in inter-queue preemption before choosing victims from queues located under different queue paths. This preemption, in contrast, looks inside a single queue, or the queues located underneath the parent queue being worked on to apply the quota change. So, the above priority related properties need not be considered.
+
+### Dynamic Queues
+
+When a dynamic queue is created, it uses the child template defined on its immediate parent queue or any ancestor queue in the queue hierarchy. When a quota is lowered on an existing child template, the currently used dynamic queue can be left as is: the queue itself has a shorter life span and will be deleted anyway once the application completes. For a newly created dynamic queue, the resource configuration is based on the newly defined quota and works as expected.
+
+Quota can be lowered through annotations as well. In such cases, the default behavior applies, as there is no way to set quota.preemption.delay through annotations as of now. This means a newly lowered quota is applied immediately, but only for new requests; existing applications continue to run as is and go through a natural exit.
+
+## Future Plans
+
+1. Victims could be further segregated based on nodes. Once node wise victims are available, node fulfillment processes like bin packing could be brought into the picture.
+2. Introduce a "closest match" algorithm to choose victims, which is less intrusive from a general perspective: prefer the "closest match" victim by starting with victims whose resource types exactly match the "preemptable resources", then order them ascending by the count of non matching resource types. A sketch of this ordering follows below.
+
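+A minimal Go sketch of the "closest match" ordering from item 2, with hypothetical names:
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+)
+
+// nonMatching counts the victim's resource types that are absent from the
+// preemptable set; an exact match scores zero.
+func nonMatching(victim, preemptable map[string]int64) int {
+	n := 0
+	for typ := range victim {
+		if _, ok := preemptable[typ]; !ok {
+			n++
+		}
+	}
+	return n
+}
+
+// closestMatchOrder sorts victims ascending by their non-matching count, so
+// exact matches come first and the least related victims come last.
+func closestMatchOrder(victims []map[string]int64, preemptable map[string]int64) {
+	sort.SliceStable(victims, func(i, j int) bool {
+		return nonMatching(victims[i], preemptable) < nonMatching(victims[j], preemptable)
+	})
+}
+
+func main() {
+	preemptable := map[string]int64{"memory": 30}
+	victims := []map[string]int64{
+		{"memory": 10, "vcore": 2}, // one non-matching type
+		{"memory": 20},             // exact match, preferred
+	}
+	closestMatchOrder(victims, preemptable)
+	fmt.Println(victims) // [map[memory:20] map[memory:10 vcore:2]]
+}
+```
+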
diff --git a/sidebars.js b/sidebars.js
index 3abb47de95..9b0c80013d 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -103,6 +103,7 @@ module.exports = {
'design/cache_removal',
'design/preemption',
'design/simple_preemptor',
+ 'design/quota_change_enforcement_through_preemption',
'design/generic_resource',
'design/priority_scheduling',
'design/resilience',
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]