This is an automated email from the ASF dual-hosted git repository.
mani pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-site.git
The following commit(s) were added to refs/heads/master by this push:
new 7d583cfcde [YUNIKORN-3134] Publish design doc (#538)
7d583cfcde is described below
commit 7d583cfcded320a6a8edf3f284c5e04e53338f50
Author: Manikandan R <[email protected]>
AuthorDate: Thu Jan 22 13:39:20 2026 +0530
[YUNIKORN-3134] Publish design doc (#538)
Closes: #538
Signed-off-by: Manikandan R <[email protected]>
---
...tion_successful_best_effort_preemption_case.png | Bin 0 -> 39146 bytes
...quota_preemption_successful_preemption_case.png | Bin 0 -> 45757 bytes
...ota_preemption_unsuccessful_preemption_case.png | Bin 0 -> 33860 bytes
docs/design/quota_preemptor.md | 223 +++++++++++++++++++++
sidebars.js | 1 +
5 files changed, 224 insertions(+)
diff --git a/docs/assets/quota_preemption_successful_best_effort_preemption_case.png b/docs/assets/quota_preemption_successful_best_effort_preemption_case.png
new file mode 100644
index 0000000000..0f813b863e
Binary files /dev/null and b/docs/assets/quota_preemption_successful_best_effort_preemption_case.png differ
diff --git a/docs/assets/quota_preemption_successful_preemption_case.png b/docs/assets/quota_preemption_successful_preemption_case.png
new file mode 100644
index 0000000000..8efda70774
Binary files /dev/null and b/docs/assets/quota_preemption_successful_preemption_case.png differ
diff --git a/docs/assets/quota_preemption_unsuccessful_preemption_case.png b/docs/assets/quota_preemption_unsuccessful_preemption_case.png
new file mode 100644
index 0000000000..1eacf6bb6c
Binary files /dev/null and b/docs/assets/quota_preemption_unsuccessful_preemption_case.png differ
diff --git a/docs/design/quota_preemptor.md b/docs/design/quota_preemptor.md
new file mode 100644
index 0000000000..2edc8e6fe4
--- /dev/null
+++ b/docs/design/quota_preemptor.md
@@ -0,0 +1,223 @@
+---
+id: quota_change_enforcement_through_preemption
+title: Quota Enforcement through Preemption
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Motivation
+
+Queue quota can be increased or decreased based on need. In the case of an increase, the scheduler works as normal and there is no change in behavior: as usual, it keeps permitting both new and existing apps to run based on the newly configured quota. In the case of a decrease, the scheduler stops accepting new workloads and waits for the running applications' natural exit so that the newly configured quota eventually comes into effect. This behavior of natural exit is acceptable as long as all the run [...]
+
+## Goals
+
+* Implement preemption only for quota decrease.
+* Honour queue guaranteed resources.
+* Inter-queue preemption configuration should not be used in conjunction with this feature.
+* Queue priority related configurations should not be used in conjunction with this feature.
+
+## Non-Goals
+
+* Intra-Queue Preemption
+* Cross Node Preemption
+
+## Quota Preemption Configuration
+
+The Quota Enforcement through Preemption feature can be turned on or off globally by setting the appropriate property at the partition level. It is configurable as follows:
+
+```yaml
+partitions:
+ - name: <name of the partition>
+ preemption:
+ quotapreemptionenabled: <boolean value>
+ queues:
+ - name: <name of the queue>
+ resources:
+ max: <maximum resources allowed for this queue>
+```
+
+The default is false (disabled). It means the whole feature is turned OFF globally and preemption won't be triggered when lowering the quota, which is exactly the situation that exists today. Keeping the default as false therefore retains the existing behavior as is.
+
+Setting it to true turns the feature ON globally, and preemption is triggered whenever any queue quota decreases.
+
+## Quota Preemption Delay
+
+Quota Preemption Delay is the time duration after which preemption is triggered for quota changes. Triggering preemption immediately would otherwise have a profound impact, especially on queues where long-lived applications run. It applies only to quota decreases and has nothing to do with increases. The unit is seconds.
+
+It is configurable as follows:
+
+```yaml
+partitions:
+ - name: <name of the partition>
+ queues:
+ - name: <name of the queue>
+ resources:
+ max: <maximum resources allowed for this queue>
+ quota.preemption.delay: <quota preemption delay in seconds>
+```
+
+It can be any value between 0 and maxint.
+
+The default is 0. It means preemption won't be triggered when lowering the quota, which is exactly the situation that exists today: in the case of a quota decrease, the new quota is applied and brought into effect only for new requests during the next scheduling cycle, while existing applications continue to run as is. Keeping the default as 0 therefore retains the existing behavior as is.
+
+Setting a value between 60 seconds and 5 hours is preferable for most cases.
+
+Examples:
+
+1\) To trigger preemption after 2 hours
+
+```yaml
+partitions:
+ - name: default
+ queues:
+ - name: queueA
+ resources:
+ max:
+ {memory: 10G}
+ quota.preemption.delay: 7200
+```
+
+2\) To trigger preemption almost immediately, after just 5 seconds
+
+```yaml
+partitions:
+ - name: default
+ queues:
+ - name: queueA
+ resources:
+ max:
+ {memory: 10G}
+ quota.preemption.delay: 5
+```
+
+The delay timer starts once the config map has been saved. The config map can be modified for many reasons, but the timer starts only when either the queue max resources or the quota preemption delay property changes. Once the timer has started, either of these two properties may change again. In case of any such change, the timer resets and starts again from the beginning, not on top of the current timer value, to avoid unnecessary confusion. Manipulation based on the current timer [...]
+
+In addition, changes could be made to more than one queue at the same time but with different delay values. In case of the same delay for a leaf queue and a parent queue in the queue hierarchy, the leaf queue could be prioritized over the parent, as it might help the [queue selection process](#queue-selection-and-ordering) for the parent as described later. There is no need to go into further detail here, as these cases will be sorted out appropriately during the implementation.
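+
+To make the reset-from-zero semantics concrete, here is a minimal Go sketch of the timer behavior, assuming a hypothetical `quotaTimer` type (names and structure are illustrative, not the scheduler's actual code):
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// quotaTimer restarts the full delay whenever the queue's max resources or
+// the quota preemption delay property changes; any time remaining on a
+// running timer is discarded, never carried over.
+type quotaTimer struct {
+	delay time.Duration
+	timer *time.Timer
+}
+
+// onConfigSaved is called after the config map has been saved.
+func (q *quotaTimer) onConfigSaved(maxChanged, delayChanged bool) {
+	if !maxChanged && !delayChanged {
+		return // unrelated config map edits do not touch the timer
+	}
+	if q.timer != nil {
+		q.timer.Stop() // discard the current countdown
+	}
+	// Restart from the beginning, not on top of the current timer value.
+	q.timer = time.AfterFunc(q.delay, func() {
+		fmt.Println("quota preemption delay elapsed, trigger preemption")
+	})
+}
+
+func main() {
+	q := &quotaTimer{delay: 2 * time.Second}
+	q.onConfigSaved(true, false) // max resources changed: timer starts
+	q.onConfigSaved(false, true) // delay changed: restart from zero
+	time.Sleep(3 * time.Second)  // the trigger fires 2s after the restart
+}
+```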
+
+### Impact of Restart
+
+How does Quota Preemption Delay work when a quota change is followed by a restart? As explained earlier, the clock kicks in once the config map has been saved, say at T1. YuniKorn restarts at T2. When the service starts again after T2, the Quota Preemption Delay starts again from the beginning, which postpones the activity by the lost time (T2-T1) compared to the earlier schedule.
+
+## Meaningful Quota Decrease
+
+When a queue has both max resources and guaranteed resources set, a decreased quota must always remain greater than the guaranteed resources. Otherwise, it would violate the [goal](#goals) of taking guaranteed resources into account and making sure usage doesn't fall below guaranteed resources as part of the preemption process. Preemption should not make usage fall below guaranteed resources at any cost. Config validation is required to ensure the max resource is always greater than the guaranteed resources.
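+
+A minimal Go sketch of such a validation check, assuming a simple map-based resource type (the real config validation in yunikorn-core differs):
+
+```go
+package main
+
+import "fmt"
+
+// validateQuotaDecrease rejects a new max that would dip below the queue's
+// guaranteed resources for any resource type defined in the guarantee.
+func validateQuotaDecrease(newMax, guaranteed map[string]int64) error {
+	for typ, g := range guaranteed {
+		if m, ok := newMax[typ]; ok && m < g {
+			return fmt.Errorf("max %d for %q is below guaranteed %d", m, typ, g)
+		}
+	}
+	return nil
+}
+
+func main() {
+	err := validateQuotaDecrease(
+		map[string]int64{"memory": 40}, // proposed new max: 40G
+		map[string]int64{"memory": 50}, // guaranteed: 50G
+	)
+	fmt.Println(err) // max 40 for "memory" is below guaranteed 50
+}
+```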
+
+## Preemptable Resource Calculation
+
+Whenever any queue quota changes, the first step is to look at the usage. If the usage is below or equal to the new quota, there is no need to trigger preemption. Otherwise, preemption is triggered, and we need to calculate the resources that must be preempted in order to bring the new quota into effect. So, the "preemptable resource" is the amount of resources that needs to be preempted in order to bring the new quota into effect immediately.
+
+The "preemptable resource" is derived by subtracting the current usage from the newly defined "max resources" quota using the subtraction method
+
+`resources.subOnlyExisting(existing, newer)`
+
+since the resource types might differ. The possible scenarios include lowering all resource types, lowering specific resource types, complete elimination of specific resource types from the earlier configuration, and brand new resource types not present in the earlier configuration. Using the above resource subtraction method guarantees a reliable delta calculation.
+
+The result could contain all positive values, all negative values, or a combination of both. Only the resource types whose result is negative need to be preempted. Nothing needs to be done for the other resource types, as their usage is already equal to or below the corresponding resource type in the "max resources" value. So, the "preemptable resource" contains only the resource types which need to be preempted.
+
+The table below covers the various possible scenarios for the queue `root.a.b` with examples:
+
+| Queue | Current Max | New Max | Usage | Preemptable Resources |
+|--|--|--|--|--|
+| `root.a.b` | \{M: 100G\} | \{M: 50G\} | \{M: 80G\} | \{M: 30G\} |
+| | Nil | \{M: 50G\} | \{M: 80G\} | \{M: 30G\} |
+| | \{M: 100G, CPU: 100\} | \{M: 50G, CPU: 50\} | \{M: 80G, CPU: 80\} | \{M: 30G, CPU: 30\} |
+| | \{M: 100G, CPU: 100\} | \{M: 200G, CPU: 50\} | \{M: 100G, CPU: 80\} | \{CPU: 30\} |
+| | \{M: 100G\} | \{M: 100G, CPU: 100\} | \{M: 50G, CPU: 500\} | \{CPU: 400\} |
+| | \{M: 100G, CPU: 100\} | \{M: 50G\} | \{M: 80G, CPU: 100\} | \{M: 30G\} |
+| | \{M: 100G\} | \{CPU: 100\} | \{M: 100G, CPU: 500\} | \{CPU: 400\} |
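+
+To make the delta rule concrete, here is a minimal Go sketch, assuming a simple map-based resource type; the real `resources.subOnlyExisting` helper in yunikorn-core may differ in signature and semantics:
+
+```go
+package main
+
+import "fmt"
+
+// Resource is a simplified stand-in for the scheduler's resource type.
+type Resource map[string]int64
+
+// subOnlyExisting subtracts usage from the new max for every resource type
+// present in the new max; types absent from the new max are ignored.
+func subOnlyExisting(newMax, usage Resource) Resource {
+	delta := Resource{}
+	for typ, m := range newMax {
+		delta[typ] = m - usage[typ]
+	}
+	return delta
+}
+
+// preemptable keeps only the types whose delta went negative and flips the
+// sign, yielding the amount that has to be preempted per resource type.
+func preemptable(delta Resource) Resource {
+	out := Resource{}
+	for typ, v := range delta {
+		if v < 0 {
+			out[typ] = -v
+		}
+	}
+	return out
+}
+
+func main() {
+	// Last row of the table: new max {CPU: 100}, usage {M: 100G, CPU: 500}.
+	newMax := Resource{"CPU": 100}
+	usage := Resource{"M": 100, "CPU": 500}
+	fmt.Println(preemptable(subOnlyExisting(newMax, usage))) // map[CPU:400]
+}
+```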
+
+## Lowering Leaf Queue Quota
+
+Whenever a leaf queue quota changes, the preemptable resource described above is used as the target. Once we know the target, the next step is to go over the tasks running in the queue being worked on to choose the potential victims.
+
+### How are victims selected and processed?
+
+Collect all running tasks based on the below criteria:
+
+1. Ignore daemon set pods.
+2. Choose pods which have "at least one" match with the preemptable resources.
+
+Once the set of tasks has been collected, it has to be sorted based on the below preference rules:
+
+1. Prefer non-originator pods over originator pods.
+2. In case of a tie in rule \#1, prefer pods based on priority: lower priority pods are picked first, followed by higher priority pods.
+3. In case of a tie in rule \#2, prefer pods for which the "allowPreemption" flag has been set to true over other pods.
+4. In case of a tie in rule \#3, prefer "young" pods based on age.
+
+
+Once the tasks are sorted, they can be traversed one by one and preempted to free up resources until the target is reached.
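+
+A minimal Go sketch of this ordering, using hypothetical field names (the scheduler's actual task representation differs):
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+	"time"
+)
+
+// victim is an illustrative view of a running task.
+type victim struct {
+	name            string
+	originator      bool // whether the pod is the application's originator
+	priority        int32
+	allowPreemption bool
+	created         time.Time
+}
+
+// sortVictims applies the four preference rules in order: non-originators
+// first, then lower priority first, then allowPreemption=true first, then
+// younger pods first.
+func sortVictims(v []victim) {
+	sort.SliceStable(v, func(i, j int) bool {
+		a, b := v[i], v[j]
+		if a.originator != b.originator {
+			return !a.originator // rule 1
+		}
+		if a.priority != b.priority {
+			return a.priority < b.priority // rule 2
+		}
+		if a.allowPreemption != b.allowPreemption {
+			return a.allowPreemption // rule 3
+		}
+		return a.created.After(b.created) // rule 4: younger first
+	})
+}
+
+func main() {
+	now := time.Now()
+	pods := []victim{
+		{"driver", true, 10, false, now.Add(-time.Hour)},
+		{"worker-1", false, 10, true, now},
+	}
+	sortVictims(pods)
+	fmt.Println(pods[0].name) // worker-1: non-originator preferred
+}
+```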
+
+## Lowering Parent Queue Quota
+
+Whenever a leaf queue quota changes, we know the specific queue to work on. In the case of lowering a parent queue quota, however, we need to traverse all the way down to collect the leaf queues, because that is where the actual tasks need to be preempted. But how do we select the queue(s) among the different queue paths located under the parent queue whose quota has been decreased?
+
+### Queue selection and ordering
+
+Child queues with nil usage and queues for which [preemption is already in progress](#quota-preemption-delay) are eliminated at the very beginning, so they do not participate in the steps described below. Once the elimination is done, the above calculated preemptable resource can be distributed fairly among the remaining child queues as described below.
+
+Only usage above the guaranteed resources is allowed to be preempted, to ensure that this preemption doesn't end up taking usage below the guaranteed resources. So, the usage above guaranteed resources is calculated and carved out for each child queue, termed its "releasable resources". It is simply the configured guaranteed resources subtracted from the usage. Guaranteed resources might have been configured only for a few types but not for all. So, all different combinations are [...]
+
+Once the "releasable resources" is derived, the "total releasable resources" is calculated by adding up all child queues' "releasable resources". After that, the "releasable resources %" for each child queue is derived by dividing its "releasable resources" by the "total releasable resources". The "releasable resources %" might have different values for each resource type, as it all depends on the guaranteed resources configuration and the usage. Now, applying the "releasable resources %" to the above calculated preemptable resource gives each child queue its share of the preemption target.
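+
+A minimal Go sketch of this proportional split for a single resource type, with hypothetical names and a simplified queue shape:
+
+```go
+package main
+
+import "fmt"
+
+// child holds one child queue's usage and guarantee for a single resource type.
+type child struct {
+	name       string
+	usage      int64
+	guaranteed int64
+}
+
+// releasable is the usage above the guarantee; preemption must never take a
+// queue below its guarantee, so anything at or under it releases nothing.
+func releasable(c child) int64 {
+	if r := c.usage - c.guaranteed; r > 0 {
+		return r
+	}
+	return 0
+}
+
+// distribute splits the parent's preemptable target across children in
+// proportion to each child's share of the total releasable resources.
+func distribute(target int64, children []child) map[string]int64 {
+	var total int64
+	for _, c := range children {
+		total += releasable(c)
+	}
+	out := map[string]int64{}
+	if total == 0 {
+		return out // nothing can be released without breaching guarantees
+	}
+	for _, c := range children {
+		if r := releasable(c); r > 0 {
+			out[c.name] = target * r / total
+		}
+	}
+	return out
+}
+
+func main() {
+	children := []child{{"root.p.a", 80, 50}, {"root.p.b", 40, 20}, {"root.p.c", 10, 10}}
+	// Parent needs 30G back; a releases 30, b releases 20, c releases 0,
+	// so a is asked for 18G and b for 12G.
+	fmt.Println(distribute(30, children))
+}
+```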
+
+A side effect of the above shift is that honoring guaranteed resources on one specific queue increases or decreases the preemption rate on its siblings, which is something the admin should take into account while doing quota readjustments. Later, we could introduce a configuration to decide the percentage of distribution among queues, or a queue based individual preemption property to cap the maximum preemptable resource.
+
+Since we know the preemptable resource target for each queue located under the current parent queue being worked on, the next step is to repeat the same process at every level until we reach the leaf queues branched from different queues at different levels. As always, a recursive approach is followed here too. After reaching the leaf queues, the next step is to follow the steps described in the [victims selection process](#how-are-victims-selected-and-processed).
+
+Running the queue selection process before the [victims selection process](#how-are-victims-selected-and-processed) has the disadvantage of not sorting by pod significance across all queues. However, given the advantages described above, queue selection ensures fairness in the preemption process triggered by lowering a parent queue quota.
+
+## “Preemption does not help” situations
+
+The resources used by each victim may or may not differ. Selecting victims from the potential victims involves the necessary checks, one of which is to ensure usage doesn't go below the guaranteed resources. So, there could be cases where there are no viable options to proceed further. For example: preemptable resources are \{memory: 40 GB\}, guaranteed resources are \{memory: 50 GB\}, usage is \{memory: 100 GB\}, and there are 5 victims, each using \{memory: 20 GB\}. Preempting 2 victims is not a pr [...]
+
+Unlike intra-queue preemption, where an ask (the preemptor) goes through the preemption process every scheduling cycle at an interval of \`X\` seconds, so that the probability of preemption yielding a positive outcome after multiple attempts is higher, this preemption process has no such inherent retry capability to increase the probability of successful preemption. Even if it were available, it would not help much, as the running victims are long lived workloads and not going to change immedi [...]
+
+However, there could be situations where the above explained "best effort" cannot be applied to get the best out of the process, as it depends on the number of potential victims available at any given moment. If there are not enough victims available, maybe only 1 victim, preemption does not help at all, as attempting it would make usage go below the guaranteed resources.
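+
+A minimal Go sketch of this guarantee-bounded, best effort loop, using the example numbers above (hypothetical helper, not scheduler code):
+
+```go
+package main
+
+import "fmt"
+
+// bestEffort preempts victims one by one but never lets usage fall below the
+// guarantee; it may stop short of the target, which is the "best effort"
+// outcome described above. Returns the amount actually freed.
+func bestEffort(target, usage, guaranteed int64, victims []int64) (freed int64) {
+	for _, v := range victims {
+		if freed >= target {
+			break // target reached
+		}
+		if usage-freed-v < guaranteed {
+			break // preempting this victim would breach the guarantee
+		}
+		freed += v
+	}
+	return freed
+}
+
+func main() {
+	// target 40G, usage 100G, guaranteed 50G, five victims of 20G each:
+	// two victims are preempted (40G freed); a third would breach 50G.
+	fmt.Println(bestEffort(40, 100, 50, []int64{20, 20, 20, 20, 20})) // 40
+}
+```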
+
+Administrators should be informed about all such situations through well defined mechanisms, to avoid surprises.
+
+## Relevance of existing Intra Queue Preemption Configurations and Queue Setups
+
+In short, all existing intra-queue preemption configurations have no relevance from this design's point of view and should not be used in conjunction with the newly added configurations.
+
+### Preemption Fence & Preemption Delay
+
+Existing preemption related properties like preemption.policy and preemption.delay should not be considered or used in conjunction with this [preemption](#motivation), as its objective is completely different. In addition, the reason for not taking preemption.policy (fence) into account, especially for the child queues underneath the current parent queue being worked on, is to avoid biased and incorrect decisions. A fence is a unidirectional way of traversing down t [...]
+
+Queue=\>properties=\>preemption.delay should not be confused with the above discussed quota change preemption delay, [Queue=\>resources=\>quota.preemption.delay](#quota-preemption-delay); they are treated differently, as the former was introduced to compensate for the scheduling cycle interval and is used only in the scheduling cycle core path.
+
+### Priority Fence & Offset
+
+priority.policy (fence) and priority.offset make an impact when working across different queues. These two properties are used in inter-queue preemption before choosing victims from queues located under different queue paths. This preemption, in contrast, looks inside a single queue, or the queues located underneath the parent queue being worked on to apply the quota change. So, the above priority related properties need not be considered.
+
+### Dynamic Queues
+
+When a dynamic queue is created, it uses the child template defined on its immediate parent queue or any ancestor queue in the queue hierarchy. When a quota is lowered on an existing child template, the currently used dynamic queue can be left as is: the queue itself has a shorter life span and will be deleted anyway once the application completes. For a newly created dynamic queue, the resource configuration is based on the newly defined quota and works as expected.
+
+Quota can be lowered through annotations as well. In such cases, the default behavior applies, as there is no way to set quota.preemption.delay through annotations as of now. This means a newly lowered quota is applied immediately, but only for new requests; existing applications continue to run as is and go through a natural exit.
+
+## Future Plans
+
+1. Victims could be further segregated based on nodes. Once node wise victims are available, node fulfillment processes like bin packing could be brought into the picture.
+2. Introduce a "closest match" algorithm to choose victims, which is less intrusive from a general perspective: prefer the "closest match" victim by starting with victims whose resource types exactly match the "preemptable resources", then order them ascending by the count of non matching resource types. A sketch of this ordering follows below.
+
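+A minimal Go sketch of the "closest match" ordering from item 2, with hypothetical names:
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+)
+
+// nonMatching counts the victim's resource types that are absent from the
+// preemptable set; an exact match scores zero.
+func nonMatching(victim, preemptable map[string]int64) int {
+	n := 0
+	for typ := range victim {
+		if _, ok := preemptable[typ]; !ok {
+			n++
+		}
+	}
+	return n
+}
+
+// closestMatchOrder sorts victims ascending by their non-matching count, so
+// exact matches come first and the least related victims come last.
+func closestMatchOrder(victims []map[string]int64, preemptable map[string]int64) {
+	sort.SliceStable(victims, func(i, j int) bool {
+		return nonMatching(victims[i], preemptable) < nonMatching(victims[j], preemptable)
+	})
+}
+
+func main() {
+	preemptable := map[string]int64{"memory": 30}
+	victims := []map[string]int64{
+		{"memory": 10, "vcore": 2}, // one non-matching type
+		{"memory": 20},             // exact match, preferred
+	}
+	closestMatchOrder(victims, preemptable)
+	fmt.Println(victims) // [map[memory:20] map[memory:10 vcore:2]]
+}
+```
+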
diff --git a/sidebars.js b/sidebars.js
index 3abb47de95..9b0c80013d 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -103,6 +103,7 @@ module.exports = {
'design/cache_removal',
'design/preemption',
'design/simple_preemptor',
+ 'design/quota_change_enforcement_through_preemption',
'design/generic_resource',
'design/priority_scheduling',
'design/resilience',
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]