-------
Preface
=======
There have been quite a few informal discussions on this topic, and it's time
that we bring this formally to the yunikorn-dev mailing list for further
discussion...
-------
Summary
=======
We are actively planning to deprecate the YuniKorn plugin mode for eventual
removal. This has been an experimental feature since YuniKorn 1.0.0, but has
not proven to be as stable or performant as our default deployment mode.
Additionally, it has proven to be a large maintenance burden -- even for
contributors who do not actively use it.
-------
History
=======
To adequately explain the current situation and why this is being planned, it's
helpful to understand some of the history of both Kubernetes and YuniKorn and
how they interact.
Approximately three years ago, the Kubernetes community decided to implement an
internal Plugin API to help streamline the Kubernetes scheduler codebase. This
API is also known as the Scheduling Framework [1]. At the time of the
announcement, very few plugins had been implemented, and the API was positioned
as a way to extend scheduler functionality in an easier fashion. The choice to
name it a "plugin" API unfortunately invokes a lot of incorrect connotations,
especially around intended use. When most developers think of "plugins" they
think of 3rd party extensions to things like web browsers. The Kubernetes
Scheduler Plugin API is an internal API framework, primarily meant for use by
internal components, as evidenced by the fact that it only exists in the
internal kubernetes project, and not in any of the externally visible (and
public) modules. To make use of the Kubernetes scheduling framework, all
plugins must be compiled together from source into a single unified scheduler
binary.
At the time of the announcement, it seemed to those of us working on YuniKorn
at Cloudera that this could provide a cleaner way for YuniKorn to integrate
with Kubernetes and hopefully provide a version of YuniKorn which would have
improved compatibility with the default Kubernetes scheduler. Work was begun on
an internal prototype at Cloudera which had a number of significant limitations
but did (somewhat) work. That prototype was largely rewritten and contributed
upstream as part of YuniKorn 1.0.0 in May of 2022 and marked as experimental.
Since YuniKorn 1.0, ongoing enhancements have been made to this feature.
However, nearly two years after the initial public implementation, the plugin
mode has not lived up to its promise and in fact has hindered progress on
achieving a stable YuniKorn scheduler (more on this later).
In the meantime, much has changed in the implementation of the upstream
Kubernetes scheduler. The scheduler has moved from a monolithic collection of
features to a simple event loop that calls into scheduler plugins to perform
all of the scheduling tasks. There is no longer any core functionality that is
implemented outside of the plugins themselves.
Somewhat counterintuitively, this has resulted in increased stability for the
standard YuniKorn deployment model. Prior to the existence of the plugin API,
YuniKorn contained a lot of logic to essentially re-implement functionality
from the default scheduler in the K8Shim. While this worked, it created
potential incompatibilities as the two codebases evolved independently. As the
plugin API became more stable and more core functionality was implemented with
it, YuniKorn transitioned to calling into those plugins for that functionality.
Today, the standard deployment of YuniKorn leverages all of the upstream
Kubernetes scheduler functionality by calling into the same plugins that the
default scheduler does. This means we have never been more compatible than we
are today.
At the same time, we now have multiple years of data to indicate that the
plugin version of YuniKorn has not improved compatibility or stability at all
(in fact quite the opposite).
------------------------------------
YuniKorn -- Standard vs. plugin mode
====================================
In the standard YuniKorn deployment mode, YuniKorn acts as a standalone
scheduler, grouping pods into applications, assigning those applications to
queues, and processing the requests in those queues using configurable
policies. When requests are satisfied, YuniKorn binds each pod to a node, and
proceeds with the next request. As part of determining where (or if) a pod may
be scheduled, YuniKorn calls into the default scheduler plugins to evaluate the
suitability of a pod to a particular node. This means that as new plugins are
added to the default scheduler, we automatically gain the same (compatible)
functionality within YuniKorn simply by building (and testing) against a newer
Kubernetes release.
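
To make that interaction concrete, here is a minimal Go sketch of the
standard-mode flow, under the assumption that a scheduler plugin can be
modelled as a simple pod/node predicate. The names (nodeFilter, pickNode) are
illustrative stand-ins, not the real Scheduling Framework interfaces, which
carry far more context; the point is only that YuniKorn runs the same filters
the default scheduler would before it binds a pod.

    package main

    import "fmt"

    // nodeFilter stands in for an upstream scheduler plugin's Filter hook: it
    // answers whether a given pod fits a given node. This is deliberately
    // simplified; the real interfaces carry much more context.
    type nodeFilter func(pod, node string) bool

    // pickNode sketches the standard-mode flow: evaluate a pending request
    // against each candidate node by running the same filters the default
    // scheduler would run, and return the first node that passes all of them.
    func pickNode(pod string, nodes []string, filters []nodeFilter) (string, bool) {
        for _, node := range nodes {
            fits := true
            for _, filter := range filters {
                if !filter(pod, node) {
                    fits = false
                    break
                }
            }
            if fits {
                // In the real shim, the pod would now be bound to this node.
                return node, true
            }
        }
        // No node fits; the request stays pending in its queue.
        return "", false
    }

    func main() {
        // Toy filter: pretend only node-2 has capacity for the pod.
        hasCapacity := func(pod, node string) bool { return node == "node-2" }
        node, ok := pickNode("pod-a", []string{"node-1", "node-2"}, []nodeFilter{hasCapacity})
        fmt.Println(node, ok) // prints: node-2 true
    }
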
When YuniKorn itself is built as a plugin to the default scheduler, the
situation is much more complex. It's helpful to visualize the resulting
scheduler as having a "split-brain" architecture. On the one side, we have
YuniKorn operating much as it normally does, processing pods into applications
and queues, making scheduling decisions (including calling into the official
Kubernetes scheduler plugins). The one major difference is that pods are not
bound by this scheduler; they are simply marked internally as ready. In the
other half of the brain, we have the default Kubernetes scheduler codebase
running, with a special "yunikorn" plugin defined as the last one in the plugin
chain. This plugin primarily implements the PreFilter and Filter scheduler API
functions. The PreFilter function is given a candidate pod and asked whether it
is schedulable. If it answers yes, the Filter function is then called with the
same candidate pod once for each node that might be able to host it, and asked
whether that (pod, node) combination is valid. The "yunikorn" plugin's
PreFilter implementation simply returns success if the real YuniKorn scheduler
has already assigned the pod to a node, and failure otherwise. The Filter
implementation checks that the node being evaluated matches the node that
YuniKorn assigned.
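
The following self-contained Go sketch captures the essence of those two
hooks. The assignmentCache type and the method signatures are simplified
illustrations, not the actual k8shim code or the exact framework interfaces
(which also receive a CycleState and richer pod and node objects).

    package main

    import (
        "fmt"
        "sync"
    )

    // assignmentCache stands in for the shim's internal record of pods that
    // the YuniKorn half of the scheduler has already placed. The type and
    // field names here are illustrative only.
    type assignmentCache struct {
        mu       sync.RWMutex
        assigned map[string]string // pod UID -> node chosen by the YuniKorn core
    }

    func (c *assignmentCache) nodeFor(podUID string) (string, bool) {
        c.mu.RLock()
        defer c.mu.RUnlock()
        node, ok := c.assigned[podUID]
        return node, ok
    }

    // PreFilter approximates the plugin's role: let the pod proceed through
    // the default scheduler's filtering phase only if YuniKorn has already
    // assigned it to a node.
    func (c *assignmentCache) PreFilter(podUID string) error {
        if _, ok := c.nodeFor(podUID); !ok {
            return fmt.Errorf("pod %s not yet scheduled by YuniKorn", podUID)
        }
        return nil
    }

    // Filter approximates the per-node check: only the node YuniKorn picked
    // is considered valid.
    func (c *assignmentCache) Filter(podUID, nodeName string) error {
        if node, ok := c.nodeFor(podUID); !ok || node != nodeName {
            return fmt.Errorf("node %s does not match YuniKorn's assignment for pod %s", nodeName, podUID)
        }
        return nil
    }

    func main() {
        cache := &assignmentCache{assigned: map[string]string{"pod-a": "node-1"}}
        fmt.Println(cache.PreFilter("pod-a"))        // <nil>: YuniKorn already placed it
        fmt.Println(cache.Filter("pod-a", "node-2")) // error: wrong node
        fmt.Println(cache.PreFilter("pod-b"))        // error: default scheduler saw it first
    }

Note how a pod that the default scheduler half picks up before YuniKorn has
seen it ("pod-b" above) is rejected outright; that ordering problem is
discussed further below.
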
There are a number of limitations in the Plugin API that make this level of
complexity necessary. By design, plugins are not allowed to interact with the
scheduler directly, and must wait for plugin lifecycle methods (such as Filter
and PreFilter) to be called on them by the scheduler. Plugins are also not
allowed to interact with other plugins. YuniKorn requires both of these
abilities in order to function at all.
Direct access to the scheduler is necessary in order to promote a pod back to a
schedulable queue when it becomes ready. Since we do not have this ability when
running in plugin mode, we have to resort to ugly hacks such as modifying a
live pod in the API server so that the Kubernetes scheduler will pick it up and
re-evaluate it.
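
As a rough illustration of the shape of that workaround, the following sketch
uses client-go to patch an annotation onto a live pod so that the default
scheduler observes an update event and re-queues the pod. The annotation key
is made up for this example; it is not the exact field the k8shim modifies.

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    // nudgePod writes a trivial annotation to a live pod so that the default
    // scheduler observes an update and re-evaluates the pod instead of leaving
    // it parked in its unschedulable queue. The annotation key below is purely
    // illustrative, not the actual key used by the k8shim.
    func nudgePod(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
        patch := []byte(`{"metadata":{"annotations":{"example.com/yunikorn-requeue":"true"}}}`)
        _, err := client.CoreV1().Pods(namespace).Patch(
            ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
        return err
    }

    func main() {
        cfg, err := rest.InClusterConfig() // assumes this runs inside the cluster
        if err != nil {
            fmt.Println("not running in a cluster:", err)
            return
        }
        client := kubernetes.NewForConfigOrDie(cfg)
        if err := nudgePod(context.Background(), client, "default", "example-pod"); err != nil {
            fmt.Println("patch failed:", err)
        }
    }
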
YuniKorn needs to be able to interact with plugins to perform its own
evaluations of (pod, node) combinations. Since we have no access to the plugin
chain instantiated by the Kubernetes scheduler (and in fact no access to the
scheduler object itself), we instantiate a parallel plugin chain with the same
configuration. This means we have duplicate watchers, duplicate caches,
duplicate plugins, and duplicate processing chains. Because of this, there is
no guarantee which of the two halves of our "split-brain" scheduler will
process a new pod first. If it happens to be YuniKorn, we mark the pod
schedulable (assuming it fits) and wait for the Kubernetes scheduler to
interact with the yunikorn plugin. However, if the Kubernetes scheduler picks
it up first, it will immediately ask the yunikorn plugin whether or not the pod
is schedulable, and since the plugin has no knowledge of it yet, it must
respond negatively. This results in the pod being moved to the "unschedulable"
queue within the Kubernetes scheduler, where it may remain for quite some time,
leading to difficult-to-diagnose scheduling delays. Even worse, because the two
schedulers keep parallel copies of state, each updated independently as the
cluster changes, it is possible for the plugin chain that the Kubernetes
scheduler uses and the one used internally by YuniKorn to arrive at different
conclusions about whether a particular pod is schedulable on a particular node.
When this happens, YuniKorn internally believes the pod is schedulable, the
Kubernetes scheduler does not, and the pod is left in limbo, unable to make
forward progress. We have observed this behavior in real clusters, and there
really is no solution.
After almost three years working on this feature, we are still left with
fundamentally unsolvable issues such as this that arise because of the
inability to shoehorn YuniKorn's extensive functionality into the
(purposefully) limited Scheduler Plugin API.
Due to all the duplicate processing and data structures required to
implement YuniKorn as a plugin, as well as the inherent inefficiencies of the
plugin API, we see scheduling throughput improvements of 2-4x and nearly half
the memory usage when using the standard YuniKorn deployment mode vs. the
plugin implementation. The standard deployment model is also much more stable,
as there is a single source of truth for YuniKorn and scheduler plugins to use.
Since we call into all the standard plugins as part of pod / node evaluation,
we support ALL features that the default scheduler does within YuniKorn.
---------------------
Impact on development
=====================
The plugin feature also imposes a drain on the development process. It doubles
our testing efforts, as we need to spin up twice as many end-to-end testing
scenarios as before (one for each Kubernetes release we support x 2 for both
scheduler implementations). Contributors often don't test with the plugin
version early, and because the two models are architecturally very different,
it's very common for developers to push a new PR, wait nearly an hour for the
e2e tests to complete, only to find that the board is half green (standard
mode) and half red (plugin mode). This results in increased dev cycles and a
major loss in productivity which would be eliminated if we no longer needed to
maintain two implementations.
------------------------
Impact on supportability
========================
Many of you may recall the pain caused during the YuniKorn 1.4.0 release cycle
as we were forced to drop support for Kubernetes 1.23 and below from our
support matrix. It was simply impossible to build a YuniKorn release that could
work on both Kubernetes 1.23 and 1.27 simultaneously. The fact is, that
limitation was caused by the existence of the plugin mode. Had we not been
limited by having the plugin functionality integrated, we would have been able
to build against newer Kubernetes releases and still function at runtime on
older clusters. We discovered this at the time, but decided it was best to not
fragment the release by having some builds available on old Kubernetes releases
and others not.
The very low-level and internal nature of the plugin API causes this to be an
ongoing risk for future release efforts as well. Considering that upstream
Kubernetes is currently in discussions to fundamentally redesign how resources
are used, this risk may become reality much sooner than we would like. It's not
inconceivable that we could face a "flag day" where, for example, a single
YuniKorn release could not support both Kubernetes 1.32 and 1.33 (versions are
chosen for illustrative purposes, not as a prediction of when breakage may
occur). This risk is much higher as long as YuniKorn must support deployment in
plugin mode.
------------------
Migration concerns
==================
For the most part, the standard and plugin deployment modes are interchangeable
(by design). The activation of the plugin mode is done by setting the helm
variable "enableSchedulerPlugin" to "true", so reverting to the standard mode
can be as simple as setting that variable to "false". This is especially true
if YuniKorn is being run with the out-of-the-box default configuration. It is
expected that the "enableSchedulerPlugin" setting will simply be ignored,
beginning with the same release in which the plugin is no longer shipped by
default.
There is one area in which the two implementations differ behaviorally that may
need to be addressed depending on how YuniKorn is being used. The YuniKorn
admission controller supports a pair of configuration settings
("admissionController.filtering.labelNamespaces" and
"admissionController.filtering.noLabelNamespaces") which allow pods to be
tagged with "schedulerName: yunikorn" but not have an Application ID assigned
to them if one was not already present. This is typically used in plugin mode
to send non-YuniKorn pods to the YuniKorn scheduler but have the normal
YuniKorn queueing logic bypassed.
When using this feature, non-labeled pods arrive at the YuniKorn scheduler
without an Application ID assigned, causing the yunikorn plugin to disable
itself and use only the Kubernetes scheduler processing chain. In the standard
YuniKorn deployment mode (as of YuniKorn 1.4+), these pods are automatically
assigned a synthetic Application ID and processed in the same way as all other
pods. Therefore, it is important to ensure that these pods are able to be
mapped into an appropriate queue. When using the default, out of the box
configuration, this already occurs, as YuniKorn ships with a single default
queue and all pods map to it. However, with custom configurations, it is
necessary to ensure that a queue exists and existing workloads can map
successfully to it (ideally via placement rules). For maximal compatibility,
this queue should be unlimited in size (no quota).
We understand that this is a gap in behavior that would need to be fixed when
migrating from the plugin mode to the standard mode. We do not have a change or
solution ready for that gap yet. However, the placement rules and queue
configuration are flexible enough to allow us to create a fix for this. We
believe we will be able to provide the first steps towards closing that gap as
part of the next release.
------------------
Potential timeline
==================
There are known users of the plugin feature in the community, so care must be
taken in how and when the feature is removed. We need to give users time to
migrate. We propose the following release timeline:
- YuniKorn 1.6.0 - Announce the deprecation of the plugin model, but no code
changes.
- YuniKorn 1.7.0 - Emit warnings when the plugin mode is active, but nothing
else.
- YuniKorn 1.8.0 - Stop testing and building the plugin as part of the normal
development cycle. [**]
- YuniKorn 1.9.0 - Remove the implementation entirely.
[**] We do not intend to break compilation of the plugin scheduler as part of
the 1.8.0 release, but will no longer provide pre-compiled binaries. Users
could still build the plugin themselves if required, but it would be untested
and unsupported.
Given YuniKorn releases tend to arrive at approximately 4 month intervals, and
we are midway through the 1.6.0 development cycle, this gives roughly 18 months
until the feature will be removed completely (of course, this is only an
estimate and not a commitment). For context, this is nearly as long as the
feature has been available publicly at all.
---------------------------------
Frequently asked questions (FAQs)
=================================
- Why can't we just keep the existing plugin implementation around? Surely it
can't be that difficult to maintain.
It's not simply a matter of difficulty in maintenance, though that is certainly
a concern. There are several "if plugin mode" branches in the k8shim that would
be eliminated. Additionally, half of our e2e tests, which are run on every PR
push, would no longer need to be run (and diagnosed when they fail). More
importantly, we insulate ourselves from future Kubernetes code changes, as we
no longer need to reach as deeply into private Kubernetes APIs. This has
already proven to be an issue during the Kubernetes 1.23 - 1.24 transition, and
is very likely to be an issue again. We would like to support the widest range
of Kubernetes releases possible, and eliminating this code makes that much
easier.
- Can YuniKorn support custom external scheduler plugins? Would this support
change when the YuniKorn plugin mode no longer exists?
YuniKorn currently does not support building with external scheduler plugins.
While that is in theory possible, due to the duplicate plugin lists that the
Kubernetes and YuniKorn schedulers use, it is extremely complex and
non-trivial. Even custom configuration for existing plugins is problematic.
Eliminating yunikorn as a plugin actually makes this much more viable, as we
could introduce functionality to customize the configuration of existing
plugins, and users could patch YuniKorn with external plugins much more easily.
- Don't I need the plugin mode in order to deploy YuniKorn on a large existing
cluster without fear of breaking things?
No. Using the plugin mode in this way actually introduces much more potential
instability than the standard deployment does, and it means that instead of two
schedulers in the cluster, you now have three (two of them just happen to live
in the same process). The plugin mode is known to be slow, consume
large amounts of memory, and be unstable under load.
It's also a myth that reusing the default scheduler code leads to better
compatibility with the default scheduler. In addition to the instability plugin
mode introduces, you are still running a custom scheduler that is very likely
built against a different Kubernetes release than the one your cluster is
running. For example, unless you are running a Kubernetes 1.29.2 cluster with
default configurations (which is what YuniKorn 1.5.0 builds against), your
scheduler implementation is not going to match the underlying cluster exactly.
This is one
of the reasons we run extensive end-to-end testing to help catch potential
issues, but this isn't something that improves by using plugin mode.
In short, regardless of which implementation you use, there's no substitute for
adequate testing in non-prod environments. There may be a perception that
plugin mode reduces this burden, but it really doesn't. It adds significant
complexity and instability that cannot be addressed.
Since the default YuniKorn deployment mode calls into all the scheduler plugins
just as the default Kubernetes scheduler does, and in much the same way, it
actually has the highest compatibility with the default Kubernetes scheduler.
This isn't just theoretical -- we have multiple years of data running both
implementations on a large variety of clusters that bears this out. Standard
mode simply works better.
--------------
External links
--------------
[1] Scheduling Framework:
https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/