Hi Craig

Thanks for the write-up; it clearly explains the rationale behind this. I
appreciate you laying all of this out in detail so that all community
members can understand the issue in depth. Thank you!

I agree with all the points you listed. My concern comes from another
angle: although the plugin mode is not as elegant as the standalone mode,
it works and serves the purpose of bringing YuniKorn's value without
breaking existing setups. This helps users who run mixed workloads on
large existing K8s clusters. Removing this feature means those users have
to migrate, which can be risky for production.

Digging a little more into the "Migration concerns" section: if we suggest
that users migrate from the plugin mode to the standalone mode, what are
the implications for existing workloads running with the default
scheduler? I checked the scheduler configuration
<https://kubernetes.io/docs/reference/scheduling/config/>, and
theoretically I think we can support most of the configured plugins, but I
am not sure whether we can support the plugins that implement multiple
extension points. Should we have a document explaining the compatibility
of the standalone YuniKorn scheduler vs. the default scheduler, and any
potential limitations, to help existing plugin-mode users better
understand the impact of migration?

Another thing: I want to understand all the concerns with keeping this
feature. I think you made the following clear:
1. Maintenance effort - keeping the quality of the plugin mode at the same
level as the standalone mode
2. Complexity of the e2e tests
3. Deeper dependency on the Kubernetes private APIs
I want to understand more about #3: now that we are leveraging the
scheduling framework APIs, is this still an outstanding issue?
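
Just to check my own understanding of #3 (a rough illustration only,
package paths written from memory): even in standalone mode, the
scheduling framework types come from the in-tree scheduler package rather
than a published staging module, roughly like this:

    package example

    import (
        // Public staging module: configuration API types only.
        schedconfig "k8s.io/kube-scheduler/config/v1"

        // In-tree ("private") package: the actual scheduling framework.
        // Depending on it means depending on k8s.io/kubernetes itself,
        // with the usual replace directives in go.mod for its staging modules.
        "k8s.io/kubernetes/pkg/scheduler/framework"
    )

    var (
        _ schedconfig.KubeSchedulerConfiguration // public config API
        _ framework.NodeInfo                     // in-tree framework type
    )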

Hope this makes sense.
Weiwei

On Tue, Apr 9, 2024 at 12:03 PM Peter Bacsko <[email protected]> wrote:

> Hi all,
>
> thanks Craig for writing this excellent, detailed summary, including
> historical context.
>
> As we already talked about it on Slack, I'm definitely +1 for removing the
> plugin. My main gripes are:
>
> 1. It overcomplicates the codebase. There are two branches for plugin vs.
> non-plugin mode, and the scheduler cache tracks way too much state because
> of it (SchedulerCache: pendingAllocations, inProgressAllocations,
> schedulingTasks, taskBloomFilterRef; cache.Task: schedulingState). This
> adds significant complexity to the code, making maintenance and debugging
> difficult.
>
> 2. Some issues can only be solved with completely ugly hacks:
> a) The two scheduler caches can become out-of-sync, and detecting this is
> a real challenge. We can run a goroutine which checks whether
> YuniKorn-scheduled pods are indeed scheduled, but what if they aren't?
> Then these pods must be invalidated inside YuniKorn. We can remove and
> re-add them, which just feels wrong. We can add new code for it, which,
> again, just complicates the code base. A shared scheduler cache would
> require a lot of copy-paste code so we can actually access it from the
> default scheduler. We probably need to create an intermediate layer which
> adapts one type to another. I can't even estimate how much effort this
> could possibly be.
>
> b) The extra K8s API call to activate Unschedulable pods (place them in
> the activeQ again). If there's no scheduling decision for a given pod, we
> mark it Unschedulable, but we need to re-trigger scheduling as soon as
> YuniKorn selects a target node. So we need to update the pod with a
> network round-trip call to do this (a rough sketch follows below this
> list)... Or we can retrieve the activeQ instance, which requires some
> copy-pasting from the existing kube-scheduler so we can obtain the
> reference. This also means that, from time to time, we need to check how
> much our copy-pasted code deviates from the actual K8s code.
>
> 3. I believe all e2e tests run on a single machine inside VMs/containers.
> So having a smaller test matrix means faster e2e test execution
> (hopefully).
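>
> For 2b, the workaround is conceptually just this (a rough sketch, not the
> actual shim code):
>
>     package example
>
>     import (
>         "context"
>         "fmt"
>         "time"
>
>         metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
>         "k8s.io/apimachinery/pkg/types"
>         "k8s.io/client-go/kubernetes"
>     )
>
>     // nudgePod patches a harmless annotation onto an Unschedulable pod so
>     // that the default scheduler re-queues and re-evaluates it. This is
>     // the extra network round trip mentioned above. The annotation key
>     // "yunikorn.apache.org/trigger" is hypothetical, illustration only.
>     func nudgePod(ctx context.Context, client kubernetes.Interface,
>         namespace, name string) error {
>         patch := []byte(fmt.Sprintf(
>             `{"metadata":{"annotations":{"yunikorn.apache.org/trigger":"%d"}}}`,
>             time.Now().UnixNano()))
>         _, err := client.CoreV1().Pods(namespace).Patch(ctx, name,
>             types.MergePatchType, patch, metav1.PatchOptions{})
>         return err
>     }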
>
> Suggested timeline looks very reasonable to me.
>
> Thanks,
> Peter
>
> On Tue, Apr 9, 2024 at 5:26 PM Craig Condit <[email protected]> wrote:
>
> > -------
> > Preface
> > =======
> >
> > There have been quite a few informal discussions on this topic, and it's
> > time that we bring this formally to the yunikorn-dev mailing list for
> > further discussion...
> >
> >
> > -------
> > Summary
> > =======
> >
> > We are actively planning to deprecate the YuniKorn plugin mode for
> > eventual removal. This has been an experimental feature since YuniKorn
> > 1.0.0, but has not proven to be as stable or performant as our default
> > deployment mode. Additionally, it has proven to be a large maintenance
> > burden -- even for contributors who do not actively use it.
> >
> >
> > -------
> > History
> > =======
> >
> > To adequately explain the current situation and why this is being
> planned,
> > it's helpful to understand some of the history of both Kubernetes and
> > YuniKorn and how they interact.
> >
> > Approximately three years ago, the Kubernetes community decided to
> > implement an internal Plugin API to help streamline the Kubernetes
> > scheduler codebase. This API is also known as the Scheduling Framework
> [1].
> > At the time of the announcement, very few plugins had been implemented,
> and
> > the API was positioned as a way to extend scheduler functionality in an
> > easier fashion. The choice to name it a "plugin" API unfortunately
> invokes
> > a lot of incorrect connotations, especially around intended use. When
> most
> > developers think of "plugins" they think of 3rd party extensions to
> things
> > like web browsers. The Kubernetes Scheduler Plugin API is an internal API
> > framework, primarily meant for use by internal components, as evidenced
> by
> > the fact that it only exists in the internal kubernetes project, and not
> in
> > any of the externally visible (and public) modules. To make use of the
> > Kubernetes scheduling framework, all plugins must be compiled together
> from
> > source into a single unified scheduler binary.
> >
> > At the time of the announcement, it seemed to those of us working on
> > YuniKorn at Cloudera that this could provide a cleaner way for YuniKorn
> to
> > integrate with Kubernetes and hopefully provide a version of YuniKorn
> which
> > would have improved compatibility with the default Kubernetes scheduler.
> > Work was begun on an internal prototype at Cloudera which had a number of
> > significant limitations but did (somewhat) work. That prototype was
> largely
> > rewritten and contributed upstream as part of YuniKorn 1.0.0 in May of
> 2022
> > and marked as experimental. Since YuniKorn 1.0, ongoing enhancements have
> > been made to this feature. However, nearly two years after the initial
> > public implementation, the plugin mode has not lived up to its promise
> and
> > in fact has hindered progress on achieving a stable YuniKorn scheduler
> > (more on this later).
> >
> > In the meantime, much has changed in the implementation of the upstream
> > Kubernetes scheduler. The scheduler has moved from a monolithic
> collection
> > of features into a simple event loop that calls into scheduler plugins to
> > perform all of the scheduling tasks. There is no longer any core
> > functionality that is implemented outside of the plugins themselves.
> >
> > Somewhat counterintuitively, this has resulted in increased stability for
> > the standard YuniKorn deployment model. Prior to the existence of the
> > plugin API, YuniKorn contained a lot of logic to essentially re-implement
> > functionality from the default scheduler in the K8Shim. While this
> worked,
> > it created potential incompatibilities as the two codebases evolved
> > independently. As the plugin API became more stable and more core
> > functionality was implemented with it, YuniKorn transitioned to calling
> > into those plugins for that functionality. Today, the standard deployment
> > of YuniKorn leverages all of the upstream Kubernetes scheduler
> > functionality by calling into the same plugins that the default scheduler
> > does. This means we have never been more compatible than we are today.
> >
> > At the same time, we now have multiple years of data to indicate that the
> > plugin version of YuniKorn has not improved compatibility or stability at
> > all (in fact quite the opposite).
> >
> >
> > ------------------------------------
> > YuniKorn -- Standard vs. plugin mode
> > ====================================
> >
> > In the standard YuniKorn deployment mode, YuniKorn acts as a standalone
> > scheduler, grouping pods into applications, assigning those applications
> to
> > queues, and processing the requests in those queues using configurable
> > policies. When requests are satisfied, YuniKorn binds each pod to a node,
> > and proceeds with the next request. As part of determining where (or if)
> a
> > pod may be scheduled, YuniKorn calls into the default scheduler plugins
> to
> > evaluate the suitability of a pod to a particular node. This means that
> as
> > new plugins are added to the default scheduler, we automatically gain the
> > same (compatible) functionality within YuniKorn simply by building (and
> > testing) against a newer Kubernetes release.
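> >
> > As a rough illustration (not the actual k8shim code, and framework
> > signatures differ slightly between Kubernetes releases), the per-node
> > suitability check is conceptually just a loop over the same in-tree
> > Filter plugins the default scheduler runs:
> >
> >     package example
> >
> >     import (
> >         "context"
> >
> >         v1 "k8s.io/api/core/v1"
> >         "k8s.io/kubernetes/pkg/scheduler/framework"
> >     )
> >
> >     // podFitsNode evaluates one (pod, node) combination by calling the
> >     // in-tree Filter plugins directly, library-style.
> >     func podFitsNode(ctx context.Context, state *framework.CycleState,
> >         filters []framework.FilterPlugin, pod *v1.Pod,
> >         nodeInfo *framework.NodeInfo) bool {
> >         for _, f := range filters {
> >             if status := f.Filter(ctx, state, pod, nodeInfo); !status.IsSuccess() {
> >                 // Rejected by this plugin (node affinity, taints, etc.)
> >                 return false
> >             }
> >         }
> >         // Every plugin accepts this (pod, node) combination.
> >         return true
> >     }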
> >
> > When YuniKorn itself is built as a plugin to the default scheduler, the
> > situation is much more complex. It's helpful to visualize the resulting
> > scheduler as having a "split-brain" architecture. On the one side, we
> have
> > YuniKorn operating much as it normally does, processing pods into
> > applications and queues, making scheduling decisions (including calling
> > into the official Kubernetes scheduler plugins). The one major difference
> > is that pods are not bound by this scheduler; they are simply marked
> > internally as ready. In the other half of the brain, we have the default
> > Kubernetes scheduler codebase running, with a special "yunikorn" plugin
> > defined as the last one in the plugin chain. This plugin implements
> > primarily the PreFilter and Filter scheduler API functions. The PreFilter
> > function is given a candidate pod and asked if it is schedulable. If that
> > returns true, the Filter function is then called with the same candidate
> > pod once for each possible node that may be schedulable and asked if that
> > combination is valid. The "yunikorn" plugin PreFilter implementation
> simply
> > returns true if the real YuniKorn scheduler has assigned a pod, and false
> > otherwise. The Filter implementation checks that the node YuniKorn has
> > assigned matches the requested node.
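> >
> > Greatly simplified, that plugin half boils down to something like the
> > sketch below. This is not the real implementation -- the allocation map
> > is invented for illustration, and the exact framework signatures vary by
> > Kubernetes release:
> >
> >     package example
> >
> >     import (
> >         "context"
> >
> >         v1 "k8s.io/api/core/v1"
> >         "k8s.io/apimachinery/pkg/types"
> >         "k8s.io/kubernetes/pkg/scheduler/framework"
> >     )
> >
> >     // yuniKornPlugin is a stand-in type; allocations would be fed by the
> >     // real YuniKorn core as it makes scheduling decisions.
> >     type yuniKornPlugin struct {
> >         allocations map[types.UID]string // podUID -> nodeName
> >     }
> >
> >     func (p *yuniKornPlugin) Name() string { return "yunikorn" }
> >
> >     // PreFilter: schedulable only once YuniKorn has picked a node.
> >     func (p *yuniKornPlugin) PreFilter(ctx context.Context,
> >         state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
> >         if _, ok := p.allocations[pod.UID]; ok {
> >             return nil, framework.NewStatus(framework.Success)
> >         }
> >         return nil, framework.NewStatus(framework.Unschedulable,
> >             "pod not yet allocated by YuniKorn")
> >     }
> >
> >     // Filter: accept only the node that YuniKorn selected for this pod.
> >     func (p *yuniKornPlugin) Filter(ctx context.Context,
> >         state *framework.CycleState, pod *v1.Pod,
> >         nodeInfo *framework.NodeInfo) *framework.Status {
> >         if node, ok := p.allocations[pod.UID]; ok && node == nodeInfo.Node().Name {
> >             return framework.NewStatus(framework.Success)
> >         }
> >         return framework.NewStatus(framework.Unschedulable,
> >             "node does not match YuniKorn's allocation")
> >     }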
> >
> > There are a number of limitations in the Plugin API that make this level
> > of complexity necessary. By design, plugins are not allowed to interact
> > with the scheduler directly, and must wait for plugin lifecycle methods
> > (such as Filter and PreFilter) to be called on them by the scheduler.
> > Plugins are also not allowed to interact with other plugins. YuniKorn
> > requires both of these abilities in order to function at all.
> >
> > Direct access to the scheduler is necessary in order to promote a pod
> back
> > to a schedulable queue when it becomes ready. Since we do not have this
> > ability when running in plugin mode, we have to resort to ugly hacks such
> > as modifying a live pod in the API server so that the Kubernetes
> scheduler
> > will pick it up and re-evaluate it.
> >
> > YuniKorn needs to be able to interact with plugins to perform its own
> > evaluations of (pod, node) combinations. Since we have no access to the
> > plugin chain instantiated by the Kubernetes scheduler (and in fact no
> > access to the scheduler object itself), we instantiate a parallel plugin
> > chain with the same configuration. This means we have duplicate watchers,
> > duplicate caches, duplicate plugins, and duplicate processing chains.
> > Because of this, there is no guarantee which of the two halves of our
> > "split-brain" scheduler will process a new pod first. If it happens to be
> > YuniKorn, we mark the pod schedulable (assuming it fits) and wait for the
> > Kubernetes scheduler to interact with the yunikorn plugin. However, if
> the
> > Kubernetes scheduler picks it up first, it will immediately ask the
> > yunikorn plugin whether or not the pod is schedulable, and since the
> plugin
> > has no knowledge of it yet, it must respond negatively. This results in
> the
> > pod being moved to the "unschedulable" queue within the Kubernetes
> > scheduler, where it may remain for quite some time, leading to
> > difficult-to-diagnose scheduling delays. Even worse, because there is
> > parallel state being kept between the two schedulers, and the consistency
> > of that state changes independently as cluster state changes, it's
> possible
> > for the plugin chain that the Kubernetes scheduler uses and the one used
> > internally by YuniKorn to arrive at different conclusions about whether a
> > particular pod is schedulable on a particular node. When this happens,
> > YuniKorn internally believes the pod is schedulable, and the Kubernetes
> > scheduler does not, leading to a pod being left in limbo that doesn't
> make
> > forward progress. We have observed this behavior in real clusters, and
> > there really is no solution.
> >
> > After almost three years working on this feature, we are still left with
> > fundamentally unsolvable issues such as this that arise because of the
> > inability to shoehorn YuniKorn's extensive functionality into the
> > (purposefully) limited Scheduler Plugin API.
> >
> > Due to all the duplicate processing and data structures
> > required to implement YuniKorn as a plugin, as well as the inherent
> > inefficiencies of the plugin API, we see scheduling throughput
> improvements
> > of 2-4x and nearly half the memory usage when using the standard YuniKorn
> > deployment mode vs. the plugin implementation. The standard deployment
> > model is also much more stable, as there is a single source of truth for
> > YuniKorn and scheduler plugins to use. Since we call into all the
> standard
> > plugins as part of pod / node evaluation, we support ALL features that
> the
> > default scheduler does within YuniKorn.
> >
> >
> > ---------------------
> > Impact on development
> > =====================
> >
> > The plugin feature also imposes a drain on the development process. It
> > doubles our testing efforts, as we need to spin up twice as many
> end-to-end
> > testing scenarios as before (one for each Kubernetes release we support
> x 2
> > for both scheduler implementations). Contributors often don't test with
> the
> > plugin version early, and because the two models are architecturally very
> > different, it's very common for developers to push a new PR, wait nearly
> an
> > hour for the e2e tests to complete, only to find that the board is half
> > green (standard mode) and half red (plugin mode). This results in
> increased
> > dev cycles and a major loss in productivity which would be eliminated if
> we
> > no longer needed to maintain two implementations.
> >
> >
> > ------------------------
> > Impact on supportability
> > ========================
> >
> > Many of you may recall the pain caused during the YuniKorn 1.4.0 release
> > cycle as we were forced to drop support for Kubernetes 1.23 and below
> from
> > our support matrix. It was simply impossible to build a YuniKorn release
> > that could work on both Kubernetes 1.23 and 1.27 simultaneously. The fact
> > is, that limitation was caused by the existence of the plugin mode. Had
> we
> > not been limited by having the plugin functionality integrated, we would
> > have been able to build against newer Kubernetes releases and still
> > function at runtime on older clusters. We discovered this at the time,
> but
> > decided it was best to not fragment the release by having some builds
> > available on old Kubernetes releases and others not.
> >
> > The very low-level and internal nature of the plugin API causes this to
> be
> > an ongoing risk for future release efforts as well. Considering that
> > upstream Kubernetes is currently in discussions to fundamentally redesign
> > how resources are used, this risk may become reality much sooner than we
> > would like. It's not inconceivable that we may end up with a "flag day"
> of
> > something like Kubernetes 1.32 being unsupportable at the same time as
> 1.33
> > (versions are chosen for illustrative purposes, not predictive of when
> > breakage may occur). This risk is much higher when deployment of YuniKorn
> > in plugin mode is required.
> >
> >
> > ------------------
> > Migration concerns
> > ==================
> >
> > For the most part, the standard and plugin deployment modes are
> > interchangeable (by design). The activation of the plugin mode is done by
> > setting the helm variable "enableSchedulerPlugin" to "true", so reverting
> > to the standard mode can be as simple as setting that variable to
> "false".
> > This is especially true if YuniKorn is being run with out-of-box default
> > configuration. It is expected that the "enableSchedulerPlugin" attribute
> > will be ignored, beginning with the same release where the plugin stops
> > being enabled by default.
> >
> > There is one area in which the two implementations differ behaviorally
> > that may need to be addressed depending on how YuniKorn is being used.
> The
> > YuniKorn admission controller supports a pair of configuration settings
> > ("admissionController.filtering.labelNamespaces" and
> > "admissionController.filtering.noLabelNamespaces") which allow pods to be
> > tagged with "schedulerName: yunikorn" but not have an Application ID
> > assigned to them if one was not already present. This is typically used
> in
> > plugin mode to send non-YuniKorn pods to the YuniKorn scheduler but have
> > the normal YuniKorn queueing logic bypassed.
> >
> > When using this feature, non-labeled pods arrive at the YuniKorn
> scheduler
> > without an Application ID assigned, causing the yunikorn plugin to
> disable
> > itself and use only the Kubernetes scheduler processing chain. In the
> > standard YuniKorn deployment mode (as of YuniKorn 1.4+), these pods are
> > automatically assigned a synthetic Application ID and processed in the
> same
> > way as all other pods. Therefore, it is important to ensure that these
> pods
> > are able to be mapped into an appropriate queue. When using the default,
> > out of the box configuration, this already occurs, as YuniKorn ships
> with a
> > single default queue and all pods map to it. However, with custom
> > configurations, it is necessary to ensure that a queue exists and
> existing
> > workloads can map successfully to it (ideally via placement rules). For
> > maximal compatibility, this queue should be unlimited in size (no quota).
> >
> > We understand that this is a gap in behavior that would need to be fixed
> > when migrating from the plugin mode to the standard mode. We do not have
> a
> > change or solution ready for that gap yet. However, the placement rules
> and
> > queue configuration are flexible enough to allow us to create a fix for
> > this. We believe we will be able to provide the first steps towards
> closing
> > that gap as part of the next release.
> >
> >
> > ------------------
> > Potential timeline
> > ==================
> >
> > There are known users of the plugin feature in the community, so care
> must
> > be taken in how and when the feature is removed. We need to give users
> time
> > to migrate. We propose the following release timeline:
> >
> > - YuniKorn 1.6.0 - Announce the deprecation of the plugin model, but no
> > code changes.
> > - YuniKorn 1.7.0 - Emit warnings when the plugin mode is active, but
> > nothing else.
> > - YuniKorn 1.8.0 - Stop testing and building the plugin as part of the
> > normal development cycle. [**]
> > - YuniKorn 1.9.0 - Remove the implementation entirely.
> >
> > [**] We do not intend to break compilation of the plugin scheduler as
> part
> > of the 1.8.0 release, but will no longer provide pre-compiled binaries.
> > Users could still build the plugin themselves if required, but it would
> be
> > untested and unsupported.
> >
> > Given YuniKorn releases tend to arrive at approximately 4 month
> intervals,
> > and we are midway through the 1.6.0 development cycle, this gives roughly
> > 18 months until the feature will be removed completely (of course, this
> is
> > only an estimate and not a commitment). For context, this is nearly as
> long
> > as the feature has been available publicly at all.
> >
> >
> >
> > ---------------------------------
> > Frequently asked questions (FAQs)
> > =================================
> >
> > - Why can't we just keep the existing plugin implementation around?
> Surely
> > it can't be that difficult to maintain.
> >
> > It's not simply a matter of difficulty in maintenance, though that is
> > certainly a concern. There are several "if plugin mode" branches in the
> > k8shim that would be eliminated. Additionally, half of our e2e tests,
> which
> > are run on every PR push, would no longer need to be run (and diagnosed
> > when they fail). More importantly, we insulate ourselves from future
> > Kubernetes code changes as we no longer need to reach as deeply into
> private
> > Kubernetes APIs. This has already proven to be an issue during the
> > Kubernetes 1.23 - 1.24 transition, and is very likely to be an issue
> again.
> > We would like to ensure support for the widest list of Kubernetes
> releases
> > possible, and eliminating this code makes that much easier.
> >
> >
> > - Can YuniKorn support custom external scheduler plugins? Would this
> > support change when the YuniKorn plugin mode no longer exists?
> >
> > YuniKorn currently does not support building with external scheduler
> > plugins. While that is in theory possible, due to the duplicate plugin
> > lists that the Kubernetes and YuniKorn schedulers use, it is extremely
> > complex and non-trivial. Even custom configuration for existing plugins
> is
> > problematic. Eliminating yunikorn as a plugin actually makes this much
> more
> > viable, as we could introduce functionality to customize the
> configuration
> > of existing plugins, and users could patch YuniKorn with external plugins
> > much more easily.
> >
> >
> > - Don't I need the plugin mode in order to deploy YuniKorn on a large
> > existing cluster without fear of breaking things?
> >
> > No. In fact, using the plugin mode in this way introduces much more
> > potential instability than the standard deployment does, and it also
> means
> > that instead of two schedulers in the cluster, you now have three (two of
> > them just happen to live in the same process). The plugin mode is known
> to
> > be slow, consume large amounts of memory, and be unstable under load.
> >
> > It's also a myth that reusing the default scheduler code leads to better
> > compatibility with the default scheduler. In addition to the instability
> > plugin mode introduces, you are still building a custom scheduler that is
> > very likely built against a different Kubernetes release than what your
> > cluster is running. For example, unless you are running a Kubernetes
> 1.29.2
> > cluster with default configurations (which is what YuniKorn 1.5.0 uses),
> > your scheduler implementation is not going to match the underlying
> cluster
> > at all. This is one of the reasons we run extensive end-to-end testing to
> > help catch potential issues, but this isn't something that improves by
> > using plugin mode.
> >
> > In short, regardless of which implementation you use, there's no
> > substitute for adequate testing in non-prod environments. The perception
> > may be there that plugin mode reduces this burden, but it really doesn't.
> > It adds significant complexity and instability which cannot be addressed.
> >
> > Since the default YuniKorn deployment mode calls into all the scheduler
> > plugins just as the default Kubernetes scheduler does, and in much the
> same
> > way, it actually has the highest compatibility with the default
> Kubernetes
> > scheduler. This isn't just theoretical -- we have multiple years of data
> > running both implementations on a large variety of clusters that bears
> this
> > out. Standard mode simply works better.
> >
> >
> > --------------
> > External links
> > --------------
> >
> > [1] Scheduling Framework:
> >
> https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/
> >
> >
>
