Eric Higgins created YUNIKORN-3025:
--------------------------------------
Summary: Support for application-level preemption
Key: YUNIKORN-3025
URL: https://issues.apache.org/jira/browse/YUNIKORN-3025
Project: Apache YuniKorn
Issue Type: New Feature
Components: core - scheduler
Reporter: Eric Higgins
We would like to use YuniKorn's gang scheduling feature to schedule ML training
jobs for different teams. We want to give each team a quota and allow teams to
borrow resources from other teams' quotas, with the borrowing job preempted
when the other team needs those resources back. However, this does not appear
to be supported currently, because YuniKorn lacks application-level preemption:
it preempts individual pods until it has freed enough resources, and those
pods may belong to different applications. This is a problem for us because our
training jobs are not fault-tolerant and fail if even a single pod is killed,
so we want an entire application to be preempted at once.
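
To illustrate the difference we are describing, here is a minimal sketch in Python. It is not YuniKorn code: the `Pod` type, the function names, and the GPU-count resource model are all hypothetical, and the "smallest application first" ordering is just one possible policy, chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    """Hypothetical pod record: name, owning application, GPUs held."""
    name: str
    app: str
    gpus: int


def pod_level_victims(pods, needed):
    """Pick pods one at a time until enough GPUs are freed.

    The victims may come from several applications, which is the
    behaviour described in this issue.
    """
    victims, freed = [], 0
    for pod in pods:
        if freed >= needed:
            break
        victims.append(pod)
        freed += pod.gpus
    return victims


def app_level_victims(pods, needed):
    """Pick whole applications until enough GPUs are freed.

    Assumed policy: preempt the smallest applications first.
    """
    by_app = defaultdict(list)
    for pod in pods:
        by_app[pod.app].append(pod)

    def total_gpus(item):
        return sum(p.gpus for p in item[1])

    victims, freed = [], 0
    for _app, members in sorted(by_app.items(), key=total_gpus):
        if freed >= needed:
            break
        victims.extend(members)  # all pods of the application go together
        freed += sum(p.gpus for p in members)
    return victims


if __name__ == "__main__":
    pods = [
        Pod("a-0", "job-a", 1),
        Pod("b-0", "job-b", 1),
        Pod("a-1", "job-a", 1),
        Pod("b-1", "job-b", 1),
    ]
    print([p.name for p in pod_level_victims(pods, 2)])
    print([p.name for p in app_level_victims(pods, 2)])
```

With the four pods above and 2 GPUs needed, the pod-level selection takes one pod from each of job-a and job-b, so both training jobs die. The application-level selection takes both pods of job-a and leaves job-b untouched.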
--
This message was sent by Atlassian Jira
(v8.20.10#820010)