[
https://issues.apache.org/jira/browse/YUNIKORN-1347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anthony Wu updated YUNIKORN-1347:
---------------------------------
Description:
Hi guys,
We are trying to utilise Yunikorn to manage our AWS EKS infrastructure to limit
resource usage for different users and groups. We also use k8s cluster
auto-scaler
([https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler]) for
auto scaling of the cluster when necessary.
*Environment*
* AWS EKS on k8s 1.21
* Yunikorn 1.1 running as k8s scheduler plugin to be most compatible
* cluster-autoscaler V1.21.0
*Issues*
Let's say we have a queue with the limits below:
{code:yaml}
queues:
  - name: dev
    submitacl: "*"
    resources:
      max:
        memory: 100Gi
        vcore: 10
{code}
Then we try to create 4 pods in the `dev` queue, each requesting 5 cores and
50Gi of memory.
We end up with 2 pods {{Running}} and 2 pods {{Pending}}, because the queue has
reached its limit of 100Gi memory and 10 vcores.
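As a rough illustration of the arithmetic above (a simplified model only, not Yunikorn's actual admission logic; the `Resources` class, the `admit` function and the pod names are made up for this sketch):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    vcore: int      # whole cores
    memory_gi: int  # GiB

    def __add__(self, other):
        return Resources(self.vcore + other.vcore,
                         self.memory_gi + other.memory_gi)

    def fits_within(self, limit):
        return self.vcore <= limit.vcore and self.memory_gi <= limit.memory_gi


def admit(requests, queue_max):
    """Admit requests in order for as long as the queue stays within its max."""
    used = Resources(0, 0)
    running, pending = [], []
    for name, req in requests:
        if (used + req).fits_within(queue_max):
            used = used + req
            running.append(name)
        else:
            pending.append(name)
    return running, pending, used


# Values taken from the `dev` queue config and the 4 pods described above.
queue_max = Resources(vcore=10, memory_gi=100)
pods = [(f"pod-{i}", Resources(vcore=5, memory_gi=50)) for i in range(4)]

running, pending, used = admit(pods, queue_max)
print("running:", running)
print("pending:", pending)
print("used:", used)
```

Two pods already consume the whole 10 vcore / 100Gi quota, so the other two stay pending no matter how many nodes the cluster has.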
We would expect the queued pods not to trigger EKS auto-scaling, as they cannot
be allocated until other resources in the queue have been released.
But what we see is that the queued pods still trigger cluster auto-scaling
regardless, as shown in the example below:
{code:java}
Status:  Pending
...
Conditions:
  Type          Status
  PodScheduled  False
Events:
  Type     Reason            Age    From                Message
  ----     ------            ----   ----                -------
  Warning  FailedScheduling  3m5s   yunikorn            0/147 nodes are available: 147 Pod is not ready for scheduling.
  Warning  FailedScheduling  3m5s   yunikorn            0/147 nodes are available: 147 Pod is not ready for scheduling.
  Normal   Scheduling        3m3s   yunikorn            yunikorn/dask-user-07ff5f3b-8qjkl8 is queued and waiting for allocation
  Normal   TriggeredScaleUp  2m53s  cluster-autoscaler  pod triggered scale-up: [{eksctl-cluster-nodegroup-spot-xlarge-compute-1-NodeGroup-8VURTD4WKCYV 0->4 (max: 16)}]
{code}
So eventually EKS added some hosts that are never actually used, because the
pods are not yet approved to be scheduled.
We also tried gang scheduling with the pods in a task group, but it has a
similar issue: even when the whole gang is not ready to be scheduled, Yunikorn
creates the placeholder pods, which trigger auto-scaling of the EKS cluster.
*Causes and potential solutions*
We looked at the source code of both the auto-scaler and Yunikorn, and we think
the reason is simply that the auto-scaler does not know about the
Yunikorn-specific events and state (Pending but not QuotaApproved) of a pod. It
looks up all pods with `PodScheduled=False` and then checks whether it needs to
add resources for them.
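To make the point concrete, here is a toy model of that selection (our reading of the behaviour, not the auto-scaler's real code; the `Pod` class, the `quota_approved` field and the pod names are invented for this sketch):

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    phase: str            # e.g. "Pending" or "Running"
    pod_scheduled: bool   # value of the PodScheduled condition
    quota_approved: bool  # scheduler-internal state, hidden from the selection


def scale_up_candidates(pods):
    """Select pods the way described above: only PodScheduled=False matters."""
    return [p.name for p in pods if not p.pod_scheduled]


pods = [
    Pod("running-pod", "Running", pod_scheduled=True, quota_approved=True),
    # Quota is fine, but no node has room: a scale-up would help.
    Pod("needs-node", "Pending", pod_scheduled=False, quota_approved=True),
    # Queue limit reached: a scale-up would not help.
    Pod("over-quota", "Pending", pod_scheduled=False, quota_approved=False),
]

candidates = scale_up_candidates(pods)
print(candidates)
```

Both pending pods end up in the candidate list, because nothing in the selection looks at the quota state.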
The issue could be resolved from either side:
- To solve it from the auto-scaler side, the auto-scaler needs to know about
the special events and state of Yunikorn
- To solve it from the Yunikorn side, I think it needs to not create the pod,
or at least not leave it in the `Pending` phase, until it is quota approved
** not sure how hard this is to achieve, but as long as a pod is created and
goes to Pending, the auto-scaler will try to pick it up
We think solving it from the Yunikorn side would be cleaner, since the
auto-scaler should not need to know the k8s scheduler implementation in order
to make a decision. Other auto-scaler alternatives, like AWS Karpenter, could
also suffer from the same issue when interacting with Yunikorn.
Wondering whether this issue report makes sense to you guys. Let us know if
there are any other solutions and whether this could be solved in the future
:)
Thanks a lot!
> Yunikorn triggers EKS auto-scaling even pods requests have exceeded the queue
> limit
> ------------------------------------------------------------------------------------
>
> Key: YUNIKORN-1347
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1347
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler, shim - kubernetes
> Reporter: Anthony Wu
> Priority: Major