Anthony Wu created YUNIKORN-1347:
------------------------------------

             Summary: Yunikorn triggers EKS auto-scaling when pods have 
surpassed the queue limit 
                 Key: YUNIKORN-1347
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1347
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler, shim - kubernetes
            Reporter: Anthony Wu


Hi guys,

We are trying to utilise YuniKorn to manage our AWS EKS infrastructure and limit resource usage for different users and groups. We also use the k8s cluster autoscaler ([https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler]) to auto-scale the cluster when necessary.

*Environment*
 * AWS EKS on k8s 1.21
 * YuniKorn 1.1 running as a k8s scheduler plugin for maximum compatibility
 * cluster-autoscaler V1.21.0

*Issues*:

Let's say we have a queue with the below limit:
{code:yaml}
queues:               
- name: dev
  submitacl: "*"
  resources:
    max: 
      memory: 100Gi
      vcore: 10 
{code}
 

Then we try to create 4 pods in the {{dev}} queue, each requesting 5 vcores and 50Gi of memory.
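For concreteness, here is a minimal sketch of one of these pods. The pod name, image, {{applicationId}} value, and the {{schedulerName}} are our assumptions; the {{applicationId}} and {{queue}} labels follow YuniKorn's standard pod metadata:
{code:yaml}
apiVersion: v1
kind: Pod
metadata:
  name: dev-pod-1                  # hypothetical name
  labels:
    applicationId: dev-app-0001    # groups pods into one YuniKorn application (assumed id)
    queue: root.dev                # submit to the dev leaf queue
spec:
  schedulerName: yunikorn          # may differ when YuniKorn runs in scheduler-plugin mode
  containers:
  - name: main
    image: busybox                 # placeholder workload
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "5"
        memory: 50Gi
{code}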

Then we get 2 pods {{Running}} and 2 pods {{Pending}}, because the queue has reached its limit of 100Gi memory and 10 vcores.

We would expect the queued pods not to trigger EKS auto-scaling, as they cannot be allocated until other resources have been released in the queue.

But what we see is that the queued pods still trigger cluster auto-scaling regardless, as shown in the example below:

{code}
Status:       Pending
...
Conditions:
  Type           Status
  PodScheduled   False
Events:
  Type     Reason            Age    From                Message
  ----     ------            ----   ----                -------
  Warning  FailedScheduling  3m5s   yunikorn            0/147 nodes are available: 147 Pod is not ready for scheduling.
  Warning  FailedScheduling  3m5s   yunikorn            0/147 nodes are available: 147 Pod is not ready for scheduling.
  Normal   Scheduling        3m3s   yunikorn            yunikorn/dask-user-07ff5f3b-8qjkl8 is queued and waiting for allocation
  Normal   TriggeredScaleUp  2m53s  cluster-autoscaler  pod triggered scale-up: [{eksctl-cluster-nodegroup-spot-xlarge-compute-1-NodeGroup-8VURTD4WKCYV 0->4 (max: 16)}]
{code}

So eventually EKS added some hosts that were never actually used or allocated, because the pods were not yet approved to be scheduled.

We also tried gang scheduling with the pods in a task group, but it has a similar issue: even when the whole gang is not ready to be scheduled, YuniKorn creates the placeholder pods, which trigger auto-scaling of the EKS cluster.
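For reference, this is roughly how a pod in the task group was annotated; the task-group name, sizes, and image are illustrative, and the annotations follow YuniKorn's documented gang-scheduling metadata:
{code:yaml}
apiVersion: v1
kind: Pod
metadata:
  name: dask-worker-0                     # hypothetical name
  labels:
    applicationId: dask-user-07ff5f3b     # assumed application id
    queue: root.dev
  annotations:
    yunikorn.apache.org/task-group-name: workers
    yunikorn.apache.org/task-groups: |-
      [{
        "name": "workers",
        "minMember": 4,
        "minResource": {"cpu": "5", "memory": "50Gi"}
      }]
spec:
  schedulerName: yunikorn                 # may differ in scheduler-plugin mode
  containers:
  - name: worker
    image: busybox                        # placeholder workload
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "5"
        memory: 50Gi
{code}
YuniKorn then creates {{minMember}} placeholder pods for the task group; while the gang waits, those placeholders sit in {{Pending}} and the auto-scaler picks them up just like regular pods.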

*Causes and potential solutions*
We looked at the source code of both the auto-scaler and YuniKorn, and we think the reason is simply that the auto-scaler does not know about YuniKorn-specific events and states of a Pod (Pending but not yet quota-approved). It searches for all Pods with {{PodScheduled=False}} and then checks whether it needs to add resources for them.
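In other words, the auto-scaler only inspects the generic pod status below; as far as we can tell it treats any {{Pending}} pod whose {{PodScheduled}} condition is {{False}} (with the standard {{Unschedulable}} reason) as a scale-up candidate, so it cannot tell "queued by YuniKorn over quota" apart from "no node fits". A hypothetical status of one of our queued pods:
{code:yaml}
# What the auto-scaler sees on a queued pod (reason/message values assumed)
status:
  phase: Pending
  conditions:
  - type: PodScheduled
    status: "False"
    reason: Unschedulable                    # the generic condition the auto-scaler keys on
    message: Pod is not ready for scheduling # YuniKorn-specific detail, invisible to the auto-scaler
{code}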

The issue could be resolved from either side:
- To solve it on the auto-scaler side, the auto-scaler would need to know about YuniKorn's special events and states.
- To solve it on the YuniKorn side, we think it needs to not create the pod unless it is quota-approved - not sure how hard this is to achieve.
   - As soon as a pod is created, it goes to {{Pending}} and the auto-scaler will try to pick it up.

We think solving it on the YuniKorn side would be cleaner, since the auto-scaler should not need to know the k8s scheduler's implementation in order to make a decision. Also, other auto-scaler alternatives like AWS Karpenter could suffer from the same issue when interacting with YuniKorn.

Wondering whether this issue report makes sense to you guys. Let us know if there are any other solutions and whether it is possible to solve this in the future :)

Thanks a lot!
 

 


