Chaoran Yu created YUNIKORN-584:
-----------------------------------
Summary: The node information could become out of sync with the
underlying cluster resources
Key: YUNIKORN-584
URL: https://issues.apache.org/jira/browse/YUNIKORN-584
Project: Apache YuniKorn
Issue Type: Bug
Components: core - cache, shim - kubernetes
Reporter: Chaoran Yu
Fix For: 0.10
There are cases where YK may think that the cluster doesn't have enough
resources even though that's not actually the case. This has happened to me
twice: after running YK in a cluster for a few days, the [nodes
endpoint|https://yunikorn.apache.org/docs/next/api/scheduler#nodes] suddenly
shows that the cluster has only one node (i.e. the node that YK itself is
running on), even though the K8s cluster has 10 nodes in total. If I then try
to schedule a workload that requires more resources than are available on that
node, YK leaves the pods pending with an event like the one below:
{quote}Normal PodUnschedulable 41s yunikorn Task <namespace>/<pod> is
pending for the requested resources become available{quote}
because it's not aware that other nodes in the cluster have available resources.
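When this happens, the discrepancy can be confirmed by comparing the scheduler's view of the cluster against the API server's. A minimal check, assuming the YuniKorn web service is port-forwarded to localhost:9080 and the response shape matches the linked docs (the service name, namespace, and jq path below are assumptions; adjust to your deployment and YK version):

```shell
# Expose the YuniKorn REST API locally (service name/namespace are assumptions).
kubectl -n yunikorn port-forward svc/yunikorn-service 9080:9080 &

# Nodes as seen by YuniKorn: the endpoint returns a list of partitions,
# each with a nodesInfo array (shape may differ between YK versions).
curl -s http://localhost:9080/ws/v1/nodes | jq '.[0].nodesInfo | length'

# Nodes as seen by the K8s API server; the two counts should match.
kubectl get nodes --no-headers | wc -l
```

If the first count is smaller than the second, the scheduler's node cache has drifted from the real cluster state, which is exactly the symptom described above.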
All of this can be fixed just by restarting YK (scaling the replica count down
to 0 and then back up to 1). So it seems that a caching issue is the cause,
although it's not yet clear to me what exact conditions trigger this bug.
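The restart workaround amounts to scaling the scheduler deployment down and back up. A sketch, assuming the scheduler runs as a Deployment named yunikorn-scheduler in the yunikorn namespace (both names are assumptions; adjust to your install):

```shell
# Scale the scheduler to zero replicas so its in-memory cache is discarded...
kubectl -n yunikorn scale deployment yunikorn-scheduler --replicas=0

# ...then bring it back up; on restart it rebuilds its node list from the
# API server, and the nodes endpoint reflects the real cluster again.
kubectl -n yunikorn scale deployment yunikorn-scheduler --replicas=1
```

This only clears the symptom; the stale cache will presumably drift again until the underlying sync bug is found.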
My environment is on AWS EKS with K8s 1.17, if that matters.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)