[ 
https://issues.apache.org/jira/browse/YUNIKORN-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304281#comment-17304281
 ] 

Weiwei Yang commented on YUNIKORN-584:
--------------------------------------

hi [~yuchaoran2011] did you see any "reject" node messages in the scheduler 
log? I am thinking it is possible that YK crashed for some reason and got 
restarted. If you are using a build from before YUNIKORN-549 was fixed, nodes 
might have been rejected during the restart, which could lead to the situation 
you were seeing. It would be helpful to attach some logs for us to look at.

> The node information could become out of sync with the underlying cluster 
> resources
> -----------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-584
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-584
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Chaoran Yu
>            Priority: Critical
>             Fix For: 0.10
>
>
> There are cases when YK may think that the cluster doesn't have enough 
> resources even though that's not actually the case. This has happened twice 
> to me after running YK in a cluster for a few days: one day, the 
> [nodes endpoint|https://yunikorn.apache.org/docs/next/api/scheduler#nodes] 
> shows that the cluster only has one node (i.e. the node that YK itself is 
> running on), even though the K8s cluster has 10 nodes in total. And if I try 
> to schedule a workload that requires more resources than are available on 
> that node, YK will leave pods pending with an event like the one below:
> {quote}Normal  PodUnschedulable  41s   yunikorn  Task <namespace>/<pod> is 
> pending for the requested resources become available{quote}
> because it's not aware that other nodes in the cluster have available 
> resources.
> All of this can be fixed by simply restarting YK (scaling the replica count 
> down to 0 and then back up to 1). So it seems that a caching issue is the 
> cause, although the exact conditions that trigger this bug are not yet clear 
> to me.
> My environment is on AWS EKS with K8s 1.17, if that matters.
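For anyone hitting the same symptom, a rough sketch of the checks and workaround described above. The deployment name `yunikorn-scheduler` and namespace `yunikorn` are assumptions for illustration; adjust them to match your install. This requires a live cluster, so treat it as a sketch rather than a tested script:

```shell
# Check for "reject" node messages in the scheduler log
# (pod selector is an assumption; adjust to your deployment's labels).
kubectl -n yunikorn logs deployment/yunikorn-scheduler | grep -i reject

# Inspect the nodes the scheduler currently knows about via its REST
# endpoint (port-forward first; port 9080 is the default web port but
# may differ in your setup).
kubectl -n yunikorn port-forward deployment/yunikorn-scheduler 9080:9080 &
curl -s http://localhost:9080/ws/v1/nodes

# Workaround from the report: scale the scheduler down to 0 replicas
# and back up to 1 to force it to rebuild its node cache.
kubectl -n yunikorn scale deployment/yunikorn-scheduler --replicas=0
kubectl -n yunikorn scale deployment/yunikorn-scheduler --replicas=1
```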



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
