[ https://issues.apache.org/jira/browse/YUNIKORN-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304147#comment-17304147 ]

Chaoran Yu commented on YUNIKORN-584:
-------------------------------------

[~wilfreds] Thanks for the reply. Yeah, I looked at the code and saw the same 
thing as you described: node information in YK should only be updated when the 
node informer detects a change. That's why it's baffling to me. Before 
restarting YK, I took a look at the logs and didn't see anything suspicious, 
but maybe I missed something. Next time I see it, I'll make a dump of all the 
logs.
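
In the meantime, one way to catch this drift early is to periodically compare the node names returned by YK's nodes endpoint against the nodes the K8s API reports. A minimal sketch of just the comparison step (the fetching itself, via the REST API and kubectl or client-go, is left out; the function name and inputs are hypothetical):

```python
def find_node_drift(yk_nodes, k8s_nodes):
    """Compare the node names the scheduler knows about with the
    node names the Kubernetes API reports.

    yk_nodes:  iterable of node-name strings from YK's nodes endpoint
    k8s_nodes: iterable of node-name strings from the K8s API

    Returns (missing, stale): nodes the scheduler has lost track of,
    and nodes it still tracks that no longer exist in the cluster.
    """
    yk, k8s = set(yk_nodes), set(k8s_nodes)
    missing = sorted(k8s - yk)  # in the cluster, unknown to YK
    stale = sorted(yk - k8s)    # tracked by YK, gone from the cluster
    return missing, stale


# Example matching the symptom described here: YK only sees its own
# node while the cluster actually has more.
missing, stale = find_node_drift(
    ["ip-10-0-1-5"],
    ["ip-10-0-1-5", "ip-10-0-2-7", "ip-10-0-3-9"],
)
```

A non-empty `missing` list for more than a brief window would be the signal to grab the logs before restarting.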

> The node information could become out of sync with the underlying cluster 
> resources
> -----------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-584
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-584
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Chaoran Yu
>            Priority: Critical
>             Fix For: 0.10
>
>
> There are cases when YK may think that the cluster doesn't have enough 
> resources even though that's not actually the case. This has happened twice 
> to me after running YK in a cluster for a few days: one day, the 
> [nodes endpoint|https://yunikorn.apache.org/docs/next/api/scheduler#nodes] 
> shows that the cluster only has one node (i.e. the node that YK itself is 
> running on), even though the K8s cluster has 10 nodes in total. If I then try 
> to schedule a workload that requires more resources than are available on that 
> node, YK will leave the pods pending with an event like the one below:
> {quote}Normal  PodUnschedulable  41s   yunikorn  Task <namespace>/<pod> is 
> pending for the requested resources become available{quote}
> because it's not aware that other nodes in the cluster have available 
> resources.
> All of this can be fixed by simply restarting YK (scaling the replicas down 
> to 0 and then back up to 1). So it seems that a caching issue is the cause, 
> although it's not yet clear to me what exact conditions trigger this bug.
> My environment is AWS EKS with K8s 1.17, if that matters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
