[
https://issues.apache.org/jira/browse/YUNIKORN-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kinga Marton reassigned YUNIKORN-584:
-------------------------------------
Assignee: Weiwei Yang
> App recovery is skipped when applicationID is not set in pods' label
> --------------------------------------------------------------------
>
> Key: YUNIKORN-584
> URL: https://issues.apache.org/jira/browse/YUNIKORN-584
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Chaoran Yu
> Assignee: Weiwei Yang
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.10
>
>
> There are cases where YK thinks that the cluster doesn't have enough
> resources even though that's not actually the case. This has happened to me
> twice: after YK had been running in a cluster for a few days, the
> [nodes endpoint|https://yunikorn.apache.org/docs/next/api/scheduler#nodes]
> showed that the cluster had only one node (i.e. the node that YK itself is
> running on), even though the K8s cluster has 10 nodes in total. If I then try
> to schedule a workload that requires more resources than are available on
> that node, YK leaves the pods pending with an event like the one below:
> {quote}Normal PodUnschedulable 41s yunikorn Task <namespace>/<pod> is
> pending for the requested resources become available{quote}
> because it's not aware that other nodes in the cluster have available
> resources.
> All of this can be fixed by restarting YK (scaling the replicas down to 0
> and then back up to 1). So it seems that a problem with the cache is the
> cause, although the exact conditions that trigger this bug are not yet clear
> to me.
> My environment is on AWS EKS with K8s 1.17, if that matters.