[ https://issues.apache.org/jira/browse/YUNIKORN-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304462#comment-17304462 ]
Chaoran Yu commented on YUNIKORN-584:
-------------------------------------
Talked to Weiwei offline. From the logs I saw 9 of these messages:
bq. INFO scheduler/context.go:603 Failed to add node to partition
(rejected) {"nodeID": "ip-172-19-195-211.us-west-2.compute.internal",
"partitionName": "[mycluster]default", "error": "failed to find application
spark-0149c2a5c5f943d49f5b68e634972fe0"}
where each message corresponds to one of the missing nodes (as I described
above, my cluster has 10 nodes and YK saw only one when the bug happened).
For additional context, I've been running Spark jobs that use gang scheduling
with placeholder pods.
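To make the failure mode I suspect concrete, here is a minimal, purely
hypothetical Go sketch; the types and names below are my own illustration,
not YuniKorn's actual scheduler/context.go code. The idea: if a registering
node reports an existing allocation whose application ID the core no longer
tracks, the whole node is rejected, matching the log above.
{code:go}
package main

import "fmt"

// Hypothetical, simplified model of the suspected failure mode. These types
// and names are illustrative assumptions, not YuniKorn's implementation.
type allocation struct {
	applicationID string
}

type partition struct {
	applications map[string]bool         // application IDs the core knows about
	nodes        map[string][]allocation // registered nodes and their allocations
}

// addNode sketches the suspected behavior: a (re-)registering node reports
// the allocations already running on it, and if any of them references an
// application the partition no longer tracks, the whole node is rejected.
func (p *partition) addNode(nodeID string, existing []allocation) error {
	for _, alloc := range existing {
		if !p.applications[alloc.applicationID] {
			return fmt.Errorf("failed to find application %s", alloc.applicationID)
		}
	}
	p.nodes[nodeID] = existing
	return nil
}

func main() {
	p := &partition{
		applications: map[string]bool{},
		nodes:        map[string][]allocation{},
	}
	// A node carrying an allocation from a forgotten app (e.g. a finished
	// gang job's placeholder pod) never makes it into the partition, so the
	// scheduler under-counts cluster resources until it is restarted.
	err := p.addNode("ip-172-19-195-211.us-west-2.compute.internal",
		[]allocation{{applicationID: "spark-0149c2a5c5f943d49f5b68e634972fe0"}})
	if err != nil {
		fmt.Println("node rejected:", err) // corresponds to the log above
	}
}
{code}
If stale placeholder-pod allocations from finished gang jobs behave like
this, that would explain why exactly the nodes that ran them disappear.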
> The node information could become out of sync with the underlying cluster
> resources
> -----------------------------------------------------------------------------------
>
> Key: YUNIKORN-584
> URL: https://issues.apache.org/jira/browse/YUNIKORN-584
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Reporter: Chaoran Yu
> Priority: Critical
> Fix For: 0.10
>
>
> There are cases in which YK thinks that the cluster doesn't have enough
> resources even though that's not actually the case. This has happened to me
> twice: after YK had been running in a cluster for a few days, the [nodes
> endpoint|https://yunikorn.apache.org/docs/next/api/scheduler#nodes] suddenly
> showed that the cluster had only one node (i.e. the node that YK itself was
> running on), even though the K8s cluster has 10 nodes in total. And if I try
> to schedule a workload that requires more resources than are available on
> that node, YK leaves the pods pending with an event like the one below:
> {quote}Normal PodUnschedulable 41s yunikorn Task <namespace>/<pod> is
> pending for the requested resources become available{quote}
> because it's not aware that other nodes in the cluster have available
> resources.
> All of this can be fixed by simply restarting YK (scaling the replica down
> to 0 and then back up to 1); a quick way to check for the desync is sketched
> below. So it seems that a caching issue is the cause, although the exact
> conditions that trigger this bug are not yet clear to me.
> My environment is on AWS EKS with K8s 1.17, if that matters.
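For anyone who wants to detect the desync before pods start pending, below is
a rough sketch. Assumptions on my part, not verified against this ticket: the
scheduler web service is port-forwarded to localhost:9080, the nodes endpoint
lives at /ws/v1/nodes for this YK version, and each node entry in the JSON
response carries a "nodeID" field.
{code:go}
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Crude desync check: count node entries reported by YuniKorn's REST API and
// compare with the node count we expect from Kubernetes. The endpoint path
// and the "nodeID" field are assumptions; adjust for your YK version.
func main() {
	const expectedNodes = 10 // e.g. from `kubectl get nodes --no-headers | wc -l`

	resp, err := http.Get("http://localhost:9080/ws/v1/nodes")
	if err != nil {
		fmt.Println("failed to reach scheduler:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("failed to read response:", err)
		return
	}

	// Counting "nodeID" occurrences avoids hard-coding the exact response
	// schema; this is a heuristic, not a parser.
	seen := strings.Count(string(body), `"nodeID"`)
	if seen < expectedNodes {
		fmt.Printf("possible desync: YK reports %d node(s), expected %d\n",
			seen, expectedNodes)
	} else {
		fmt.Printf("OK: YK reports %d node(s)\n", seen)
	}
}
{code}
If the reported count drops below the cluster's actual node count (from
{{kubectl get nodes}}), the restart workaround described above restores the
full node list.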