[jira] [Commented] (YUNIKORN-1173) Basic scheduling fails on an existing cluster

Craig Condit (Jira) Wed, 06 Apr 2022 18:41:07 -0700


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518537#comment-17518537
 ]


Craig Condit commented on YUNIKORN-1173:
----------------------------------------

The first thing that stands out is that this appears to be a very large pod (32 
GB RAM), but the queue capacities are all very small. You might need to check 
your queue configuration: the base unit for memory has changed from 1,000,000 
bytes in older versions to single bytes in newer versions, and CPU cores are 
now stored internally as millicores. This was done to avoid mismatches in how 
resource consumption is interpreted by K8s vs. YuniKorn, but it is a breaking 
change.

The queue configuration has been updated to support K8s-style suffixes as well.

From [https://yunikorn.apache.org/docs/next/user_guide/queue_config#resources]:
{noformat}
An optional suffix may be specified for resource quantities. Valid suffixes are 
k, M, G, T, P, and E for SI powers of 10, and Ki, Mi, Gi, Ti, Pi, and Ei for SI 
powers of 2. Additionally, resources of type vcore may have a suffix of m to 
indicate millicores. For example, 500m is 50% of a vcore. Units of type memory 
are interpreted in bytes by default. All other resource types have no 
designated base unit.

Note that this is a behavioral change as of YuniKorn 1.0. Prior versions 
interpreted memory as units of 1000000 (1 million) bytes and vcore as 
millicores.
{noformat}
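
As a rough illustration of the suffix rules quoted above, here is a minimal 
Python sketch. The function name parse_quantity and its behaviour are 
assumptions made for this example; it is not YuniKorn's actual parser, and it 
only handles whole-number quantities. Memory is returned in bytes and vcore in 
millicores.

```python
import re

# Suffix multipliers as listed in the quoted documentation.
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9,
           "T": 10**12, "P": 10**15, "E": 10**18}
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
          "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}

# Assumption: integer quantities only, followed by an optional suffix.
_QUANTITY = re.compile(r"^(\d+)([A-Za-z]*)$")


def parse_quantity(text, resource_type):
    """Hypothetical helper: memory -> bytes, vcore -> millicores."""
    match = _QUANTITY.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse quantity: {text!r}")
    number, suffix = int(match.group(1)), match.group(2)

    if resource_type == "vcore" and suffix == "m":
        return number  # already expressed in millicores

    if suffix == "":
        multiplier = 1
    elif suffix in DECIMAL:
        multiplier = DECIMAL[suffix]
    elif suffix in BINARY:
        multiplier = BINARY[suffix]
    else:
        raise ValueError(f"unknown suffix {suffix!r} for {resource_type}")

    # Whole vcores are scaled to millicores; other types keep their unit.
    scale = 1000 if resource_type == "vcore" else 1
    return number * multiplier * scale


print(parse_quantity("32G", "memory"))    # 32000000000
print(parse_quantity("32000", "memory"))  # 32000 (plain bytes)
print(parse_quantity("500m", "vcore"))    # 500
```
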
I suspect your queue configuration is undersizing RAM by a factor of 1 million 
or so.
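
To make the factor concrete, here is a small Python sketch of the arithmetic. 
The queue value 32000 is a made-up example (the actual values from the 
attached state dump are not reproduced here), and legacy_memory_to_bytes is a 
hypothetical helper, not part of YuniKorn.

```python
# Older releases treated a bare memory value as units of 1,000,000 bytes.
LEGACY_MEMORY_UNIT = 1_000_000


def legacy_memory_to_bytes(value):
    """Hypothetical helper: bytes that a pre-1.0 config value stood for."""
    return value * LEGACY_MEMORY_UNIT


legacy_value = 32_000                            # made-up legacy queue limit
intended = legacy_memory_to_bytes(legacy_value)  # what the author meant
as_read_now = legacy_value                       # bare value now means bytes
pod_request = 32 * 10**9                         # assumed 32 GB pod request

print(intended // as_read_now)     # 1000000
print(pod_request <= as_read_now)  # False: the pod cannot fit in the queue
print(pod_request <= intended)     # True
print(f"{legacy_value}M")          # 32000M keeps the old meaning via a suffix
```

Under these assumptions, appending the M suffix to each legacy memory value 
would preserve its original size.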

> Basic scheduling fails on an existing cluster
> ---------------------------------------------
>
>                 Key: YUNIKORN-1173
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1173
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Chaoran Yu
>            Priority: Blocker
>         Attachments: logs.txt, statedump.txt
>
>
> Environment: EKS K8s 1.20. 
> K8shim built based on commit: 
> [https://github.com/apache/yunikorn-k8shim/commit/be3bb70d9757b27d0c40d446306b928c79c80a9f]
> Core version used: v0.0.0-20220325135453-73d55282f052
> After YuniKorn is deployed, I deleted one of the pods managed by K8s 
> deployment, but YK didn't schedule the new pod that's created: 
> *spo-og60-03-spark-operator-86cc7ff747-9vzxl* 
> is the name of the new pod. It's stuck in pending and its event said 
> "spark-operator/spo-og60-03-spark-operator-86cc7ff747-9vzxl is queued and 
> waiting for allocation"
> State dump and scheduler logs are attached



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
