[ https://issues.apache.org/jira/browse/YUNIKORN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887048#comment-17887048 ]

Matthew Corley commented on YUNIKORN-2895:
------------------------------------------

I don't want to add noise, but we have seen issues on 1.5.2 and 1.6.0 that are 
different from each other but may be related.

On 1.5.2, we find that this healthcheck regularly fails:
- Consistency of data: "Check if a partition's allocated resource <= total 
resource of the partition" (a rough sketch of this kind of check follows below)
- we still aren't sure why, or how this relates to the healthcheck, but we see 
occasional abandoned pods – pending but unscheduled despite resources being 
available; fixed by either restarting the scheduler or recreating the pod.
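
For context, here is a minimal, hypothetical Go sketch of the kind of invariant that consistency check enforces (the Resource type and function names are assumptions for illustration, not the actual YuniKorn core code): every resource type allocated in a partition must stay within the partition's total capacity.
{noformat}
package main

import "fmt"

// Resource is a simplified stand-in for the scheduler's resource map.
type Resource map[string]int64

// fitsIn reports whether every quantity in allocated is <= the matching
// quantity in total; a resource type missing from total counts as 0.
func fitsIn(allocated, total Resource) bool {
	for name, qty := range allocated {
		if qty > total[name] {
			return false
		}
	}
	return true
}

func main() {
	total := Resource{"memory": 124843028480, "vcore": 31850, "pods": 110}
	allocated := Resource{"memory": 125000000000, "vcore": 16000, "pods": 1}
	if !fitsIn(allocated, total) {
		fmt.Println("health check failed: partition allocated > total resource")
	}
}
{noformat}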

On 1.6.1, we regularly encountered this failing healthcheck:
- Negative resources: "Check for negative resources in the nodes"
- we narrowed this down more specifically to a negative "Occupied" value for a 
non-GPU node:
{noformat}
          "nodeID": "gke-cr-west1-nap-n2-standard-32-spot--d75f2703-pp8c",
          "attributes": {
            "si.io/hostname": 
"gke-cr-west1-nap-n2-standard-32-spot--d75f2703-pp8c",
            "si.io/rackname": "/rack-default",
            "si/node-partition": "[mycluster]default"
          },
          "capacity": {
            "ephemeral-storage": 269082868728,
            "hugepages-1Gi": 0,
            "hugepages-2Mi": 0,
            "memory": 124843028480,
            "pods": 110,
            "vcore": 31850
          },
          "allocated": {
            "memory": 25769803776,
            "pods": 1,
            "vcore": 16000
          },
          "occupied": {
            "memory": -956301312,
            "pods": -1,
            "vcore": -600
          },
          "available": {
            "ephemeral-storage": 269082868728,
            "memory": 100029526016,
            "pods": 110,
            "vcore": 16450
          },
          "utilized": {
            "memory": 20,
            "pods": 0,
            "vcore": 50
          },
          "allocations": [
            {
              "allocationKey": "6daecfea-cf01-422f-bab2-94f6ca94c13a",
              "allocationTags": {
                "creationTime": "1727210054",
                "kubernetes.io/label/applicationId": "fc496c5366f0446c4972",
                "kubernetes.io/label/domain": "production",
                "kubernetes.io/label/execution-id": "fc496c5366f0446c4972",
                "kubernetes.io/label/interruptible": "true",
                "kubernetes.io/label/k8s.freenome.net/flyte-user": "troyce",
                "kubernetes.io/label/k8s.freenome.net/user": "troyce",
                "kubernetes.io/label/node-id": "n55",
                "kubernetes.io/label/owner": "comp",
                "kubernetes.io/label/project": "mint",
                "kubernetes.io/label/queue": 
"root.batch.production.mint-production",
                "kubernetes.io/label/shard-key": "21",
                "kubernetes.io/label/task-name": 
"mint-workflows-process-datasets-map-process-single-bam-2bd4e3a6",
                "kubernetes.io/label/workflow-name": 
"mint-create-feature-vectors-create-feature-vectors-workflow",
                "kubernetes.io/meta/namespace": "mint-production",
                "kubernetes.io/meta/podName": 
"fc496c5366f0446c4972-n1-0-dn0-0-dn1-0-n55-0"
              },
              "requestTime": 1727210054000000000,
              "allocationTime": 1727210054000000000,
              "resource": {
                "memory": 25769803776,
                "pods": 1,
                "vcore": 16000
              },
              "priority": "0",
              "nodeId": "gke-cr-west1-nap-n2-standard-32-spot--d75f2703-pp8c",
              "applicationId": "fc496c5366f0446c4972"
            }
          ],
          "schedulable": true,
          "isReserved": false
        },{noformat}
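
For what it's worth, one way an "occupied" tracker can end up like this, assuming the bookkeeping is a plain add/subtract on pod events with no clamping, is a subtract for a pod that was never added (or was already removed). A purely illustrative Go sketch with assumed names, not the actual node code:
{noformat}
package main

import "fmt"

type Resource map[string]int64

// sub subtracts the pod's resources from occupied without clamping at zero.
func sub(occupied, pod Resource) Resource {
	out := Resource{}
	for name, qty := range occupied {
		out[name] = qty - pod[name] // no clamping at zero
	}
	return out
}

func main() {
	occupied := Resource{"memory": 0, "pods": 0, "vcore": 0}
	pod := Resource{"memory": 956301312, "pods": 1, "vcore": 600}
	// Subtract a pod that was never added to "occupied" (e.g., a duplicate or
	// out-of-order remove event): the result matches the negative values above.
	occupied = sub(occupied, pod)
	fmt.Println(occupied) // map[memory:-956301312 pods:-1 vcore:-600]
}
{noformat}
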
In both cases, restarting the scheduler seems to fix things, but only temporarily.

Some of this duplicates context from Slack, but given the limited retention there 
I thought capturing it here might be useful.

We have reverted from 1.6.0 to 1.5.2, and are happy to provide any additional 
debug context if that's helpful; just let us know what you need.

 

> Don't add duplicated allocation to node when the allocation ask fails
> ---------------------------------------------------------------------
>
>                 Key: YUNIKORN-2895
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2895
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Qi Zhu
>            Assignee: Qi Zhu
>            Priority: Critical
>
> While revisiting the new update allocation logic, I found that a duplicated 
> allocation can be added to the node if the allocation is already allocated: 
> we try to add the allocation to the node again and don't revert it.
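
As an aside, here is a purely illustrative Go sketch (hypothetical names, not the actual core code) of the guard the issue description points at: adding an allocation to a node should reject a key that is already present, and a failed ask should be reverted by the caller instead of leaving a duplicate behind.
{noformat}
package main

import "fmt"

type Node struct {
	allocations map[string]struct{} // keyed by allocationKey
}

// addAllocation rejects duplicates so the same allocation can never be
// counted against the node twice.
func (n *Node) addAllocation(key string) bool {
	if _, ok := n.allocations[key]; ok {
		return false
	}
	n.allocations[key] = struct{}{}
	return true
}

// removeAllocation is the revert path for a failed ask.
func (n *Node) removeAllocation(key string) {
	delete(n.allocations, key)
}

func main() {
	n := &Node{allocations: map[string]struct{}{}}
	key := "6daecfea-cf01-422f-bab2-94f6ca94c13a"
	fmt.Println(n.addAllocation(key)) // true: first add succeeds
	fmt.Println(n.addAllocation(key)) // false: duplicate rejected, caller must revert
}
{noformat}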


