[ https://issues.apache.org/jira/browse/YUNIKORN-2895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887048#comment-17887048 ]
Matthew Corley commented on YUNIKORN-2895:
------------------------------------------
I don't want to add noise, but we have seen issues on 1.5.2 and 1.6.0 that are
different from each other but may be related.
On 1.5.2, we regularly see this health check failing:
- Consistency of data: "Check if a partition's allocated resource <= total
resource of the partition"
- we still aren't sure why, or whether this relates to the health check, but we
also see occasional abandoned pods: pending and unscheduled despite resources
being available, fixed only by restarting the scheduler or recreating the pod.
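For reference, here is a minimal sketch of how the failing checks could be pulled
from the scheduler REST API; it assumes the default yunikorn-core web port (9080),
the /ws/v1/scheduler/healthcheck endpoint, and the usual response field names, so
adjust for your deployment:
{noformat}
# Sketch: list failing YuniKorn health checks (assumed endpoint and fields).
import json
import urllib.request

SCHEDULER = "http://localhost:9080"  # adjust to your scheduler address

with urllib.request.urlopen(f"{SCHEDULER}/ws/v1/scheduler/healthcheck") as resp:
    report = json.load(resp)

# Print only the checks that did not succeed, with their diagnosis message.
for check in report.get("HealthChecks", []):
    if not check.get("Succeeded", True):
        print(check.get("Name"), "-", check.get("DiagnosisMessage"))
{noformat}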
On 1.6.1, we regularly encountered:
- Negative resources: "Check for negative resources in the nodes"
- we narrowed this down to a negative "occupied" value on a non-GPU node:
{noformat}
"nodeID": "gke-cr-west1-nap-n2-standard-32-spot--d75f2703-pp8c",
"attributes": {
    "si.io/hostname": "gke-cr-west1-nap-n2-standard-32-spot--d75f2703-pp8c",
    "si.io/rackname": "/rack-default",
    "si/node-partition": "[mycluster]default"
},
"capacity": {
    "ephemeral-storage": 269082868728,
    "hugepages-1Gi": 0,
    "hugepages-2Mi": 0,
    "memory": 124843028480,
    "pods": 110,
    "vcore": 31850
},
"allocated": {
    "memory": 25769803776,
    "pods": 1,
    "vcore": 16000
},
"occupied": {
    "memory": -956301312,
    "pods": -1,
    "vcore": -600
},
"available": {
    "ephemeral-storage": 269082868728,
    "memory": 100029526016,
    "pods": 110,
    "vcore": 16450
},
"utilized": {
    "memory": 20,
    "pods": 0,
    "vcore": 50
},
"allocations": [
    {
        "allocationKey": "6daecfea-cf01-422f-bab2-94f6ca94c13a",
        "allocationTags": {
            "creationTime": "1727210054",
            "kubernetes.io/label/applicationId": "fc496c5366f0446c4972",
            "kubernetes.io/label/domain": "production",
            "kubernetes.io/label/execution-id": "fc496c5366f0446c4972",
            "kubernetes.io/label/interruptible": "true",
            "kubernetes.io/label/k8s.freenome.net/flyte-user": "troyce",
            "kubernetes.io/label/k8s.freenome.net/user": "troyce",
            "kubernetes.io/label/node-id": "n55",
            "kubernetes.io/label/owner": "comp",
            "kubernetes.io/label/project": "mint",
            "kubernetes.io/label/queue": "root.batch.production.mint-production",
            "kubernetes.io/label/shard-key": "21",
            "kubernetes.io/label/task-name": "mint-workflows-process-datasets-map-process-single-bam-2bd4e3a6",
            "kubernetes.io/label/workflow-name": "mint-create-feature-vectors-create-feature-vectors-workflow",
            "kubernetes.io/meta/namespace": "mint-production",
            "kubernetes.io/meta/podName": "fc496c5366f0446c4972-n1-0-dn0-0-dn1-0-n55-0"
        },
        "requestTime": 1727210054000000000,
        "allocationTime": 1727210054000000000,
        "resource": {
            "memory": 25769803776,
            "pods": 1,
            "vcore": 16000
        },
        "priority": "0",
        "nodeId": "gke-cr-west1-nap-n2-standard-32-spot--d75f2703-pp8c",
        "applicationId": "fc496c5366f0446c4972"
    }
],
"schedulable": true,
"isReserved": false
},
{noformat}
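For anyone trying to spot the same thing on their cluster, a minimal sketch of
scanning all nodes for negative resource values via the REST API; it assumes the
default web port 9080, the /ws/v1/partition/{partition}/nodes endpoint, and the
"default" partition name, so adjust as needed:
{noformat}
# Sketch: flag nodes that report negative resource values (assumed endpoint).
import json
import urllib.request

SCHEDULER = "http://localhost:9080"  # adjust to your scheduler address
PARTITION = "default"                # assumption: default partition name

with urllib.request.urlopen(f"{SCHEDULER}/ws/v1/partition/{PARTITION}/nodes") as resp:
    nodes = json.load(resp)

# Check the resource maps that should never go below zero.
for node in nodes:
    for section in ("allocated", "occupied", "available"):
        negative = {k: v for k, v in (node.get(section) or {}).items() if v < 0}
        if negative:
            print(node.get("nodeID"), section, negative)
{noformat}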
In both cases restarting the scheduler seems to fix things, but only temporarily.
Some of this duplicates context from Slack, but given the limited retention there
I thought capturing it here might be useful.
We have reverted from 1.6.0 to 1.5.2 and are happy to provide any additional
debug context if that's helpful; just let us know what you need.
> Don't add duplicated allocation to node when the allocation ask fails
> ---------------------------------------------------------------------
>
> Key: YUNIKORN-2895
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2895
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Critical
>
> When I revisited the new update allocation logic, I found that a duplicated
> allocation can be added to a node if the allocation is already allocated: we
> try to add the allocation to the node again and do not revert it.