[
https://issues.apache.org/jira/browse/YUNIKORN-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Craig Condit resolved YUNIKORN-1615.
------------------------------------
Fix Version/s: 1.5.0
Resolution: Delivered
Resolved as part of YUNIKORN-2180.
> Node occupied resource is negative
> ----------------------------------
>
> Key: YUNIKORN-1615
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1615
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Affects Versions: 1.1.0, 1.2.0, 1.3.0, 1.4.0
> Environment: Kubernetes 1.20.8
> Reporter: Jie Ke
> Assignee: Craig Condit
> Priority: Major
> Labels: blocked, pull-request-available
> Fix For: 1.5.0
>
> Attachments: fullstatedump.json, image-2023-03-02-11-23-34-052.png,
> image-2023-03-02-11-25-14-484.png
>
>
> After some tasks complete, the YuniKorn scheduler reports negative occupied
> resources for nodes, which throws scheduling into chaos. Restarting the
> scheduler does not help: it eventually reports negative resources again once
> some tasks complete. The YuniKorn scheduler log contains the following entries:
> {code:java}
> 2023-03-01T18:10:40.038Z INFO cache/nodes.go:140 report
> occupied resources updates {"node": "172.18.45.234", "request":
> {"nodes":[{"nodeID":"172.18.45.234","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131376640},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-10126244160},"vcore":{"value":-9700}}}}],"rmID":"k8s_dios"}}
> 2023-03-01T18:10:44.635Z INFO cache/nodes.go:140 report
> occupied resources updates {"node": "172.18.45.228", "request":
> {"nodes":[{"nodeID":"172.18.45.228","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":249175645796},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":269682475008},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-10314987840},"vcore":{"value":-9400}}}}],"rmID":"k8s_dios"}}
> 2023-03-01T18:10:44.870Z INFO cache/nodes.go:140 report
> occupied resources updates {"node": "172.18.45.230", "request":
> {"nodes":[{"nodeID":"172.18.45.230","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":249175645796},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":269682475008},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-8829204224},"vcore":{"value":-8500}}}}],"rmID":"k8s_dios"}}
> 2023-03-01T18:10:49.279Z INFO cache/nodes.go:140 report
> occupied resources updates {"node": "172.18.45.235", "request":
> {"nodes":[{"nodeID":"172.18.45.235","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131372544},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-8504048512},"vcore":{"value":-7800}}}}],"rmID":"k8s_dios"}}
> 2023-03-01T18:15:42.686Z INFO cache/nodes.go:140 report
> occupied resources updates {"node": "172.18.45.230", "request":
> {"nodes":[{"nodeID":"172.18.45.230","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":249175645796},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":269682475008},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-9902946048},"vcore":{"value":-9500}}}}],"rmID":"k8s_dios"}}
> 2023-03-01T18:15:43.857Z INFO cache/nodes.go:140 report
> occupied resources updates {"node": "172.18.45.234", "request":
> {"nodes":[{"nodeID":"172.18.45.234","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131376640},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-11199985984},"vcore":{"value":-10700}}}}],"rmID":"k8s_dios"}}
> 2023-03-01T18:15:49.229Z INFO cache/nodes.go:140 report
> occupied resources updates {"node": "172.18.45.235", "request":
> {"nodes":[{"nodeID":"172.18.45.235","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":520021631754},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":201131372544},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-9577790336},"vcore":{"value":-8800}}}}],"rmID":"k8s_dios"}}
> 2023-03-01T18:15:54.457Z INFO cache/nodes.go:140 report
> occupied resources updates {"node": "172.18.45.228", "request":
> {"nodes":[{"nodeID":"172.18.45.228","action":2,"attributes":{"ready":"true"},"schedulableResource":{"resources":{"ephemeral-storage":{"value":249175645796},"hugepages-1Gi":{},"hugepages-2Mi":{},"memory":{"value":269682475008},"pods":{"value":110},"vcore":{"value":40000}}},"occupiedResource":{"resources":{"memory":{"value":-11388729664},"vcore":{"value":-10400}}}}],"rmID":"k8s_dios"}}{code}
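For anyone triaging similar logs, here is a minimal sketch of how the "request" payloads above could be scanned for negative occupied values. The payload is a trimmed copy of the first log entry; the function name `negative_occupied` is hypothetical and not part of YuniKorn.

```python
import json

# Trimmed copy of the "request" payload from the first log entry above.
payload = json.loads("""
{"nodes":[{"nodeID":"172.18.45.234","action":2,
  "schedulableResource":{"resources":{"memory":{"value":201131376640},"vcore":{"value":40000}}},
  "occupiedResource":{"resources":{"memory":{"value":-10126244160},"vcore":{"value":-9700}}}}],
 "rmID":"k8s_dios"}
""")


def negative_occupied(request):
    """Return {nodeID: {resource: value}} for every negative occupied value.

    Hypothetical helper for log triage, not a YuniKorn API.
    """
    found = {}
    for node in request.get("nodes", []):
        resources = node.get("occupiedResource", {}).get("resources", {})
        # An empty quantity such as {} (as logged for hugepages-1Gi) counts as 0.
        negative = {
            name: quantity.get("value", 0)
            for name, quantity in resources.items()
            if quantity.get("value", 0) < 0
        }
        if negative:
            found[node["nodeID"]] = negative
    return found


print(negative_occupied(payload))
```

Running the same helper over each logged request would list every node whose occupied memory or vcore has dropped below zero.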
> YuniKorn UI
> !image-2023-03-02-11-23-34-052.png!
>
> Health Check Result & Log
> !image-2023-03-02-11-25-14-484.png!
>
> {code:java}
> 2023-03-02T03:25:52.310Z WARN scheduler/health_checker.go:176
> Scheduler is not healthy {"health check values": [{"Name":"Scheduling
> errors","Succeeded":true,"Description":"Check for scheduling error entries in
> metrics","DiagnosisMessage":"There were 0 scheduling errors logged in the
> metrics"},{"Name":"Failed nodes","Succeeded":true,"Description":"Check for
> failed nodes entries in metrics","DiagnosisMessage":"There were 0 failed
> nodes logged in the metrics"},{"Name":"Negative
> resources","Succeeded":true,"Description":"Check for negative resources in
> the partitions","DiagnosisMessage":"Partitions with negative resources:
> []"},{"Name":"Negative resources","Succeeded":false,"Description":"Check for
> negative resources in the nodes","DiagnosisMessage":"Nodes with negative
> resources: [\"172.18.45.228\" \"172.18.45.235\" \"172.18.45.234\"
> \"172.18.45.230\"]"},{"Name":"Consistency of
> data","Succeeded":true,"Description":"Check if a node's allocated resource <=
> total resource of the node","DiagnosisMessage":"Nodes with inconsistent data:
> []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if
> total partition resource == sum of the node resources from the
> partition","DiagnosisMessage":"Partitions with inconsistent data:
> []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if
> node total resource = allocated resource + occupied resource + available
> resource","DiagnosisMessage":"Nodes with inconsistent data:
> []"},{"Name":"Consistency of data","Succeeded":true,"Description":"Check if
> node capacity >= allocated resources on the node","DiagnosisMessage":"Nodes
> with inconsistent data: []"},{"Name":"Reservation
> check","Succeeded":true,"Description":"Check the reservation nr compared to
> the number of nodes","DiagnosisMessage":"Reservation/node nr ratio:
> [0.000000]"},{"Name":"Orphan allocation on node
> check","Succeeded":true,"Description":"Check if there are orphan allocations
> on the nodes","DiagnosisMessage":"Orphan allocations: []"},{"Name":"Orphan
> allocation on app check","Succeeded":true,"Description":"Check if there are
> orphan allocations on the
> applications","DiagnosisMessage":"OrphanAllocations: []"}]} {code}
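The health check output above is easier to read once filtered down to the failing entries. A minimal sketch, assuming the "health check values" array has been extracted from the log line as JSON; only two of the logged entries are reproduced here, and `failed_checks` is a hypothetical helper name.

```python
import json

# Two entries copied from the "health check values" array in the log above.
checks = json.loads(r"""
[{"Name":"Negative resources","Succeeded":true,
  "Description":"Check for negative resources in the partitions",
  "DiagnosisMessage":"Partitions with negative resources: []"},
 {"Name":"Negative resources","Succeeded":false,
  "Description":"Check for negative resources in the nodes",
  "DiagnosisMessage":"Nodes with negative resources: [\"172.18.45.228\" \"172.18.45.235\" \"172.18.45.234\" \"172.18.45.230\"]"}]
""")


def failed_checks(values):
    """Return (Name, Description) for every check whose Succeeded flag is false."""
    return [(c["Name"], c["Description"]) for c in values if not c["Succeeded"]]


for name, description in failed_checks(checks):
    print(f"{name}: {description}")
```

In the full output above, the node-level negative resources check is the only entry with "Succeeded":false; the partition-level and consistency checks all pass.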
>
>
> Kubernetes version
> Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.8",
> GitCommit:"5575935422cc1cf5169dfc8847cb587aa47bac5a", GitTreeState:"clean",
> BuildDate:"2021-06-16T12:53:07Z", GoVersion:"go1.15.13", Compiler:"gc",
> Platform:"linux/amd64"}