[
https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888761#comment-17888761
]
Craig Condit commented on YUNIKORN-2910:
----------------------------------------
I've started doing some log analysis of these. I haven't narrowed down root
cause yet, but this is interesting:
{quote}2024-10-10T22:36:37.882Z INFO shim.cache.external external/scheduler_cache.go:311 Adding occupied resources to node {"nodeID": "amp-dp-prod-spark-exec-yk-1-node-group-b74a85d-h77rt", "namespace": "spark-system", "podName": "spark-history-server-deployment-078u1pfr-579dbd4b6d-6p6fz", "occupied": "resources:{key:\"ephemeral-storage\" value:{value:5368709120}} resources:{key:\"memory\" value:{value:75161927680}} resources:{key:\"pods\" value:{value:1}} resources:{key:\"vcore\" value:{value:2000}}"}
2024-10-10T22:36:37.882Z WARN core.scheduler.node objects/node.go:216 Node update triggered over allocated node {"available": "map[ephemeral-storage:1386189349332 memory:-60014637056 pods:724 vcore:14200 vpc.amazonaws.com/pod-eni:107]", "total": "map[ephemeral-storage:1448466375124 hugepages-1Gi:0 hugepages-2Mi:0 memory:523482255360 pods:737 vcore:63770 vpc.amazonaws.com/pod-eni:107]", "occupied": "map[ephemeral-storage:5368709120 memory:75214356480 pods:6 vcore:2100]", "allocated": "map[ephemeral-storage:56908316672 memory:508282535936 pods:7 vcore:47470]"}
{quote}
This would seem to indicate a bug on our end, but in fact it's correct. We
receive an occupied resource update (for a non-YuniKorn pod) that blows past
the node limits and over-allocates memory on the node by ~60 GB. Just prior to
receiving that, we schedule a bunch of Spark executors on that node. Because
the Spark history server is scheduled by a non-YuniKorn scheduler, we have a
case where two schedulers both try to claim resources on the same node, and we
over-allocate. There's no avoiding this due to the async nature of
communication with the API server. What's interesting is that this situation
never gets resolved. My guess is that KWOK's fake nodes don't reject placements
with OutOfMemory or OutOfCPU like normal nodes do. We don't see the allocations
go away until the node is decommissioned later. In a real cluster, the pod
rejections come back almost immediately.
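The deficit follows directly from the numbers in the WARN line. Here is a standalone sketch of the same arithmetic (values copied from the log; this is illustrative only, not YuniKorn code):
{code:go}
package main

import "fmt"

func main() {
	// Memory values (in bytes) taken from the "Node update triggered over
	// allocated node" WARN line above.
	const (
		totalMemory     int64 = 523482255360 // node capacity reported by K8s
		allocatedMemory int64 = 508282535936 // YuniKorn-scheduled allocations (Spark executors)
		occupiedMemory  int64 = 75214356480  // non-YuniKorn pods (Spark history server, etc.)
	)

	// available = total - allocated - occupied
	available := totalMemory - allocatedMemory - occupiedMemory
	fmt.Printf("available memory: %d bytes (~%.1f GiB)\n",
		available, float64(available)/(1<<30))
	// Output: available memory: -60014637056 bytes (~-55.9 GiB)
}
{code}
In other words, the occupied update alone exceeds what the node had left after the executor allocations, so the core correctly logs the over-allocation rather than rejecting anything itself.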
> Data corruption due to insufficient shim context locking
> --------------------------------------------------------
>
> Key: YUNIKORN-2910
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2910
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: shim - kubernetes
> Affects Versions: 1.6.0
> Reporter: Craig Condit
> Assignee: Craig Condit
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.7.0, 1.6.1
>
> Attachments: logs-1.6.0+2910, logs-1.6.0+2910+scale-down,
> state-dump-1.6-context-locking-after-2.json, state-dump-after-1.5.2.json,
> state-dump-after-1.6.0+2910.json
>
>
> We need to restore the context locking that was removed in YUNIKORN-2629.
> Without it, multiple K8s events of different types may be processed in
> parallel. Specifically, pod and node events being processed simultaneously is
> not safe, and results in data corruption.
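> A minimal sketch of the kind of serialization this refers to (illustrative only; the type and method names here are assumptions, not the actual shim code): guard the shared context with a single lock so pod and node event handlers cannot mutate the shared state concurrently.
> {code:go}
> package cache
>
> import "sync"
>
> // Hypothetical stand-ins for the objects tracked by the shim context.
> type Pod struct{ UID string }
> type Node struct{ Name string }
>
> // Context is a stand-in for the shim's shared scheduler state.
> // Only the locking pattern matters here.
> type Context struct {
> 	lock  sync.RWMutex
> 	pods  map[string]*Pod
> 	nodes map[string]*Node
> }
>
> // AddPod and UpdateNode take the same write lock, so a pod event and a
> // node event can never update the shared maps at the same time.
> func (c *Context) AddPod(p *Pod) {
> 	c.lock.Lock()
> 	defer c.lock.Unlock()
> 	c.pods[p.UID] = p
> }
>
> func (c *Context) UpdateNode(n *Node) {
> 	c.lock.Lock()
> 	defer c.lock.Unlock()
> 	c.nodes[n.Name] = n
> }
> {code}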