[ https://issues.apache.org/jira/browse/YUNIKORN-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888761#comment-17888761 ]

Craig Condit commented on YUNIKORN-2910:
----------------------------------------

I've started doing some log analysis of these. I haven't narrowed down the root 
cause yet, but this is interesting:
{quote}2024-10-10T22:36:37.882Z  INFO  shim.cache.external  external/scheduler_cache.go:311  Adding occupied resources to node  {"nodeID": "amp-dp-prod-spark-exec-yk-1-node-group-b74a85d-h77rt", "namespace": "spark-system", "podName": "spark-history-server-deployment-078u1pfr-579dbd4b6d-6p6fz", "occupied": "resources:{key:\"ephemeral-storage\" value:{value:5368709120}} resources:{key:\"memory\" value:{value:75161927680}} resources:{key:\"pods\" value:{value:1}} resources:{key:\"vcore\" value:{value:2000}}"}
2024-10-10T22:36:37.882Z  WARN  core.scheduler.node  objects/node.go:216  Node update triggered over allocated node  {"available": "map[ephemeral-storage:1386189349332 memory:-60014637056 pods:724 vcore:14200 vpc.amazonaws.com/pod-eni:107]", "total": "map[ephemeral-storage:1448466375124 hugepages-1Gi:0 hugepages-2Mi:0 memory:523482255360 pods:737 vcore:63770 vpc.amazonaws.com/pod-eni:107]", "occupied": "map[ephemeral-storage:5368709120 memory:75214356480 pods:6 vcore:2100]", "allocated": "map[ephemeral-storage:56908316672 memory:508282535936 pods:7 vcore:47470]"}
{quote}
This would seem to indicate a bug on our end, but the accounting is in fact 
correct. We receive an occupied resource update (for a non-YuniKorn pod) which 
blows past the node limits and over-allocates memory on the node by roughly 
60 GB (the logged available memory is -60014637056 bytes). Just prior to 
receiving that update, we had scheduled a batch of Spark executors on that 
node. Because the Spark history server is scheduled by a non-YuniKorn 
scheduler, we have a case where two schedulers both try to claim resources on 
the same node, and we over-allocate. There's no avoiding this given the async 
nature of communication with the API server. What's interesting is that this 
situation never gets resolved. My guess is that KWOK's fake nodes don't reject 
placements with OutOfMemory or OutOfCPU the way real nodes do. We don't see the 
allocations go away until the node is decommissioned later. In a real cluster, 
the pod rejections come back almost immediately.
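
To make the arithmetic explicit, here is a small standalone sketch (not 
YuniKorn code; the values are copied from the WARN log line above) that applies 
the same available = total - occupied - allocated calculation the node update 
performs:
{code:go}
package main

import "fmt"

func main() {
	// Values copied from the WARN log line above (memory in bytes,
	// vcore in millicores, pods as a count); storage omitted for brevity.
	total := map[string]int64{"memory": 523482255360, "vcore": 63770, "pods": 737}
	occupied := map[string]int64{"memory": 75214356480, "vcore": 2100, "pods": 6}
	allocated := map[string]int64{"memory": 508282535936, "vcore": 47470, "pods": 7}

	// available = total - occupied - allocated, per resource type
	for name, t := range total {
		fmt.Printf("%s available: %d\n", name, t-occupied[name]-allocated[name])
	}
	// memory comes out at -60014637056 (about -60 GB), matching the
	// "available" value in the WARN message.
}
{code}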

> Data corruption due to insufficient shim context locking
> --------------------------------------------------------
>
>                 Key: YUNIKORN-2910
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2910
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>    Affects Versions: 1.6.0
>            Reporter: Craig Condit
>            Assignee: Craig Condit
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.7.0, 1.6.1
>
>         Attachments: logs-1.6.0+2910, logs-1.6.0+2910+scale-down, 
> state-dump-1.6-context-locking-after-2.json, state-dump-after-1.5.2.json, 
> state-dump-after-1.6.0+2910.json
>
>
> We need to restore the context locking that was removed in YUNIKORN-2629. 
> Without it, multiple K8s events of different types may be processed in 
> parallel. Specifically, pod and node events being processed simultaneously is 
> not safe, and results in data corruption.
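
For illustration only, a minimal sketch of the locking pattern the description 
refers to; the type and field names here are made up for the example rather 
than taken from the shim:
{code:go}
package main

import (
	"fmt"
	"sync"
)

// Context is a stand-in for the shim context; the real type and its event
// handlers differ, this only illustrates the locking pattern.
type Context struct {
	lock         sync.RWMutex
	nodeOccupied map[string]int64    // shared state touched by node events
	podsByNode   map[string][]string // shared state touched by pod events
}

// updateNode and addPod both take the write lock, so a node event and a
// pod event can no longer mutate the shared maps concurrently.
func (c *Context) updateNode(nodeID string, occupied int64) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.nodeOccupied[nodeID] = occupied
}

func (c *Context) addPod(nodeID, podName string) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.podsByNode[nodeID] = append(c.podsByNode[nodeID], podName)
}

func main() {
	ctx := &Context{
		nodeOccupied: map[string]int64{},
		podsByNode:   map[string][]string{},
	}
	// A pod event and a node event arriving concurrently are serialized
	// by the shared lock instead of racing on the maps.
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); ctx.updateNode("node-1", 75214356480) }()
	go func() { defer wg.Done(); ctx.addPod("node-1", "spark-exec-1") }()
	wg.Wait()
	fmt.Println(ctx.nodeOccupied, ctx.podsByNode)
}
{code}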


