[ 
https://issues.apache.org/jira/browse/YUNIKORN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17838404#comment-17838404
 ] 

Shravan Achar commented on YUNIKORN-2526:
-----------------------------------------

Another occurrence of this was seen. The scheduler happened to restart while some 
large applications were being torn down. The scheduler is now busy trying to 
process non-existent pod UIDs, while all new pods are stuck in Pending. The 
state dump is attached.

Logs of this nature accumulate over time:
{quote}
{{2024-04-17T21:25:34.641Z DEBUG core.scheduler.application objects/application.go:1442 app reservation check \{"allocationKey": "49f01ed0-3473-4521-b11f-80e27adb7250", "createTime": "2024-04-17T21:15:53.175Z", "askAge": "9m41.465257923s", "reservationDelay": "2s"}}}
{{2024-04-17T21:25:34.641Z DEBUG core.scheduler.node objects/node.go:403 running predicates failed \{"allocationKey": "49f01ed0-3473-4521-b11f-80e27adb7250", "nodeID": "kwok-node-4gf2g", "allocateFlag": true, "error": "predicates were not running because pod or node was not found in cache"}}}
{{2024-04-17T21:25:34.641Z DEBUG core.scheduler.application objects/application.go:1442 app reservation check \{"allocationKey": "49f01ed0-3473-4521-b11f-80e27adb7250", "createTime": "2024-04-17T21:15:53.175Z", "askAge": "9m41.46531035s", "reservationDelay": "2s"}}}
{{2024-04-17T21:25:34.641Z DEBUG core.scheduler.node objects/node.go:403 running predicates failed \{"allocationKey": "49f01ed0-3473-4521-b11f-80e27adb7250", "nodeID": "kwok-node-4hmxt", "allocateFlag": true, "error": "predicates were not running because pod or node was not found in cache"}}}
{{2024-04-17T21:25:34.641Z DEBUG core.scheduler.application objects/application.go:1442 app reservation check \{"allocationKey": "49f01ed0-3473-4521-b11f-80e27adb7250", "createTime": "2024-04-17T21:15:53.175Z", "askAge": "9m41.465362234s", "reservationDelay": "2s"}}}
{{2024-04-17T21:25:34.641Z DEBUG core.scheduler.node objects/node.go:403 running predicates failed \{"allocationKey": "49f01ed0-3473-4521-b11f-80e27adb7250", "nodeID": "kwok-node-4kls4", "allocateFlag": true, "error": "predicates were not running because pod or node was not found in cache"}}}
{{2024-04-17T21:25:34.641Z DEBUG core.scheduler.application objects/application.go:1442 app reservation check \{"allocationKey": "49f01ed0-3473-4521-b11f-80e27adb7250", "createTime": "2024-04-17T21:15:53.175Z", "askAge": "9m41.465412027s", "reservationDelay": "2s"}}}
{quote}
The same node appears to be evaluated multiple times for the same allocation key:

{{2024-04-17T21:32:19.446Z DEBUG core.scheduler.node objects/node.go:403 running predicates failed \{"allocationKey": "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", "allocateFlag": true, "error": "predicates were not running because pod or node was not found in cache"}}}

{{2024-04-17T21:33:24.417Z DEBUG core.scheduler.node objects/node.go:403 running predicates failed \{"allocationKey": "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", "allocateFlag": true, "error": "predicates were not running because pod or node was not found in cache"}}}

{{2024-04-17T21:33:24.417Z DEBUG core.scheduler.node objects/node.go:403 running predicates failed \{"allocationKey": "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", "allocateFlag": true, "error": "predicates were not running because pod or node was not found in cache"}}}
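
For what it's worth, a small stand-alone Go snippet (not part of YuniKorn; the parsing regex is an assumption about the raw, unescaped log format) can tally these entries by allocationKey/nodeID pair to quantify how often the same node is re-evaluated for the same stale ask:

{code:go}
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// countPredicateFailures tallies "running predicates failed" entries by
// allocationKey/nodeID pair, making repeated evaluations easy to spot.
func countPredicateFailures(lines []string) map[string]int {
	// Matches the structured fields as they appear in the core DEBUG logs;
	// nodeID always follows allocationKey on the same line.
	re := regexp.MustCompile(`"allocationKey":\s*"([^"]+)".*"nodeID":\s*"([^"]+)"`)
	counts := make(map[string]int)
	for _, line := range lines {
		if !strings.Contains(line, "running predicates failed") {
			continue
		}
		if m := re.FindStringSubmatch(line); m != nil {
			counts[m[1]+"/"+m[2]]++
		}
	}
	return counts
}

func main() {
	// Two of the entries quoted above, unescaped back to raw log form.
	sample := []string{
		`2024-04-17T21:32:19.446Z DEBUG core.scheduler.node objects/node.go:403 running predicates failed {"allocationKey": "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", "allocateFlag": true, "error": "predicates were not running because pod or node was not found in cache"}`,
		`2024-04-17T21:33:24.417Z DEBUG core.scheduler.node objects/node.go:403 running predicates failed {"allocationKey": "2be04314-bed0-4385-9ae7-50ed0ef9d9d5", "nodeID": "kwok-node-zzn7w", "allocateFlag": true, "error": "predicates were not running because pod or node was not found in cache"}`,
	}
	for key, n := range countPredicateFailures(sample) {
		fmt.Printf("%d occurrences: %s\n", n, key)
	}
}
{code}

Running this over the full log (e.g. after reading it line by line from a file) would confirm whether the counts keep growing for keys that no longer exist in the cluster.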

> Discrepancy between shim cache and core app/task list after scheduler restart
> -----------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2526
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2526
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Shravan Achar
>            Priority: Major
>         Attachments: log-snippet.txt, state-dump-4-1-3.json, 
> state-dump-4-17.json.zip
>
>
> When the scheduler restarts, it occasionally gets into a situation where an 
> application is still in the Running state even though the application has 
> been terminated in the cluster. This is confirmed by the attached state dump.
>  
> The scheduler core logs indicate that every node is being evaluated for a 
> non-existent application (also attached). CPU is being wasted on this 
> unnecessary evaluation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
