[ https://issues.apache.org/jira/browse/YUNIKORN-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840772#comment-17840772 ]

Peter Bacsko commented on YUNIKORN-2526:
----------------------------------------

[~shravan-achar] please upload YuniKorn logs that capture the time frame from 
the restart until the problem appears. We need to examine what happens to 
allocations like "49f01ed0-3473-4521-b11f-80e27adb7250" and 
"2be04314-bed0-4385-9ae7-50ed0ef9d9d5". These are pod UIDs which are no longer 
in the scheduler cache. My theory is that these pods have been deleted in the 
meantime, but the update failed to reach the scheduler core, so the core keeps 
attempting to schedule them.
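To make the theory concrete, here is a deliberately simplified Go sketch (not YuniKorn's actual API; the `Shim`, `Core`, and `deletePod` names are hypothetical) of how a lost delete event leaves the shim cache and the core's pending-allocation list out of sync, so the core retries a pod that no longer exists:

```go
package main

import "fmt"

// Core models the scheduler core's pending-allocation list.
// Hypothetical type, not YuniKorn's real data structure.
type Core struct {
	pending map[string]bool // pod UID -> still waiting to be placed
}

// Shim models the Kubernetes shim's pod cache.
type Shim struct {
	pods map[string]bool // pod UID -> pod exists in the cluster
}

// deletePod removes the pod from the shim cache and, only if deliverEvent
// is true, forwards the removal to the core. Passing false models the
// suspected bug: the update never reaches the core.
func deletePod(shim *Shim, core *Core, uid string, deliverEvent bool) {
	delete(shim.pods, uid)
	if deliverEvent {
		delete(core.pending, uid)
	}
}

// scheduleCycle returns the allocations the core would still try to place
// on every scheduling pass.
func scheduleCycle(core *Core) []string {
	var retry []string
	for uid := range core.pending {
		retry = append(retry, uid)
	}
	return retry
}

func main() {
	uid := "49f01ed0-3473-4521-b11f-80e27adb7250" // one of the UIDs from the logs
	shim := &Shim{pods: map[string]bool{uid: true}}
	core := &Core{pending: map[string]bool{uid: true}}

	// The pod is deleted, but the delete event is lost before the core sees it.
	deletePod(shim, core, uid, false)

	// The shim no longer knows the pod, yet the core keeps retrying it,
	// evaluating every node on each cycle and burning CPU.
	fmt.Println("shim knows pod:", shim.pods[uid])
	fmt.Println("core still retries:", scheduleCycle(core))
}
```

The logs between restart and first occurrence should show whether the delete for these UIDs ever reached the core, which would confirm or refute this model.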

> Discrepancy between shim cache and core app/task list after scheduler restart
> -----------------------------------------------------------------------------
>
>                 Key: YUNIKORN-2526
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2526
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Shravan Achar
>            Assignee: Peter Bacsko
>            Priority: Major
>         Attachments: log-snippet.txt, state-dump-4-1-3.json, 
> state-dump-4-17.json.zip
>
>
> When the scheduler restarts, it occasionally gets into a situation where an 
> application is still in the Running state even though the application has 
> been terminated in the cluster. This is confirmed by the attached state dump.
>  
> The scheduler core logs (also attached) show that all nodes are being 
> evaluated for a non-existent application. CPU is being used up doing this 
> unneeded evaluation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
