[ 
https://issues.apache.org/jira/browse/YUNIKORN-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17983657#comment-17983657
 ] 

Peter Bacsko commented on YUNIKORN-3089:
----------------------------------------

Unfortunately, this bug has been in the codebase for a long time. We can't 
safely remove asks from the core as long as the application object is not yet 
available inside the {{PartitionContext}}. As a result, a quick, almost 
immediate termination of the pod leaves app objects accumulating, because 
app creation races with task removal.

A possible solution: check on the shim side whether the application is in the 
Accepted state. If it is not, add the task to a list of pending tasks that 
should be removed. Then, during the state transition or inside 
{{Application.postAppAccepted()}}, call {{task.releaseAllocation()}} for each 
of those pending tasks.
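
A minimal Go sketch of that idea, not the actual shim code. The method names {{postAppAccepted}} and {{releaseAllocation}} come from the proposal above; the {{Task}} and {{Application}} types, the {{RemoveTask}} helper, the {{pendingRemoval}} field and the two-state model are assumptions made only for illustration.

```go
package main

import "sync"

// Only two states are modeled here; the real shim has more (assumption).
const (
	StateNew      = "New"
	StateAccepted = "Accepted"
)

// Task is a hypothetical stand-in for the shim's task object.
type Task struct {
	ID       string
	Released bool
}

// releaseAllocation is named after the method in the proposal; in this
// sketch it only records that the release was requested.
func (t *Task) releaseAllocation() {
	t.Released = true
}

// Application is a hypothetical stand-in for the shim's application object.
type Application struct {
	mu             sync.Mutex
	state          string
	pendingRemoval []*Task // tasks removed before the app was accepted
}

func NewApplication() *Application {
	return &Application{state: StateNew}
}

// RemoveTask (hypothetical helper) releases the task right away if the app
// is already accepted; otherwise it parks the task until acceptance.
func (a *Application) RemoveTask(t *Task) {
	a.mu.Lock()
	if a.state != StateAccepted {
		a.pendingRemoval = append(a.pendingRemoval, t)
		a.mu.Unlock()
		return
	}
	a.mu.Unlock()
	t.releaseAllocation()
}

// postAppAccepted marks the app as accepted and drains the pending list,
// so removals that arrived too early are no longer lost.
func (a *Application) postAppAccepted() {
	a.mu.Lock()
	a.state = StateAccepted
	pending := a.pendingRemoval
	a.pendingRemoval = nil
	a.mu.Unlock()

	for _, t := range pending {
		t.releaseAllocation()
	}
}
```

The state check and the append happen under the same lock, so a removal cannot slip in between the state change and the drain. The release calls run outside the lock to avoid holding it while talking to the core.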

> Web UI shows stale "New" state applications that are no longer present in the 
> cluster
> -------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-3089
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3089
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Mit Desai
>            Assignee: Peter Bacsko
>            Priority: Major
>         Attachments: yunikorn-spark-cd26dba9a9d54b2089eafe73562efc4d.log
>
>
> We are experiencing an issue where the YuniKorn Web UI continues to display 
> applications in the *New* state, even though these applications are no longer 
> present in the Kubernetes cluster. The list of such stale applications grows 
> over time while the scheduler is running, and is cleared only upon a 
> scheduler restart. In one instance, we observed this list growing to over 
> 1200 stale applications.
> This issue is reproducible even with the *1.6.3 build* running with the 
> *YUNIKORN-3084 patch* applied.
> *Steps to Reproduce:*
>  # Create pods that fail immediately due to constraints (e.g., Kyverno policy 
> violations).
>  # Observe in the Web UI that applications remain in the New state even after 
> the pods are deleted from the cluster.
>  # Over time, the list of applications in the New state keeps growing.
>  # Restarting the scheduler resets the list, but the problem reappears as the 
> scheduler continues to run.
> *Observations:*
>  * Applications remain in the *New* state in the Web UI, even after their 
> corresponding pods are deleted from the cluster.
>  * The problem appears to be related to the order and timing of create/delete 
> events received by the core.
>  * When a pod fails immediately (e.g., due to Kyverno policy violations), the 
> shim receives both create and delete requests, but the core does not create 
> the app in the partition context in time for the delete to be processed.
>  * The core eventually receives the create request, but the corresponding 
> delete had already been received before that, resulting in the application 
> remaining in the New state indefinitely.
>  * The shim does not take any further action, leaving the application in this 
> stale state until a scheduler restart.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
