[ 
https://issues.apache.org/jira/browse/YUNIKORN-3089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned YUNIKORN-3089:
--------------------------------------

    Assignee: Peter Bacsko

> Web UI shows stale "New" state applications that are no longer present in the 
> cluster
> -------------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-3089
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3089
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Mit Desai
>            Assignee: Peter Bacsko
>            Priority: Major
>         Attachments: yunikorn-spark-cd26dba9a9d54b2089eafe73562efc4d.log
>
>
> We are experiencing an issue where the YuniKorn Web UI continues to display 
> applications in the *New* state, even though these applications are no longer 
> present in the Kubernetes cluster. The list of such stale applications grows 
> over time while the scheduler is running, and is cleared only upon a 
> scheduler restart. In one instance, we observed this list growing to more 
> than 1200 stale applications.
> This issue is reproducible even with the *1.6.3 build* running with the 
> *YUNIKORN-3084 patch* applied.
> *Steps to Reproduce:*
>  # Create pods that fail immediately due to constraints (e.g., Kyverno policy 
> violations).
>  # Observe in the Web UI that applications remain in the New state even after 
> the pods are deleted from the cluster.
>  # Over time, the list of applications in the New state keeps growing.
>  # Restarting the scheduler resets the list, but the problem reappears as the 
> scheduler continues to run.
> *Observations:*
>  * Applications remain in the *New* state in the Web UI, even after their 
> corresponding pods are deleted from the cluster.
>  * The problem appears to be related to the order and timing of create/delete 
> events received by the core.
>  * When a pod fails immediately (e.g., due to Kyverno policy violations), the 
> shim receives both create and delete requests, but the core does not create 
> the app in the partition context in time for the delete to be processed.
>  * The core eventually receives the create request, but the corresponding 
> delete had already arrived before it, so the application remains in the New 
> state indefinitely.
>  * The shim does not take any further action, leaving the application in this 
> stale state until a scheduler restart.
>  
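
The ordering problem described in the observations can be illustrated with a toy model. This is only a sketch under stated assumptions: the `core` type, `addApplication`, and `removeApplication` are hypothetical names invented for this example and do not mirror the actual YuniKorn code. It assumes that a delete for an application the core does not yet know about is silently dropped.

```go
package main

import "fmt"

// appState is a simplified stand-in for an application state in the core.
type appState string

const stateNew appState = "New"

// core is a toy model of the partition context in the scheduler core.
// All names here are hypothetical and exist only for this illustration.
type core struct {
	apps map[string]appState
}

func newCore() *core {
	return &core{apps: make(map[string]appState)}
}

// addApplication registers the application in the New state
// unless it is already tracked.
func (c *core) addApplication(id string) {
	if _, ok := c.apps[id]; !ok {
		c.apps[id] = stateNew
	}
}

// removeApplication deletes a tracked application and reports whether
// anything was removed. Assumption: a delete for an unknown application
// is dropped without leaving any record behind.
func (c *core) removeApplication(id string) bool {
	if _, ok := c.apps[id]; !ok {
		return false
	}
	delete(c.apps, id)
	return true
}

func main() {
	c := newCore()

	// The pod fails immediately, so in this model the delete
	// is processed before the create.
	removed := c.removeApplication("app-1")
	c.addApplication("app-1")

	fmt.Println("delete applied:", removed)
	fmt.Println("apps still tracked:", len(c.apps), "state:", c.apps["app-1"])
}
```

In this model the delete finds nothing to remove, the later create registers the application, and no further event ever clears it, which matches the stale New entries reported above.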



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: issues-h...@yunikorn.apache.org
