[ 
https://issues.apache.org/jira/browse/YUNIKORN-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

he zheng yu updated YUNIKORN-3218:
----------------------------------
    Description: 
* andrew's 1st message:
Hi team, we hit a concurrency issue.
We run workloads via an Argo Workflow step by step sequentially; all pods share 
the same applicationId and are scheduled by YuniKorn.
When the workflow was slow and happened to create pods at an interval of about 
30 seconds (the pod for the next step is created roughly 30s after the previous 
pod finishes), the app-id was terminated and removed. Just as the next pod with 
the same applicationId arrived, the same old app-id was added again. The 
concurrency here is not safe: we observed orphan tasks/pods whose app-id had 
already been removed, and those tasks/pods were stuck forever.
Could anyone help look into this issue?
 * andrew's argowf to reproduce the issue:

@Po Han Huang Hi, the workflow below can reproduce the issue; I succeeded with 
it.
You can modify the wait duration: in my company environment I used 28s and 
reproduced the issue, and on my laptop I used 28400ms and reproduced it as well.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-with-suspend-
  namespace: argo

spec:
  entrypoint: hello-and-wait
  templates:
  - name: hello-and-wait
    steps:
    - - name: hello1
        template: print-message
    - - name: wait1
        template: wait
    - - name: hello2
        template: print-message
    - - name: wait2
        template: wait
    - - name: hello3
        template: print-message

  - name: print-message
    metadata:
      labels:
        applicationId: "\{{workflow.name}}"
        queue: "root.default"
        schedulerName: yunikorn
    container:
      image: busybox
      command: [echo]
      args: ["============================="]

  - name: wait
    suspend:
      duration: "28400ms"

 * log file: 
[https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1769317449159249?thread_ts=1768967309.491029&cid=CLNUW68MU]
 * [~wilfreds]'s last reply:

We allow re-use 30 seconds after the last allocation was removed. Until that 
point in time the application still exists.

This is what happens:
 * submit a pod -> creates an allocation request
 * an application is generated based on the application ID on the pod or the 
auto-generation settings -> application is created
 * pods get submitted referencing the same ID: new allocation requests are 
linked to that application ID, normal processing.
 * the application stays in a RUNNING state until all the allocation requests 
have been allocated and have finished
 * if no allocations (pending or running) are linked to the application it 
moves to a COMPLETING state. It can happen that a pod's exit triggers a new pod 
creation, so we wait in the COMPLETING state for 30 seconds. Any new pod that 
comes in within that time frame will be tracked under the same application ID; 
a new application with the same ID cannot be created.
 * After the 30 seconds the application will be moved to COMPLETED:
 ** tell the shim the app is done
 ** remove the application from the queue
 ** move the app to the terminated list to allow reuse
 ** clean up state tracking to reduce the memory footprint



The application ID is now free for re-use by a new application.
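
For illustration only, here is a minimal Go sketch of the completion window 
described above. This is not the actual YuniKorn core code; the type and 
function names are hypothetical, and only the RUNNING/COMPLETING/COMPLETED 
states and the 30 second timer follow the description.

// Illustrative sketch only, not YuniKorn source; all names are hypothetical.
package main

import (
	"fmt"
	"sync"
	"time"
)

type appState string

const (
	running    appState = "RUNNING"
	completing appState = "COMPLETING"
	completed  appState = "COMPLETED"
)

type application struct {
	sync.Mutex
	id          string
	state       appState
	allocations int
	timer       *time.Timer
}

// onAllocationRemoved: when the last allocation goes away the app moves to
// COMPLETING and the 30 second completion timer starts.
func (a *application) onAllocationRemoved() {
	a.Lock()
	defer a.Unlock()
	a.allocations--
	if a.allocations == 0 {
		a.state = completing
		a.timer = time.AfterFunc(30*time.Second, a.complete)
	}
}

// onNewAllocation: a pod arriving inside the 30 second window stops the timer
// and the same application keeps tracking the new request.
func (a *application) onNewAllocation() {
	a.Lock()
	defer a.Unlock()
	if a.state == completing && a.timer.Stop() {
		a.state = running
	}
	a.allocations++
}

// complete: the timer fired, the application ID becomes free for re-use.
func (a *application) complete() {
	a.Lock()
	defer a.Unlock()
	a.state = completed
	fmt.Printf("app %s completed, ID free for re-use\n", a.id)
}

func main() {
	app := &application{id: "steps-with-suspend-x", state: running, allocations: 1}
	app.onAllocationRemoved() // last pod finished -> COMPLETING, timer armed
	app.onNewAllocation()     // next step's pod arrives in time -> back to RUNNING
	fmt.Println(app.state)
}

In the real system the completion runs in the core while the new pod arrives 
through the shim, so the two events travel over different channels and can 
cross, which is the race described in the next paragraph.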

What happens in your case is that the shim update and a new allocation cross 
each other. The core tells the shim about a change via one channel while the 
shim tells the core via another channel. Neither side does in-depth checks to 
prevent issues. The shim blindly removes the app without checking whether 
anything is still pending; if it did check, that removal would fail. The core 
assumes nothing has changed since the timer was triggered and cleans up too; if 
it checked, it would not do that either.

We can make it a bit more robust on the core side with a simple change that 
causes the removal to be rejected (see the sketch below). We should really fix 
both sides, but that is more work which would need planning. The real solution 
is getting rid of the application state in the shim. A large undertaking…
[4:49 PM]
Could you file a jira to track all of this?
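
As an illustration of the "simple change that causes the removal to be 
rejected" mentioned above, a minimal sketch, assuming a hypothetical 
partition/app structure (not the actual core code):

// Illustrative sketch only; not the actual core code, names are hypothetical.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type app struct {
	id      string
	pending int // pending allocation requests
}

type partition struct {
	sync.Mutex
	apps map[string]*app
}

// removeCompletedApp is the kind of guard hinted at above: before the core
// cleans up an application whose completion timer fired, it re-checks under
// the lock that nothing new arrived in the meantime and rejects the removal
// if something did.
func (p *partition) removeCompletedApp(appID string) error {
	p.Lock()
	defer p.Unlock()
	a, ok := p.apps[appID]
	if !ok {
		return nil // already gone
	}
	if a.pending > 0 {
		return errors.New("removal rejected: application has pending requests")
	}
	delete(p.apps, appID)
	return nil
}

func main() {
	p := &partition{apps: map[string]*app{"wf-1": {id: "wf-1", pending: 1}}}
	if err := p.removeCompletedApp("wf-1"); err != nil {
		fmt.Println(err) // the crossing update is detected and the app survives
	}
}

The point of the sketch is only that the cleanup re-checks state instead of 
assuming nothing changed since the timer fired.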

  was:
We are experiencing an issue where the YuniKorn Web UI continues to display 
applications in the *New* state, even though these applications are no longer 
present in the Kubernetes cluster. The list of such stale applications grows 
over time while the scheduler is running, and is cleared only upon a scheduler 
restart. In one instance, we observed this list growing to more than 1200 stale 
applications.

This issue is reproducible even with the *1.6.3 build* running with the 
*YUNIKORN-3084 patch* applied.

*Steps to Reproduce:*
 # Create pods that fail immediately due to constraints (e.g., Kyverno policy 
violations).
 # Observe in the Web UI that applications remain in the New state even after 
the pods are deleted from the cluster.
 # Over time, the list of applications in the New state keeps growing.
 # Restarting the scheduler resets the list, but the problem reappears as the 
scheduler continues to run.

*Observations:*
 * Applications remain in the *New* state in the Web UI, even after their 
corresponding pods are deleted from the cluster.
 * The problem appears to be related to the order and timing of create/delete 
events received by the core.
 * When a pod fails immediately (e.g., due to Kyverno policy violations), the 
shim receives both create and delete requests, but the core does not create the 
app in the partition context in time for the delete to be processed.
 * The core eventually receives the create request, but the corresponding 
delete arrived before it and could not be processed, resulting in the 
application remaining in the New state indefinitely (see the sketch after this 
list).
 * The shim does not take any further action, leaving the application in this 
stale state until a scheduler restart.
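
A minimal sketch of the ordering problem described in the observations above, 
assuming hypothetical names (not the actual shim/core code): a delete that 
arrives before the core has created the application is dropped, and the later 
create then leaves the app in New forever.

// Illustrative sketch only; not YuniKorn code, names are hypothetical.
package main

import "fmt"

type cache struct {
	apps map[string]bool // applications present in the partition context
}

// handleDelete mimics the failure mode described above: a delete for an
// application the core has not created yet is silently dropped.
func (c *cache) handleDelete(appID string) {
	if !c.apps[appID] {
		fmt.Printf("delete for %s ignored: app not found\n", appID)
		return
	}
	delete(c.apps, appID)
}

// handleCreate arrives later; with the delete already lost, the app now
// sits in the New state until a scheduler restart.
func (c *cache) handleCreate(appID string) {
	c.apps[appID] = true
}

func main() {
	c := &cache{apps: map[string]bool{}}
	c.handleDelete("app-1") // delete processed first -> dropped
	c.handleCreate("app-1") // create processed second -> app stays New
	fmt.Println("stale apps:", len(c.apps))
}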

 


> application-id reuse concurrency issue: remove-add race condition
> -------------------------------------------------------------------
>
>                 Key: YUNIKORN-3218
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3218
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: he zheng yu
>            Assignee: Peter Bacsko
>            Priority: Critical
>



