[ https://issues.apache.org/jira/browse/YUNIKORN-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

he zheng yu updated YUNIKORN-3218:
----------------------------------
    Description: 
* andrew's 1st message:
Hi team, we hit a concurrency issue.
We run workloads via Argo Workflows (argowf) step by step, sequentially; all 
pods share the same applicationId and are scheduled by YuniKorn.
When argowf was slow and happened to create pods at an interval of about 30 
seconds (the pod for the next step is created ~30s after the previous pod 
finishes), the app-id was terminated and removed; just as the next pod with 
the same applicationId arrived, the same old app-id was added again. The 
concurrency here is not safe: we observed orphan tasks/pods whose app-id had 
already been removed, and those tasks/pods were stuck forever.
Could anyone help look into this issue?
 * andrew's argowf to reproduce the issue:

@Po Han Huang Hi, the workflow below can reproduce the issue; I reproduced it 
successfully. You can adjust the wait duration: in my company environment I 
used 28s and hit the issue, and on my laptop I used 28400ms and hit it as well.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-with-suspend-
  namespace: argo

spec:
  entrypoint: hello-and-wait
  templates:
  - name: hello-and-wait
    steps:
    - - name: hello1
        template: print-message
    - - name: wait1
        template: wait
    - - name: hello2
        template: print-message
    - - name: wait2
        template: wait
    - - name: hello3
        template: print-message

  - name: print-message
    metadata:
      labels:
        applicationId: "{{{}workflow.name{}}}"
        queue: "root.default"
        schedulerName: yunikorn
    container:
      image: busybox
      command: [echo]
      args: ["============================="]

  - name: wait
    suspend:
      duration: "28400ms"
 * log file: 
[https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1769317449159249?thread_ts=1768967309.491029&cid=CLNUW68MU]
 * [~wilfreds]'s last reply:

We allow re-use of the application ID 30 seconds after the last allocation 
was removed. Until that point in time the application still exists.

This is what happens:
 * submit a pod -> creates an allocation request
 * an application is generated based on the application ID on the pod or the 
auto-generation settings -> application is created
 * pods get submitted referencing the same ID: new allocation requests are 
linked to that application ID, normal processing
 * the application stays in a RUNNING state until all the allocation requests 
have been allocated and have finished
 * if no allocations (pending or running) are linked to the application, it 
moves to a COMPLETING state. It can happen that a pod's exit triggers a new 
pod creation, so we wait in the COMPLETING state for 30 seconds. Any new pod 
that comes in within that time frame will be tracked under the same 
application ID; a new application with the same ID cannot be created.
 * After the 30 seconds the application is moved to COMPLETED:
 ** tell the shim the app is done
 ** remove the application from the queue
 ** move the app to the terminated list to allow reuse
 ** clean up state tracking to reduce the memory footprint



The application ID is now free for re-use by a new application (this 
lifecycle is sketched below).
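
To make the COMPLETING window concrete, here is a minimal Go sketch of the 
timer-driven lifecycle described above. It uses simplified, hypothetical names 
(app, completingTimeout, and so on) and a shortened window; it is not 
YuniKorn's actual code.

package main

import (
	"fmt"
	"sync"
	"time"
)

// completingTimeout stands in for the 30-second reuse window described above;
// it is shortened so the example finishes quickly.
const completingTimeout = 2 * time.Second

// app is a hypothetical, simplified application record; the real scheduler
// objects and state machine are more involved.
type app struct {
	mu          sync.Mutex
	state       string // RUNNING, COMPLETING, COMPLETED
	allocations int
}

// removeAllocation: a pod finished. When the last allocation goes away the
// app moves to COMPLETING and the completion timer starts.
func (a *app) removeAllocation() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.allocations--
	if a.allocations == 0 {
		a.state = "COMPLETING"
		go a.completeAfterTimeout()
	}
}

// addAllocation: a new pod with the same applicationId arrived. While the app
// is still COMPLETING it is simply tracked under the same ID again.
func (a *app) addAllocation() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.allocations++
	a.state = "RUNNING"
}

// completeAfterTimeout moves the app to COMPLETED once the window expires,
// unless a new allocation pulled it back to RUNNING in the meantime.
func (a *app) completeAfterTimeout() {
	time.Sleep(completingTimeout)
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.state == "COMPLETING" && a.allocations == 0 {
		a.state = "COMPLETED" // the application ID becomes free for re-use
	}
}

func main() {
	a := &app{state: "RUNNING", allocations: 1}
	a.removeAllocation()              // last pod finished -> COMPLETING
	time.Sleep(completingTimeout / 2) // next pod arrives inside the window
	a.addAllocation()                 // tracked under the same ID, back to RUNNING
	time.Sleep(completingTimeout)     // the old timer fires, finds the app RUNNING
	fmt.Println(a.state, a.allocations) // prints: RUNNING 1
}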

What happens in your case is that the shim update and a new allocation cross 
each other. The core tells the shim about a change via one channel, while the 
shim tells the core via another channel. Neither side does in-depth checks to 
prevent issues: the shim blindly removes the app without checking whether 
anything is still pending (if it did, the removal would fail), and the core 
assumes nothing has changed since the timer was triggered and cleans up too 
(if it checked, it would also not do that).
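
As an illustration of that crossing, here is a minimal Go sketch, again with 
hypothetical names (scheduler, completeApp, addAllocation) rather than the 
real core/shim types or channels: the removal and the new allocation travel 
on separate channels, neither handler re-validates, and the pod behind the 
late allocation can end up orphaned, as reported above.

package main

import (
	"fmt"
	"sync"
)

// scheduler is a hypothetical stand-in for the core's bookkeeping.
type scheduler struct {
	mu   sync.Mutex
	apps map[string]int // appID -> number of tracked allocations
}

// completeApp is the cleanup triggered when the completion timer fires.
// It removes the app without re-checking whether anything new arrived since
// the timer was armed -- the "blind" removal described above.
func (s *scheduler) completeApp(appID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.apps, appID)
}

// addAllocation is the update for a newly created pod arriving on the other
// channel. If the app was already removed, the allocation has nothing to
// attach to and the pod it represents is left orphaned.
func (s *scheduler) addAllocation(appID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.apps[appID]; !ok {
		fmt.Println("orphan allocation: application", appID, "was already removed")
		return
	}
	s.apps[appID]++
}

func main() {
	s := &scheduler{apps: map[string]int{"steps-with-suspend-x": 0}}

	removeCh := make(chan string) // completion-timer direction (simplified)
	addCh := make(chan string)    // new-pod direction (simplified)

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for id := range removeCh {
			s.completeApp(id)
		}
	}()
	go func() {
		defer wg.Done()
		for id := range addCh {
			s.addAllocation(id)
		}
	}()

	// The two messages cross in flight; whichever is processed first decides
	// whether the new pod is tracked normally or orphaned.
	removeCh <- "steps-with-suspend-x"
	addCh <- "steps-with-suspend-x"
	close(removeCh)
	close(addCh)
	wg.Wait()
	fmt.Println("remaining apps:", s.apps)
}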

We can make this a bit more robust on the core side with a simple change that 
causes a rejection. We should really fix both sides, but that is more work and 
would need planning. The real solution is getting rid of the application state 
in the shim, which is a large undertaking…
Could you file a jira to track all of this?
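
For reference, one possible shape of the "simple change" on the core side 
mentioned above, purely as a sketch with hypothetical names (core, 
tryRemoveApplication) and not the actual patch: re-check for pending requests 
under the same lock the add path takes, and reject the removal if anything 
new arrived while the completion timer was pending.

package main

import (
	"fmt"
	"sync"
)

// core is a hypothetical stand-in for the core's application bookkeeping.
type core struct {
	mu           sync.Mutex
	pendingAsks  map[string]int  // appID -> allocation requests not yet satisfied
	applications map[string]bool // appID -> application still tracked
}

// tryRemoveApplication re-validates, under the same lock that new requests
// take, that nothing is pending before removing the application. If a new
// ask slipped in while the completion timer was pending, the removal is
// rejected and the application stays alive.
func (c *core) tryRemoveApplication(appID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.pendingAsks[appID] > 0 {
		return false // reject: something still references this application
	}
	delete(c.applications, appID)
	return true
}

func main() {
	c := &core{
		// A new ask arrived after the completion timer fired.
		pendingAsks:  map[string]int{"steps-with-suspend-x": 1},
		applications: map[string]bool{"steps-with-suspend-x": true},
	}
	fmt.Println("removed:", c.tryRemoveApplication("steps-with-suspend-x")) // removed: false
}

As noted above, the same re-validation idea would also be needed on the shim 
side for a complete fix.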


> application-id reusing concurrency issue: remove-add race condition
> -------------------------------------------------------------------
>
>                 Key: YUNIKORN-3218
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3218
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: he zheng yu
>            Assignee: Peter Bacsko
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
