[ 
https://issues.apache.org/jira/browse/YUNIKORN-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg reassigned YUNIKORN-3218:
-----------------------------------------------

    Assignee:     (was: Peter Bacsko)

> application-id reuse concurrency issue: remove-add race condition
> -----------------------------------------------------------------
>
>                 Key: YUNIKORN-3218
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3218
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: he zheng yu
>            Priority: Critical
>
> * andrew's 1st message:
> Hi team, we had a concurrency issue.
> we run workloads via argowf step by step, sequentially; all pods share the 
> same applicationId and are scheduled by YK.
> when argowf was slow and happened to create pods at an interval of about 30 
> seconds (the pod for the next step is created ~30s after the previous pod 
> finishes), the app-id was terminated and removed just as the next pod with 
> the same applicationId arrived, and the same old app-id was added again. the 
> concurrency here is not safe: we observed orphan tasks/pods whose app-id had 
> already been removed, and those tasks/pods were stuck forever.
> could anyone help to look into this issue?
>  * andrew's argowf to reproduce the issue:
> @Po Han Huang Hi, the workflow below can reproduce it; I succeeded.
> you can modify the wait duration: in my company env I used 28s and reproduced 
> the issue; on my laptop I used 28400ms and reproduced it.
> apiVersion: argoproj.io/v1alpha1
> kind: Workflow
> metadata:
>   generateName: steps-with-suspend-
>   namespace: argo
> spec:
>   entrypoint: hello-and-wait
>   templates:
>   - name: hello-and-wait
>     steps:
>     - - name: hello1
>         template: print-message
>     - - name: wait1
>         template: wait
>     - - name: hello2
>         template: print-message
>     - - name: wait2
>         template: wait
>     - - name: hello3
>         template: print-message
>   - name: print-message
>     metadata:
>       labels:
>         applicationId: "{{workflow.name}}"
>         queue: "root.default"
>         schedulerName: yunikorn
>     container:
>       image: busybox
>       command: [echo]
>       args: ["============================="]
>   - name: wait
>     suspend:
>       duration: "28400ms"
>  * log file: 
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1769317449159249?thread_ts=1768967309.491029&cid=CLNUW68MU]
>  * [~wilfreds]'s last reply:
> """
> We allow re-use 30 seconds after the last allocation was removed. Until that 
> point in time the application still exists.
> This is what happens:
>  * submit a pod -> creates an allocation request
>  * an application is generated based on the application ID on the pod or the 
> auto generation settings -> application is created
>  * pods get submitted referencing the same ID: new allocation requests are 
> linked to that application ID, normal processing.
>  * the application stays in a RUNNING state until all the allocation requests 
> have been allocated and have finished
>  * if no allocations (pending or running) are linked to the application it 
> moves to a COMPLETING state. It can happen that a pod's exit triggers a new 
> pod creation. We wait in COMPLETING state for 30 seconds. Any new pod that 
> comes in during that time frame will be tracked under the same application 
> ID. A new application with the same ID cannot be created.
>  * After the 30 seconds the application will be moved to COMPLETED.
>  ** tell the shim the app is done
>  ** remove the application from the queue
>  ** move app to terminated list to allow reuse
>  ** clean up state tracking to reduce memory footprint
> The application ID is now free for re-use by a new application.
> What happens in your case is that the shim update and a new allocation cross 
> each other. The core tells the shim of a change via one channel while the 
> shim tells the core via another. Neither side does in-depth checks to 
> prevent issues. The shim blindly removes the app without checking whether 
> anything is still pending; if it did, the removal would fail. The core 
> assumes nothing has changed since the timer was triggered and cleans up too; 
> if it checked, it would not do that either.
> We can make it a bit more robust on the core side with a simple change that 
> causes a rejection. We should really fix both sides, but that is more work 
> which would need planning. The real solution is getting rid of the 
> application state in the shim. A large undertaking…
> Could you file a jira to track all of this?
> """



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
