[
https://issues.apache.org/jira/browse/YUNIKORN-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wilfred Spiegelenburg reassigned YUNIKORN-3218:
-----------------------------------------------
Assignee: (was: Peter Bacsko)
> application-id reusing concurrency issue: remove-add race condition
> -------------------------------------------------------------------
>
> Key: YUNIKORN-3218
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3218
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: he zheng yu
> Priority: Critical
>
> * andrew's 1st message:
> Hi team, we hit a concurrency issue.
> We run workloads via argowf step by step, sequentially; all pods share the
> same applicationId and are scheduled by YK.
> When argowf was slow and happened to create pods at an interval of about 30
> seconds (the pod for the next step is created ~30s after the previous pod
> finishes), the app-id was terminated and removed just as the next pod with
> the same applicationId arrived, and the same old app-id was added again. The
> concurrency here is not safe: we observed orphan tasks/pods whose app-id had
> already been removed, and those tasks/pods were stuck forever.
> Could anyone help look into this issue?
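> A minimal sketch (plain Go, hypothetical names, not YuniKorn code) of the
> shape of the race described above: a timer-driven removal and a new task for
> the same applicationId run on independent paths, so the interleaving decides
> whether the new task ends up orphaned.
>
> package main
>
> import (
>     "fmt"
>     "sync"
> )
>
> type app struct {
>     id    string
>     tasks []string
> }
>
> type scheduler struct {
>     mu   sync.Mutex
>     apps map[string]*app
> }
>
> // removeApp models the timer-driven cleanup: it does not re-check whether
> // new tasks were attached after the timer fired.
> func (s *scheduler) removeApp(id string) {
>     s.mu.Lock()
>     defer s.mu.Unlock()
>     delete(s.apps, id)
> }
>
> // addTask models a new pod arriving with the same applicationId.
> func (s *scheduler) addTask(id, task string) {
>     s.mu.Lock()
>     defer s.mu.Unlock()
>     a, ok := s.apps[id]
>     if !ok {
>         a = &app{id: id}
>         s.apps[id] = a
>     }
>     a.tasks = append(a.tasks, task)
> }
>
> func main() {
>     s := &scheduler{apps: map[string]*app{"wf-1": {id: "wf-1"}}}
>     var wg sync.WaitGroup
>     wg.Add(2)
>     // Cleanup path: the grace timer for "wf-1" has expired.
>     go func() { defer wg.Done(); s.removeApp("wf-1") }()
>     // Submission path: the next workflow step creates its pod right now.
>     go func() { defer wg.Done(); s.addTask("wf-1", "step-2-pod") }()
>     wg.Wait()
>     if a, ok := s.apps["wf-1"]; ok {
>         fmt.Println("add won the race, app kept with tasks:", a.tasks)
>     } else {
>         fmt.Println("remove won the race: step-2-pod is orphaned")
>     }
> }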
> * andrew's argowf to reproduce the issue:
> @Po Han Huang Hi, the workflow below can reproduce the issue; I succeeded
> with it. You can modify the wait duration: in my company env I used 28s and
> reproduced the issue; on my laptop I used 28400ms and reproduced it.
> apiVersion: argoproj.io/v1alpha1
> kind: Workflow
> metadata:
>   generateName: steps-with-suspend-
>   namespace: argo
> spec:
>   entrypoint: hello-and-wait
>   templates:
>     - name: hello-and-wait
>       steps:
>         - - name: hello1
>             template: print-message
>         - - name: wait1
>             template: wait
>         - - name: hello2
>             template: print-message
>         - - name: wait2
>             template: wait
>         - - name: hello3
>             template: print-message
>     - name: print-message
>       metadata:
>         labels:
>           applicationId: "{{workflow.name}}"
>           queue: "root.default"
>       schedulerName: yunikorn
>       container:
>         image: busybox
>         command: [echo]
>         args: ["============================="]
>     - name: wait
>       suspend:
>         duration: "28400ms"
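> Assuming a standard Argo CLI setup, the manifest above (saved as, say,
> steps-with-suspend.yaml, a hypothetical file name) can be submitted with:
>
>     argo submit -n argo --watch steps-with-suspend.yaml
>
> The pods it creates should carry the shared applicationId label and be
> scheduled by yunikorn, which is what exercises the reuse window.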
> * log file:
> [https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1769317449159249?thread_ts=1768967309.491029&cid=CLNUW68MU]
> * [~wilfreds]'s last reply:
> """
> We allow re-use 30 seconds after the last allocation was removed. Until that
> point in time the application still exists.
> This is what happens:
> * submit a pod -> creates an allocation request
> * an application is generated based on the application ID on the pod or the
> auto generation settings -> application is created
> * pods get submitted referencing the same ID: new allocation requests linked
> to that application ID, normal processing.
> * the application stays in a RUNNING state until all the allocation requests
> have been allocated and have finished
> * if no allocations (pending or running) are linked to the application it
> moves to a COMPLETING state. It can happen that a pod's exit triggers a new
> pod creation. We wait in the COMPLETING state for 30 seconds. Any new pod
> that arrives in that time frame will be tracked under the same application
> ID. A new application with the same ID cannot be created.
> * After the 30 seconds the application will be moved to COMPLETED:
> ** tell the shim the app is done
> ** remove the application from the queue
> ** move the app to the terminated list to allow reuse
> ** clean up state tracking to reduce the memory footprint
> The application ID is now free for re-use by a new application.
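> A simplified sketch (hypothetical names, single-threaded, not the actual
> core code) of the lifecycle just described; it deliberately ignores the
> cross-channel synchronization that this bug is about:
>
> package main
>
> import (
>     "fmt"
>     "time"
> )
>
> type state int
>
> const (
>     Running state = iota
>     Completing
>     Completed
> )
>
> const reuseGrace = 30 * time.Second
>
> type application struct {
>     st          state
>     allocations int
>     timer       *time.Timer
> }
>
> // allocationRemoved: with no allocations left, enter COMPLETING and arm the
> // 30-second grace timer that ends in COMPLETED.
> func (a *application) allocationRemoved() {
>     a.allocations--
>     if a.allocations == 0 {
>         a.st = Completing
>         a.timer = time.AfterFunc(reuseGrace, func() {
>             // tell the shim the app is done, remove it from the queue,
>             // move it to the terminated list, clean up state tracking
>             a.st = Completed
>         })
>     }
> }
>
> // allocationAdded: a new pod with the same applicationId inside the grace
> // window stops the timer and the application keeps running.
> func (a *application) allocationAdded() {
>     if a.st == Completing && a.timer.Stop() {
>         a.st = Running
>     }
>     a.allocations++
> }
>
> func main() {
>     app := &application{st: Running, allocations: 1}
>     app.allocationRemoved() // last pod finished -> COMPLETING
>     app.allocationAdded()   // next step's pod arrives within 30s
>     fmt.Println("back to RUNNING inside the grace window:", app.st == Running)
> }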
> What happens in your case is that the shim update and a new allocation cross
> each other. The core tells the shim about a change via one channel while the
> shim tells the core via another channel. Neither side does in-depth checks
> to prevent issues. The shim blindly removes the app without checking whether
> anything is still pending; if it did, the removal would fail. The core
> assumes nothing has changed since the timer was triggered and cleans up too;
> if it checked, it would also not do that.
> We can make it a bit more robust on the core side with a simple change that
> rejects the removal. We should really fix both sides, but that is more work
> which would need planning. The real solution is getting rid of the
> application state in the shim. A large undertaking…
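> One way the core-side guard suggested here might look (a hedged sketch with
> hypothetical names, not a patch): re-validate the application under the lock
> and reject the cleanup when work arrived after the timer fired.
>
> package main
>
> import (
>     "fmt"
>     "sync"
> )
>
> type application struct {
>     pending, allocated int
> }
>
> type core struct {
>     mu   sync.Mutex
>     apps map[string]*application
> }
>
> // tryRemoveApp rejects the removal if any allocation became pending or
> // running since the COMPLETING timer fired, instead of removing blindly.
> func (c *core) tryRemoveApp(id string) error {
>     c.mu.Lock()
>     defer c.mu.Unlock()
>     a, ok := c.apps[id]
>     if !ok {
>         return nil // already gone, nothing to do
>     }
>     if a.pending > 0 || a.allocated > 0 {
>         return fmt.Errorf("app %s still has work, removal rejected", id)
>     }
>     delete(c.apps, id)
>     return nil
> }
>
> func main() {
>     c := &core{apps: map[string]*application{"wf-1": {pending: 1}}}
>     // A pod for the next step slipped in while the removal was in flight.
>     fmt.Println(c.tryRemoveApp("wf-1"))
> }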
> Could you file a jira to track all of this?
> """