[
https://issues.apache.org/jira/browse/YUNIKORN-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
he zheng yu updated YUNIKORN-3218:
----------------------------------
Description:
* andrew's 1st message:
Hi team, we hit a concurrency issue.
We run workloads via Argo Workflows step by step, sequentially; all pods share the same
applicationId and are scheduled by YuniKorn.
When the workflow was slow and happened to create pods with a gap of roughly 30
seconds (the pod for the next step is created about 30s after the previous pod
finishes), the app-id was terminated and removed just as the next pod with the same
applicationId arrived, and then the same old app-id was added again. The concurrency
here is not safe: we observed orphan tasks/pods whose app-id had already been removed,
and those tasks/pods were stuck forever.
Could anyone help look into this issue?
* andrew's argowf to reproduce the issue:
@Po Han Huang Hi, the workflow below can reproduce it; I succeeded.
You can modify the wait duration: in my company environment I used 28s and reproduced
the issue, and on my laptop I used 28400ms and reproduced the issue.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-with-suspend-
  namespace: argo
spec:
  entrypoint: hello-and-wait
  templates:
  - name: hello-and-wait
    steps:
    - - name: hello1
        template: print-message
    - - name: wait1
        template: wait
    - - name: hello2
        template: print-message
    - - name: wait2
        template: wait
    - - name: hello3
        template: print-message
  - name: print-message
    metadata:
      labels:
        applicationId: "{{workflow.name}}"
        queue: "root.default"
    schedulerName: yunikorn
    container:
      image: busybox
      command: [echo]
      args: ["============================="]
  - name: wait
    suspend:
      duration: "28400ms"
* log file:
[https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1769317449159249?thread_ts=1768967309.491029&cid=CLNUW68MU]
* [~wilfreds]'s last reply:
We allow re-use 30 seconds after the last allocation was removed. Until that
point in time the application still exists.
This is what happens:
* submit a pod -> creates an allocation request
* an application is generated based on the application ID on the pod or the
auto generation settings -> application is created
* pods get submitted referencing the same ID: new allocation requests linked
to that application ID, normal processing.
* the application stays in a RUNNING state until all the allocation requests
have been allocated and have finished
* if no allocations (pending or running) are linked to the application, it moves
to a COMPLETING state. It can happen that a pod's exit triggers a new pod
creation, so we wait in the COMPLETING state for 30 seconds. Any new pod that
comes in within that time frame will be tracked under the same application ID. A
new application with the same ID cannot be created.
* After the 30 seconds the application will be moved to COMPLETED.
** tell the shim the app is done
** remove the application from the queue
** move app to terminated list to allow reuse
** clean up state tracking to reduce memory footprint
The application ID is now free for re-use by a new application.
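To make the lifecycle above concrete, here is a minimal Go sketch of the COMPLETING
window. This is not the yunikorn-core code; the type and method names (application,
removeAllocation, addAllocation, complete) are invented for illustration. It only shows
the intended behaviour: removing the last allocation starts the re-use timer, and a new
allocation arriving inside the window cancels the timer and keeps the same application
alive.

package main

import (
    "fmt"
    "sync"
    "time"
)

type state string

const (
    running    state = "RUNNING"
    completing state = "COMPLETING"
    completed  state = "COMPLETED"
)

// application is a stand-in for the core's application object.
type application struct {
    mu     sync.Mutex
    st     state
    allocs int
    timer  *time.Timer
}

// removeAllocation: when the last allocation is removed the app moves to
// COMPLETING and the re-use timer (30 seconds in the real scheduler) starts.
func (a *application) removeAllocation(window time.Duration) {
    a.mu.Lock()
    defer a.mu.Unlock()
    a.allocs--
    if a.allocs == 0 && a.st == running {
        a.st = completing
        a.timer = time.AfterFunc(window, a.complete)
    }
}

// addAllocation: a new pod with the same applicationId arrives; inside the
// COMPLETING window it cancels the timer and revives the application.
func (a *application) addAllocation() {
    a.mu.Lock()
    defer a.mu.Unlock()
    a.allocs++
    if a.st == completing {
        a.timer.Stop()
        a.st = running
    }
}

// complete: the timer fired with nothing linked, the app is COMPLETED and the
// application ID becomes free for re-use.
func (a *application) complete() {
    a.mu.Lock()
    defer a.mu.Unlock()
    if a.st == completing && a.allocs == 0 {
        a.st = completed
    }
}

func main() {
    app := &application{st: running, allocs: 1}
    app.removeAllocation(50 * time.Millisecond) // last pod of a step finished
    app.addAllocation()                         // next step's pod arrives in time
    fmt.Println(app.st)                         // RUNNING: same application kept alive
}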
What happens in your case is that the shim update and a new allocation cross
each other. The core tells the shim about a change via one channel while the
shim tells the core via another channel. Neither side does in-depth checks to
prevent issues. The shim blindly removes the app without checking whether
anything is still pending; if it did, that removal would fail. The core assumes
nothing has changed since the timer was triggered and cleans up too; if it
checked, it would not do that either.
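The crossing updates can be pictured with a small toy model. This is not the real shim
or core code; the names (shim, addTask, removeApp) are invented, and the sketch only
illustrates why a task can end up orphaned when the removal is applied without
re-checking for pending work.

package main

import (
    "fmt"
    "sync"
)

// shim is a toy stand-in for the K8s shim's application/task bookkeeping.
type shim struct {
    mu      sync.Mutex
    pending map[string][]string // appID -> tasks still waiting to be scheduled
}

// addTask tracks a new pod under an existing application ID.
func (s *shim) addTask(appID, task string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.pending[appID] = append(s.pending[appID], task)
}

// removeApp mimics the blind removal: any task added in the meantime is left
// behind with no application to schedule it, i.e. a stuck pod.
func (s *shim) removeApp(appID string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if orphans := s.pending[appID]; len(orphans) > 0 {
        fmt.Printf("app %s removed with pending tasks %v -> orphaned\n", appID, orphans)
    }
    delete(s.pending, appID)
}

func main() {
    s := &shim{pending: map[string][]string{"steps-with-suspend-x": {}}}

    coreToShim := make(chan string, 1) // core -> shim: app completed, remove it
    podEvents := make(chan string, 1)  // apiserver -> shim: next step's pod arrived

    // The two messages are produced independently and cross each other.
    go func() { coreToShim <- "steps-with-suspend-x" }() // the 30s timer fired
    go func() { podEvents <- "hello2" }()                // the next step's pod

    // In the unlucky interleaving the shim first tracks the new task and then
    // applies the removal, without either side re-checking the other's change.
    s.addTask("steps-with-suspend-x", <-podEvents)
    s.removeApp(<-coreToShim)
}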
We can make it a bit more robust on the core side with a simple change that
rejects the removal. We should really fix both sides, but that is more work and
would need planning. The real solution is getting rid of the application state
in the shim: a large undertaking…
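The thread does not spell out the "simple change" on the core side, so the following is
only a sketch of the kind of guard being described, with invented names (coreApp,
tryRemove, errStillActive): when the completion timer fires, re-check whether any
allocation is still linked and reject the removal instead of assuming nothing changed.

package main

import (
    "errors"
    "fmt"
    "sync"
)

var errStillActive = errors.New("application still has allocations, removal rejected")

// coreApp is an invented stand-in for the core-side application object.
type coreApp struct {
    mu     sync.Mutex
    allocs int // pending + running allocations linked to the application
}

func (a *coreApp) addAllocation() {
    a.mu.Lock()
    defer a.mu.Unlock()
    a.allocs++
}

// tryRemove is what the completion timer would call: instead of assuming
// nothing changed since the timer was armed, it re-checks and backs off.
func (a *coreApp) tryRemove() error {
    a.mu.Lock()
    defer a.mu.Unlock()
    if a.allocs > 0 {
        return errStillActive
    }
    // safe to complete: tell the shim, remove from the queue, free the ID
    return nil
}

func main() {
    app := &coreApp{}
    app.addAllocation() // the next step's pod sneaked in before the timer fired
    if err := app.tryRemove(); err != nil {
        fmt.Println(err) // removal rejected, the application keeps the new pod
    }
}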
Could you file a jira to track all of this?
> application-id reuse concurrency issue: remove-add race condition
> -------------------------------------------------------------------
>
> Key: YUNIKORN-3218
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3218
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: he zheng yu
> Assignee: Peter Bacsko
> Priority: Critical
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]