[
https://issues.apache.org/jira/browse/YUNIKORN-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
he zheng yu updated YUNIKORN-3218:
----------------------------------
Description:
* Andrew's first message:
Hi team, we had a concurrency issue.
We run workloads via Argo Workflows (argowf) step by step, sequentially; all pods share the
same applicationId and are scheduled by YuniKorn (YK).
When argowf was slow and happened to create pods at an interval of about 30
seconds (the pod for the next step is created ~30s after the previous pod finishes), the app-id
was terminated and removed just as the next pod with the same applicationId
arrived, and the same old app-id was added again. The concurrency here is not safe:
we observed orphan tasks/pods whose app-id had already been removed, and those
tasks/pods were stuck forever.
Could anyone help look into this issue?
* Andrew's argowf to reproduce the issue:
@Po Han Huang Hi, the workflow below can reproduce it; I succeeded.
You can modify the wait duration: in my company environment I used 28s and reproduced
the issue, and on my laptop I used 28400ms and reproduced it as well.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-with-suspend-
  namespace: argo
spec:
  entrypoint: hello-and-wait
  templates:
    - name: hello-and-wait
      steps:
        - - name: hello1
            template: print-message
        - - name: wait1
            template: wait
        - - name: hello2
            template: print-message
        - - name: wait2
            template: wait
        - - name: hello3
            template: print-message
    - name: print-message
      metadata:
        labels:
          applicationId: "{{workflow.name}}"
          queue: "root.default"
      schedulerName: yunikorn
      container:
        image: busybox
        command: [echo]
        args: ["============================="]
    - name: wait
      suspend:
        duration: "28400ms"
* log file:
[https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1769317449159249?thread_ts=1768967309.491029&cid=CLNUW68MU]
* [~wilfreds]'s last reply:
We allow re-use 30 seconds after the last allocation was removed. Until that
point in time the application still exists.
This is what happens:
* submit a pod -> creates an allocation request
* an application is generated based on the application ID on the pod or the
auto generation settings -> application is created
* pods get submitted referencing the same ID: new allocation requests are linked
to that application ID, normal processing.
* the application stays in a RUNNING state until all the allocation requests
have been allocated and have finished
* if no allocations (pending or running) are linked to the application it moves
to a COMPLETING state. It can happen that a pod's exit triggers a new pod
creation. We wait in COMPLETING state for 30 seconds. Any new pod that comes in
in that time frame will be tracked under the same application ID. A new
application with the same ID cannot be created.
* After the 30 seconds the application will be moved to COMPLETED:
** tell the shim the app is done
** remove the application from the queue
** move the app to the terminated list to allow reuse
** clean up state tracking to reduce memory footprint
The application ID is now free for re-use by a new application.
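A toy Go model of the lifecycle described above (illustrative only, not the actual YuniKorn core code; all type and function names are made up, and the 30-second window is shortened so the example finishes quickly):

package main

import (
    "fmt"
    "time"
)

type appState string

const (
    stateRunning    appState = "RUNNING"
    stateCompleting appState = "COMPLETING"
    stateCompleted  appState = "COMPLETED"
)

// completingWindow stands in for the real 30-second re-use window; shortened so
// the example finishes quickly.
const completingWindow = 2 * time.Second

type application struct {
    id          string
    state       appState
    allocations int
}

// addAllocation models a pod with this applicationId being submitted while the
// app is still tracked: it is simply linked to the existing application.
func (a *application) addAllocation() {
    a.allocations++
    a.state = stateRunning
    fmt.Printf("%s: %s with %d allocation(s)\n", a.id, a.state, a.allocations)
}

// removeAllocation models an allocation finishing. When the last one is gone the
// app enters COMPLETING, waits out the window, and only then completes: the shim
// is told the app is done and the ID becomes free for re-use.
func (a *application) removeAllocation() {
    a.allocations--
    if a.allocations > 0 {
        return
    }
    a.state = stateCompleting
    fmt.Printf("%s: %s, holding the ID for %v\n", a.id, a.state, completingWindow)
    time.Sleep(completingWindow) // the real core uses an asynchronous timer here
    if a.allocations == 0 {
        a.state = stateCompleted
        fmt.Printf("%s: %s, app removed from the queue, ID free for re-use\n", a.id, a.state)
    }
}

func main() {
    app := &application{id: "steps-with-suspend-example", state: stateRunning}
    app.addAllocation()    // the first step's pod is scheduled
    app.removeAllocation() // the pod finishes and nothing new arrives in the window
}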
What happens in your case is that the shim update and a new allocation cross
each other. The core tells the shim about a change via one channel while the
shim tells the core via another channel. Neither side does in-depth checks to
prevent issues. The shim blindly removes the app without checking whether
anything is still pending; if it did check, that removal would fail. The core
assumes nothing has changed since the timer was triggered and cleans up too; if
it checked, it would also not do that.
We can make it a bit more robust on the core side with a simple change that
causes the reject. We should really fix both sides, but that is more work which
would need planning. The real solution is getting rid of the application state
in the shim. A large undertaking…
Could you file a jira to track all of this?
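A minimal Go sketch of the kind of core-side guard suggested above, under the simplest possible assumptions: re-check for allocations under a lock before the completion timer finalizes the app, and reject a late allocation once the app is really gone. This is not the actual YuniKorn change; all names here are hypothetical.

package main

import (
    "fmt"
    "sync"
)

// application is a stand-in for the core-side application object; only the
// fields needed for the guard are modelled.
type application struct {
    sync.Mutex
    id          string
    completed   bool
    allocations int
}

// addAllocation models the shim submitting a new pod under the same
// applicationId. Once the application has been finalized the request is
// rejected instead of being attached to a dead app.
func (a *application) addAllocation() error {
    a.Lock()
    defer a.Unlock()
    if a.completed {
        return fmt.Errorf("application %s already completed, allocation rejected", a.id)
    }
    a.allocations++
    return nil
}

// finalize models the 30-second completion timer firing. The guard: if any
// allocation raced in since the timer was set, abort the cleanup and keep the
// application alive instead of removing it underneath the new pod.
func (a *application) finalize() {
    a.Lock()
    defer a.Unlock()
    if a.allocations > 0 {
        fmt.Printf("%s: %d allocation(s) raced in, completion aborted\n", a.id, a.allocations)
        return
    }
    a.completed = true
    fmt.Printf("%s: completed and removed, ID free for re-use\n", a.id)
}

func main() {
    app := &application{id: "steps-with-suspend-example"}

    var wg sync.WaitGroup
    wg.Add(2)

    // The completion timer fires on the core side...
    go func() {
        defer wg.Done()
        app.finalize()
    }()

    // ...just as the next step's pod arrives under the same applicationId.
    go func() {
        defer wg.Done()
        if err := app.addAllocation(); err != nil {
            fmt.Println(err)
        }
    }()

    wg.Wait()
}

With either interleaving, one side gets a definite answer: the late allocation is rejected (so the shim knows it must resubmit) or the completion is aborted and the application stays alive, instead of both sides proceeding and leaving an orphaned task/pod.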
was:
We are experiencing an issue where the YuniKorn Web UI continues to display
applications in the *New* state, even though these applications are no longer
present in the Kubernetes cluster. The list of such stale applications grows
over time while the scheduler is running, and is cleared only upon a scheduler
restart. In one instance, we observed this list growing to over 1200+ stale
applications.
This issue is reproducible even with the *1.6.3 build* running with the
*YUNIKORN-3084 patch* applied.
*Steps to Reproduce:*
# Create pods that fail immediately due to constraints (e.g., Kyverno policy
violations).
# Observe in the Web UI that applications remain in the New state even after
the pods are deleted from the cluster.
# Over time, the list of applications in the New state keeps growing.
# Restarting the scheduler resets the list, but the problem reappears as the
scheduler continues to run.
*Observations:*
* Applications remain in the *New* state in the Web UI, even after their
corresponding pods are deleted from the cluster.
* The problem appears to be related to the order and timing of create/delete
events received by the core.
* When a pod fails immediately (e.g., due to Kyverno policy violations), the
shim receives both create and delete requests, but the core does not create the
app in the partition context in time for the delete to be processed.
* The core eventually receives the create request, but the corresponding delete
had already arrived before it and could not be applied, so the application
remains in the New state indefinitely.
* The shim does not take any further action, leaving the application in this
stale state until a scheduler restart.
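A toy Go illustration of the ordering problem described in these observations (not YuniKorn code; names are made up): a delete processed before the corresponding create is silently dropped, so the later create leaves the application stuck in New with nothing left to remove it.

package main

import "fmt"

type appState string

const stateNew appState = "New"

// partitionContext is a stand-in for the core-side bookkeeping; it only tracks
// which applications exist and in which state.
type partitionContext struct {
    apps map[string]appState
}

func (pc *partitionContext) handleCreate(appID string) {
    if _, ok := pc.apps[appID]; !ok {
        pc.apps[appID] = stateNew
        fmt.Printf("create: %s added in state %s\n", appID, stateNew)
    }
}

func (pc *partitionContext) handleDelete(appID string) {
    if _, ok := pc.apps[appID]; !ok {
        // The app is unknown, so the delete is a no-op and nothing is remembered.
        fmt.Printf("delete: %s unknown, dropped\n", appID)
        return
    }
    delete(pc.apps, appID)
    fmt.Printf("delete: %s removed\n", appID)
}

func main() {
    pc := &partitionContext{apps: map[string]appState{}}

    // A pod fails immediately, so create and delete arrive almost together,
    // but the delete happens to be processed first.
    pc.handleDelete("app-1") // dropped: the app does not exist yet
    pc.handleCreate("app-1") // the app now exists in New...

    // ...and stays there: no further delete will ever arrive for it.
    fmt.Printf("stale apps: %v\n", pc.apps)
}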
> application-id reusing concurrency issue: remove-add race condition
> -------------------------------------------------------------------
>
> Key: YUNIKORN-3218
> URL: https://issues.apache.org/jira/browse/YUNIKORN-3218
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: he zheng yu
> Assignee: Peter Bacsko
> Priority: Critical
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)