haorenhui created YUNIKORN-2989:
-----------------------------------
Summary: Orphan task generated during application state switch
Key: YUNIKORN-2989
URL: https://issues.apache.org/jira/browse/YUNIKORN-2989
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler, shim - kubernetes
Affects Versions: 1.3.0
Environment: yunikorn 1.3.0
Reporter: haorenhui
I encountered a state transition issue during testing. I have two pods that
use the same applicationId, and pod2 is submitted some time after pod1
finishes. When pod1 finishes, the app it belongs to (flow1) transitions to the
Completing state and waits 30 seconds (the default) before moving to the
Completed state. If pod2 is submitted around the end of that window, it may
remain in the Pending state. The yunikorn scheduler log below shows that the
app was removed while pod2 was being added and moved to Pending.
{panel:title=scheduler log}
2024-11-25T12:07:43.137Z INFO objects/application_state.go:133 Application
state transition {"appID": "flow1", "source": "Running", "destination":
"Completing", "event": "completeApplication"}
2024-11-25T12:07:43.137Z INFO scheduler/partition.go:1249 removing allocation
from application
{"appID":"flow1","allocationId":"a2e95dfb-3646-42b8-b774-6b2abe84bae6","terminationType":
"STOPPED_BY_RM"}
2024-11-25T12:08:12.903Z INFO cache/context.go:871 task added
{"appID":"flow1","taskID":"16f461e9-a856-428e-a27b-e67d9105da69", "taskState":
"New"}
2024-11-25T12:08:13.137Z INFO objects/application.go:127 YK_APP_SUMMARY:
{"appID":
"flow1","submissionTime":1732533620209,"startTime":1732533622247,"finishTime":
1732536493137,"user":"admin","queue":"root.default","state":
"Completed","rmID":"cluster","resourceUsage":
{"UNKNOWN":{"ephemeral-storage":1009530880000000,"memory":1034360822300672,"pods":98587,"vcore":2221462721}}}
2024-11-25T12:08:13.137Z INFO scheduler/partition.go:1418 Removing terminated
application from the application list {"appID": "flow1","app
status":"Completed"}
2024-11-25T12:08:13.137Z INFO objects/application_state.go:133 Application
state transition
Unknown macro: \{"appID"}
2024-11-25T12:08:13.143Z INFO cache/task_state.go:380 Task state transition
{"app":"flow1","task":"16f461e9-a856-428e-a27b-e67d91ca69","taskAlias":"test/pod2",
"source": "New","destination":"Pending","event": "InitTask"}
{panel}
* 16f461e9-a856-428e-a27b-e67d91ca69 is not scheduled until I manually
delete it.
* The issue does not always occur and needs specific triggering conditions.
There are many pods and apps in my environment, and the problem appears only
occasionally. Normally, if pod2 is submitted within the 30 seconds, the app
switches back to Running; if it is submitted after the 30 seconds, the app is
rebuilt.
* My guess is that the task is added just as the application in the core
changes to Completed, which triggers the callback that starts cleaning up
resources. Once the app that the task belongs to has been cleared, the task
becomes an orphan and is never scheduled.
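
The suspected interleaving can be illustrated with a small, deterministic toy
model in Go. This is only a sketch of the hypothesis, not YuniKorn code: the
types and methods (App, Shim, AddTask, CompleteTask, OnAppCompleted, Orphans)
are hypothetical names invented for this example, and the real shim and core
are considerably more involved.

```go
package main

import (
	"fmt"
	"sort"
)

// App is a hypothetical, simplified stand-in for a cached application object.
type App struct {
	ID string
}

// Shim is a toy model of a shim-side cache: the apps it currently knows
// about, and the app object each live task was attached to when added.
type Shim struct {
	apps  map[string]*App
	tasks map[string]*App
}

func NewShim() *Shim {
	return &Shim{apps: map[string]*App{}, tasks: map[string]*App{}}
}

// AddTask attaches a task to the cached app, creating the app when it is
// absent (the "rebuilt" case described in the report).
func (s *Shim) AddTask(appID, taskID string) {
	app, ok := s.apps[appID]
	if !ok {
		app = &App{ID: appID}
		s.apps[appID] = app
	}
	s.tasks[taskID] = app
}

// CompleteTask forgets a task that has finished running.
func (s *Shim) CompleteTask(taskID string) {
	delete(s.tasks, taskID)
}

// OnAppCompleted models the callback assumed to fire when the core moves the
// app from Completing to Completed: the app is dropped from the cache.
func (s *Shim) OnAppCompleted(appID string) {
	delete(s.apps, appID)
}

// Orphans lists tasks whose app object is no longer the one in the cache.
func (s *Shim) Orphans() []string {
	var out []string
	for taskID, app := range s.tasks {
		if s.apps[app.ID] != app {
			out = append(out, taskID)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Suspected race: pod2 is added just before the completing timeout fires.
	racy := NewShim()
	racy.AddTask("flow1", "pod1")
	racy.CompleteTask("pod1")
	racy.AddTask("flow1", "pod2") // compare "task added" at 12:08:12.903
	racy.OnAppCompleted("flow1")  // compare app removal at 12:08:13.137
	fmt.Println("race:", racy.Orphans())

	// Pod2 submitted after the app is already gone: the app is rebuilt.
	late := NewShim()
	late.AddTask("flow1", "pod1")
	late.CompleteTask("pod1")
	late.OnAppCompleted("flow1")
	late.AddTask("flow1", "pod2")
	fmt.Println("late:", late.Orphans())
}
```

In this model the first sequence leaves pod2 attached to an app object that is
no longer in the cache, while the second attaches it to a freshly created app.
Whether the real code paths interleave this way would need to be confirmed
against the shim and core sources.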
--
This message was sent by Atlassian Jira
(v8.20.10#820010)