haorenhui created YUNIKORN-2989:
-----------------------------------

             Summary: Orphan task generated during application state switch
                 Key: YUNIKORN-2989
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2989
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler, shim - kubernetes
    Affects Versions: 1.3.0
         Environment: yunikorn 1.3.0
            Reporter: haorenhui


I encountered a state transition issue during testing. I have two pods that 
use the same applicationId, and pod2 is submitted some time after pod1 has 
finished running. When pod1 finishes, the app it belongs to (flow1) 
transitions to the Completing state and waits 30 seconds (the default) before 
moving to the Completed state. When I then submit pod2, it may remain in the 
Pending state. The yunikorn scheduler log below shows what happened: pod2 was 
added and its state changed to Pending, and the app was deleted at almost the 
same moment.
{panel:title=scheduler log}
2024-11-25T12:07:43.137Z INFO objects/application_state.go:133 Application 
state transition {"appID": "flow1", "source": "Running", "destination": 
"Completing", "event": "completeApplication"}
2024-11-25T12:07:43.137Z INFO scheduler/partition.go:1249 removing allocation 
from application 
{"appID":"flow1","allocationId":"a2e95dfb-3646-42b8-b774-6b2abe84bae6","terminationType":
 "STOPPED_BY_RM"}
2024-11-25T12:08:12.903Z INFO cache/context.go:871 task added 
{"appID":"flow1","taskID":"16f461e9-a856-428e-a27b-e67d9105da69", "taskState": 
"New"}
2024-11-25T12:08:13.137Z INFO objects/application.go:127 YK_APP_SUMMARY: 
{"appID": 
"flow1","submissionTime":1732533620209,"startTime":1732533622247,"finishTime": 
1732536493137,"user":"admin","queue":"root.default","state": 
"Completed","rmID":"cluster","resourceUsage": 
{"UNKNOWN":"ephemeral-storage":1009530880000000,"memory":1034360822300672,"pods":98587,"vcore":2221462721}}
2024-11-25T12:08:13.137Z INFO scheduler/partition.go:1418 Removing terminated 
application from the application list {"appID": "flow1","app 
status":"Completed"}
2024-11-25T12:08:13.137Z INFO objects/application_state.go:133 Application 
state transition
Unknown macro: \{"appID"}
2024-11-25T12:08:13.143Z INFO cache/task_state.go:380 Task state transition 
{"app":"flow1","task":"16f461e9-a856-428e-a27b-e67d91ca69","taskAlias":"test/pod2",
 "source": "New","destination":"Pending","event": "InitTask"}
{panel}
 * 16f461e9-a856-428e-a27b-e67d91ca69 stays unscheduled until I manually 
delete it.
 * The issue does not always occur; it needs specific triggering conditions. 
There are many pods and apps in my environment, and the problem shows up only 
occasionally. If pod2 is submitted within 30 seconds, the app switches back to 
Running; if it is submitted after 30 seconds, the app is rebuilt.
 * I suspect that while the task is being added, the application in the core 
starts changing to Completed, which triggers the callback that clears its 
resources. Once the app that the task belongs to has been cleared, the task 
becomes an orphan and is never scheduled by anyone.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@yunikorn.apache.org
For additional commands, e-mail: dev-h...@yunikorn.apache.org
