haorenhui created YUNIKORN-2989:
-----------------------------------

             Summary: Orphan task generated during application state switch
                 Key: YUNIKORN-2989
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2989
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler, shim - kubernetes
    Affects Versions: 1.3.0
         Environment: yunikorn 1.3.0
            Reporter: haorenhui

I ran into a state transition issue during testing. Two pods use the same applicationId, and pod2 is submitted some time after pod1 finishes running. When pod1 finishes, the application it belongs to (flow1) moves to the Completing state and waits 30 seconds (the default) before moving to the Completed state. If pod2 is submitted around that moment, it may stay in the Pending state indefinitely. The YuniKorn scheduler log below shows what happened: pod2's task was added, and the app was removed within milliseconds of the task moving to Pending.

{panel:title=scheduler log}
2024-11-25T12:07:43.137Z INFO objects/application_state.go:133 Application state transition {"appID": "flow1", "source": "Running", "destination": "Completing", "event": "completeApplication"}
2024-11-25T12:07:43.137Z INFO scheduler/partition.go:1249 removing allocation from application {"appID": "flow1", "allocationId": "a2e95dfb-3646-42b8-b774-6b2abe84bae6", "terminationType": "STOPPED_BY_RM"}
2024-11-25T12:08:12.903Z INFO cache/context.go:871 task added {"appID": "flow1", "taskID": "16f461e9-a856-428e-a27b-e67d9105da69", "taskState": "New"}
2024-11-25T12:08:13.137Z INFO objects/application.go:127 YK_APP_SUMMARY: {"appID": "flow1", "submissionTime": 1732533620209, "startTime": 1732533622247, "finishTime": 1732536493137, "user": "admin", "queue": "root.default", "state": "Completed", "rmID": "cluster", "resourceUsage": {"UNKNOWN": {"ephemeral-storage": 1009530880000000, "memory": 1034360822300672, "pods": 98587, "vcore": 2221462721}}}
2024-11-25T12:08:13.137Z INFO scheduler/partition.go:1418 Removing terminated application from the application list {"appID": "flow1", "app status": "Completed"}
2024-11-25T12:08:13.137Z INFO objects/application_state.go:133 Application state transition Unknown macro: \{"appID"}
2024-11-25T12:08:13.143Z INFO cache/task_state.go:380 Task state transition {"app": "flow1", "task": "16f461e9-a856-428e-a27b-e67d9105da69", "taskAlias": "test/pod2", "source": "New", "destination": "Pending", "event": "InitTask"}
{panel}

* Task 16f461e9-a856-428e-a27b-e67d9105da69 is never scheduled; it stays Pending until I delete it manually.
* The issue does not always occur and needs specific timing. My environment has many pods and apps, and the problem shows up only occasionally. If pod2 is submitted within the 30 seconds, the app switches back to Running; if it is submitted after the 30 seconds, the app is recreated.
* My guess is that while the task is being added, the application in the core moves to Completed, which triggers the callback that starts cleaning up resources. Once the app that the task belongs to has been removed, the task becomes an orphan and is never scheduled by anyone.