[
https://issues.apache.org/jira/browse/YUNIKORN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885290#comment-17885290
]
Qi Zhu commented on YUNIKORN-2884:
----------------------------------
Application should also release task fail case, pod should be deleted, because
TerminationType_STOPPED_BY_RM will also be called when task fail.
If we don't delete the pod, the pod will be stuck in pending state.
After applied this PR, i will add a new Jira for e2e testing, the test passed
locally.
And in future, i am working on retry policy when task failing, we will ajust
this logic to pod pending and deallocate in core side, but until now, we need
to make sure the task failed should be cleaned for applications.
> Task fail with post allocated but the pod will keep pending
> -----------------------------------------------------------
>
> Key: YUNIKORN-2884
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2884
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: shim - kubernetes
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Major
> Labels: pull-request-available
>
> We will fail task post allocated, but we don't update the pod to terminal
> state.
> For example we bind pod volume failed post allocated, the pod will not go to
> terminal state, it will fail:
> Pod event:
> {code:java}
> Events:
> Type Reason Age From Message
> ---- ------ ---- ---- -------
> Normal Scheduling 30s yunikorn dev-nnjxy/pod-btv0y is
> queued and waiting for allocation
> Normal Scheduled 30s yunikorn Successfully assigned
> dev-nnjxy/pod-btv0y to node yktest-worker
> Warning PodVolumesBindFailure 20s yunikorn bind volumes to pod failed,
> name: dev-nnjxy/pod-btv0y, binding volumes: context deadline exceeded
> Normal TaskFailed 20s yunikorn Task dev-nnjxy/pod-btv0y is
> failed{code}
> Pod pending not going to terminal state
> {code:java}
> 2024-09-20T11:22:27.601Z INFO shim.fsm cache/task_state.go:381
> Task state transition {"app": "yunikorn-dev-03c96-autogen", "task":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h",
> "source": "Scheduling", "destination": "Allocated", "event": "TaskAllocated"}
> 2024-09-20T11:22:37.606Z DEBUG shim.cache.task cache/task.go:499
> prepare to send release request {"applicationID":
> "yunikorn-dev-03c96-autogen", "taskID":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h",
> "allocationKey": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "task": "Allocated",
> "terminationType": ""}
> 2024-09-20T11:22:37.606Z DEBUG core.scheduler
> scheduler/scheduler.go:117 enqueued event {"eventType":
> "*rmevent.RMUpdateAllocationEvent", "event":
> {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"yunikorn-dev-03c96-autogen","terminationType":1,"message":"task
>
> completed","allocationKey":"6f3dd7fa-72b4-40cf-a700-43e51394a06b"}]},"rmID":"mycluster"}},
> "currentQueueSize": 0}
> 2024-09-20T11:22:37.606Z ERROR shim.cache.task cache/task.go:475
> task failed {"appID": "yunikorn-dev-03c96-autogen", "taskID":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "reason": "bind volumes to pod
> failed, name: dev-03c96/pod-bgg9h, binding volumes: context deadline
> exceeded"}
> 2024-09-20T11:22:37.606Z INFO shim.fsm cache/task_state.go:381
> Task state transition {"app": "yunikorn-dev-03c96-autogen", "task":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h",
> "source": "Allocated", "destination": "Failed", "event": "TaskFail"}
> 2024-09-20T11:22:37.606Z INFO core.scheduler.partition
> scheduler/partition.go:1359 removing allocation from application
> {"appID": "yunikorn-dev-03c96-autogen", "allocationKey":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "terminationType": "STOPPED_BY_RM"}
> 2024-09-20T11:22:37.606Z DEBUG core.scheduler.ugm ugm/manager.go:132
> Decreasing resource usage {"user": "kubernetes-admin", "queue path":
> "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", "resource":
> "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.606Z DEBUG core.scheduler.ugm ugm/manager.go:152
> Decreasing resource usage for user {"user": "kubernetes-admin", "queue
> path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen",
> "group": "", "resource": "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.606Z DEBUG core.scheduler.ugm
> ugm/queue_tracker.go:132 Decreasing resource usage {"queue path":
> "root", "hierarchy": ["root", "dev-03c96"], "application":
> "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.607Z DEBUG core.scheduler.ugm
> ugm/queue_tracker.go:132 Decreasing resource usage {"queue path":
> "root.dev-03c96", "hierarchy": ["dev-03c96"], "application":
> "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.607Z DEBUG core.scheduler.ugm
> ugm/queue_tracker.go:159 Removed application from running applications
> {"application": "yunikorn-dev-03c96-autogen", "queue path": "root.dev-03c96",
> "queue name": "dev-03c96"}
> 2024-09-20T11:22:37.608Z DEBUG core.scheduler.ugm
> ugm/queue_tracker.go:165 Successfully decreased resource usage {"queue
> path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen",
> "resource": "map[pods:1]", "total resource after decreasing": "map[]", "total
> applications after decreasing": 0}
> 2024-09-20T11:22:37.608Z DEBUG core.scheduler.ugm
> ugm/queue_tracker.go:159 Removed application from running applications
> {"application": "yunikorn-dev-03c96-autogen", "queue path": "root", "queue
> name": "root"}
> 2024-09-20T11:22:37.608Z DEBUG core.scheduler.ugm
> ugm/queue_tracker.go:165 Successfully decreased resource usage {"queue
> path": "root", "application": "yunikorn-dev-03c96-autogen", "resource":
> "map[pods:1]", "total resource after decreasing": "map[]", "total
> applications after decreasing": 0}
> 2024-09-20T11:22:37.608Z DEBUG core.scheduler.application
> objects/application.go:336 Application state timer initiated {"appID":
> "yunikorn-dev-03c96-autogen", "state": "Completing", "timeout": "30s"}
> 2024-09-20T11:22:37.608Z INFO core.scheduler.fsm
> objects/application_state.go:147 Application state transition {"appID":
> "yunikorn-dev-03c96-autogen", "source": "Running", "destination":
> "Completing", "event": "completeApplication"}
> 2024-09-20T11:22:37.608Z DEBUG core.rmproxy rmproxy/rmproxy.go:60
> enqueue event {"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"yunikorn-dev-03c96-autogen","state":"Completing","stateTransitionTimestamp":1726831357608331511,"message":"completeApplication"}]},
> "currentQueueSize": 0}
> 2024-09-20T11:22:37.608Z INFO core.scheduler.application
> objects/application.go:615 ask removed successfully from application
> {"appID": "yunikorn-dev-03c96-autogen", "ask":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "pendingDelta": "nil resource"}
> 2024-09-20T11:22:37.608Z DEBUG core.rmproxy rmproxy/rmproxy.go:60
> enqueue event {"eventType": "*rmevent.RMReleaseAllocationEvent", "event":
> {"RmID":"mycluster","ReleasedAllocations":[{"partitionName":"[mycluster]default","applicationID":"yunikorn-dev-03c96-autogen","terminationType":1,"message":"allocation
> remove as per RM
> request","allocationKey":"6f3dd7fa-72b4-40cf-a700-43e51394a06b"}]},
> "currentQueueSize": 1}
> 2024-09-20T11:22:37.608Z DEBUG shim.rmcallback
> cache/scheduler_callback.go:108 UpdateApplication callback received
> {"UpdateApplicationResponse":
> "updated:{applicationID:\"yunikorn-dev-03c96-autogen\" state:\"Completing\"
> stateTransitionTimestamp:1726831357608331511
> message:\"completeApplication\"}"}
> 2024-09-20T11:22:37.608Z DEBUG shim.rmcallback
> cache/scheduler_callback.go:137 status update callback received
> {"appId": "yunikorn-dev-03c96-autogen", "new status": "Completing"}
> 2024-09-20T11:22:37.608Z DEBUG shim.rmcallback
> cache/scheduler_callback.go:47 UpdateAllocation callback received
> {"UpdateAllocationResponse": "released:{partitionName:\"[mycluster]default\"
> applicationID:\"yunikorn-dev-03c96-autogen\" terminationType:STOPPED_BY_RM
> message:\"allocation remove as per RM request\"
> allocationKey:\"6f3dd7fa-72b4-40cf-a700-43e51394a06b\"}"}
> 2024-09-20T11:22:38.605Z INFO shim.cache.application
> cache/application.go:239 task removed {"appID":
> "yunikorn-dev-03c96-autogen", "taskID":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b"}
> 2024-09-20T11:23:07.607Z DEBUG core.scheduler.application
> objects/application.go:352 Application state: auto progress
> {"applicationID": "yunikorn-dev-03c96-autogen", "state": "Completing"}
> 2024-09-20T11:23:07.607Z DEBUG core.scheduler.application
> objects/application.go:384 Application state timer cleared {"appID":
> "yunikorn-dev-03c96-autogen", "state": "Completing"}
> 2024-09-20T11:23:07.607Z DEBUG core.scheduler.application
> objects/application.go:336 Application state timer initiated {"appID":
> "yunikorn-dev-03c96-autogen", "state": "Completed", "timeout": "72h0m0s"}
> 2024-09-20T11:23:07.607Z INFO core.scheduler.fsm
> objects/application_state.go:147 Application state transition {"appID":
> "yunikorn-dev-03c96-autogen", "source": "Completing", "destination":
> "Completed", "event": "completeApplication"}
> 2024-09-20T11:23:07.607Z DEBUG core.rmproxy rmproxy/rmproxy.go:60
> enqueue event {"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"yunikorn-dev-03c96-autogen","state":"Completed","stateTransitionTimestamp":1726831387607633721,"message":"completeApplication"}]},
> "currentQueueSize": 0}
> 2024-09-20T11:23:07.607Z DEBUG shim.rmcallback
> cache/scheduler_callback.go:108 UpdateApplication callback received
> {"UpdateApplicationResponse":
> "updated:{applicationID:\"yunikorn-dev-03c96-autogen\" state:\"Completed\"
> stateTransitionTimestamp:1726831387607633721
> message:\"completeApplication\"}"}
> 2024-09-20T11:23:07.607Z DEBUG shim.rmcallback
> cache/scheduler_callback.go:137 status update callback received
> {"appId": "yunikorn-dev-03c96-autogen", "new status": "Completed"}
> 2024-09-20T11:23:07.607Z INFO core.scheduler.queue
> objects/queue.go:830 Application completed and removed from queue
> {"queueName": "root.dev-03c96", "applicationID": "yunikorn-dev-03c96-autogen"}
> 2024-09-20T11:23:07.607Z INFO core.scheduler.partition
> scheduler/partition.go:1539 Removing terminated application from the
> application list {"appID": "yunikorn-dev-03c96-autogen", "app status":
> "Completed"}
> 2024-09-20T11:23:07.607Z INFO core.scheduler.application.usage
> objects/application_summary.go:60 YK_APP_SUMMARY: {ApplicationID:
> yunikorn-dev-03c96-autogen, SubmissionTime: 1726831345581, StartTime:
> 1726831347597, FinishTime: 1726831387607, User: kubernetes-admin, Queue:
> root.dev-03c96, State: Completed, RmID: mycluster, ResourceUsage:
> TrackedResource{UNKNOWN:pods=10}, PreemptedResource: TrackedResource{},
> PlaceholderResource: TrackedResource{}} {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]