[
https://issues.apache.org/jira/browse/YUNIKORN-2884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Qi Zhu updated YUNIKORN-2884:
-----------------------------
Parent: YUNIKORN-2804
Issue Type: Sub-task (was: Bug)
> Task fails after allocation but the pod stays pending
> -----------------------------------------------------
>
> Key: YUNIKORN-2884
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2884
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: shim - kubernetes
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Major
>
> When a task fails after it has been allocated, we fail the task but do not
> update the pod to a terminal state.
> For example, if binding the pod volumes fails after allocation, the task is
> marked as failed but the pod never reaches a terminal state.
> Pod events:
> {code:java}
> Events:
>   Type     Reason                 Age  From      Message
>   ----     ------                 ---- ----      -------
>   Normal   Scheduling             30s  yunikorn  dev-nnjxy/pod-btv0y is queued and waiting for allocation
>   Normal   Scheduled              30s  yunikorn  Successfully assigned dev-nnjxy/pod-btv0y to node yktest-worker
>   Warning  PodVolumesBindFailure  20s  yunikorn  bind volumes to pod failed, name: dev-nnjxy/pod-btv0y, binding volumes: context deadline exceeded
>   Normal   TaskFailed             20s  yunikorn  Task dev-nnjxy/pod-btv0y is failed
> {code}
> Scheduler log: the task fails, but the pod stays pending and never reaches a terminal state:
> {code:java}
> 2024-09-20T11:22:27.601Z INFO shim.fsm cache/task_state.go:381
> Task state transition {"app": "yunikorn-dev-03c96-autogen", "task":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h",
> "source": "Scheduling", "destination": "Allocated", "event": "TaskAllocated"}
> 2024-09-20T11:22:37.606Z DEBUG shim.cache.task cache/task.go:499
> prepare to send release request {"applicationID":
> "yunikorn-dev-03c96-autogen", "taskID":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h",
> "allocationKey": "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "task": "Allocated",
> "terminationType": ""}
> 2024-09-20T11:22:37.606Z DEBUG core.scheduler
> scheduler/scheduler.go:117 enqueued event {"eventType":
> "*rmevent.RMUpdateAllocationEvent", "event":
> {"Request":{"releases":{"allocationsToRelease":[{"partitionName":"[mycluster]default","applicationID":"yunikorn-dev-03c96-autogen","terminationType":1,"message":"task
> completed","allocationKey":"6f3dd7fa-72b4-40cf-a700-43e51394a06b"}]},"rmID":"mycluster"}},
> "currentQueueSize": 0}
> 2024-09-20T11:22:37.606Z ERROR shim.cache.task cache/task.go:475
> task failed {"appID": "yunikorn-dev-03c96-autogen", "taskID":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "reason": "bind volumes to pod
> failed, name: dev-03c96/pod-bgg9h, binding volumes: context deadline
> exceeded"}
> 2024-09-20T11:22:37.606Z INFO shim.fsm cache/task_state.go:381
> Task state transition {"app": "yunikorn-dev-03c96-autogen", "task":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "taskAlias": "dev-03c96/pod-bgg9h",
> "source": "Allocated", "destination": "Failed", "event": "TaskFail"}
> 2024-09-20T11:22:37.606Z INFO core.scheduler.partition
> scheduler/partition.go:1359 removing allocation from application
> {"appID": "yunikorn-dev-03c96-autogen", "allocationKey":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "terminationType": "STOPPED_BY_RM"}
> 2024-09-20T11:22:37.606Z DEBUG core.scheduler.ugm ugm/manager.go:132
> Decreasing resource usage {"user": "kubernetes-admin", "queue path":
> "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen", "resource":
> "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.606Z DEBUG core.scheduler.ugm ugm/manager.go:152
> Decreasing resource usage for user {"user": "kubernetes-admin", "queue
> path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen",
> "group": "", "resource": "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.606Z DEBUG core.scheduler.ugm
> ugm/queue_tracker.go:132 Decreasing resource usage {"queue path":
> "root", "hierarchy": ["root", "dev-03c96"], "application":
> "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.607Z DEBUG core.scheduler.ugm
> ugm/queue_tracker.go:132 Decreasing resource usage {"queue path":
> "root.dev-03c96", "hierarchy": ["dev-03c96"], "application":
> "yunikorn-dev-03c96-autogen", "resource": "map[pods:1]", "removeApp": true}
> 2024-09-20T11:22:37.607Z DEBUG core.scheduler.ugm
> ugm/queue_tracker.go:159 Removed application from running applications
> {"application": "yunikorn-dev-03c96-autogen", "queue path": "root.dev-03c96",
> "queue name": "dev-03c96"}
> 2024-09-20T11:22:37.608Z DEBUG core.scheduler.ugm
> ugm/queue_tracker.go:165 Successfully decreased resource usage {"queue
> path": "root.dev-03c96", "application": "yunikorn-dev-03c96-autogen",
> "resource": "map[pods:1]", "total resource after decreasing": "map[]", "total
> applications after decreasing": 0}
> 2024-09-20T11:22:37.608Z DEBUG core.scheduler.ugm
> ugm/queue_tracker.go:159 Removed application from running applications
> {"application": "yunikorn-dev-03c96-autogen", "queue path": "root", "queue
> name": "root"}
> 2024-09-20T11:22:37.608Z DEBUG core.scheduler.ugm
> ugm/queue_tracker.go:165 Successfully decreased resource usage {"queue
> path": "root", "application": "yunikorn-dev-03c96-autogen", "resource":
> "map[pods:1]", "total resource after decreasing": "map[]", "total
> applications after decreasing": 0}
> 2024-09-20T11:22:37.608Z DEBUG core.scheduler.application
> objects/application.go:336 Application state timer initiated {"appID":
> "yunikorn-dev-03c96-autogen", "state": "Completing", "timeout": "30s"}
> 2024-09-20T11:22:37.608Z INFO core.scheduler.fsm
> objects/application_state.go:147 Application state transition {"appID":
> "yunikorn-dev-03c96-autogen", "source": "Running", "destination":
> "Completing", "event": "completeApplication"}
> 2024-09-20T11:22:37.608Z DEBUG core.rmproxy rmproxy/rmproxy.go:60
> enqueue event {"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"yunikorn-dev-03c96-autogen","state":"Completing","stateTransitionTimestamp":1726831357608331511,"message":"completeApplication"}]},
> "currentQueueSize": 0}
> 2024-09-20T11:22:37.608Z INFO core.scheduler.application
> objects/application.go:615 ask removed successfully from application
> {"appID": "yunikorn-dev-03c96-autogen", "ask":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b", "pendingDelta": "nil resource"}
> 2024-09-20T11:22:37.608Z DEBUG core.rmproxy rmproxy/rmproxy.go:60
> enqueue event {"eventType": "*rmevent.RMReleaseAllocationEvent", "event":
> {"RmID":"mycluster","ReleasedAllocations":[{"partitionName":"[mycluster]default","applicationID":"yunikorn-dev-03c96-autogen","terminationType":1,"message":"allocation
> remove as per RM
> request","allocationKey":"6f3dd7fa-72b4-40cf-a700-43e51394a06b"}]},
> "currentQueueSize": 1}
> 2024-09-20T11:22:37.608Z DEBUG shim.rmcallback
> cache/scheduler_callback.go:108 UpdateApplication callback received
> {"UpdateApplicationResponse":
> "updated:{applicationID:\"yunikorn-dev-03c96-autogen\" state:\"Completing\"
> stateTransitionTimestamp:1726831357608331511
> message:\"completeApplication\"}"}
> 2024-09-20T11:22:37.608Z DEBUG shim.rmcallback
> cache/scheduler_callback.go:137 status update callback received
> {"appId": "yunikorn-dev-03c96-autogen", "new status": "Completing"}
> 2024-09-20T11:22:37.608Z DEBUG shim.rmcallback
> cache/scheduler_callback.go:47 UpdateAllocation callback received
> {"UpdateAllocationResponse": "released:{partitionName:\"[mycluster]default\"
> applicationID:\"yunikorn-dev-03c96-autogen\" terminationType:STOPPED_BY_RM
> message:\"allocation remove as per RM request\"
> allocationKey:\"6f3dd7fa-72b4-40cf-a700-43e51394a06b\"}"}
> 2024-09-20T11:22:38.605Z INFO shim.cache.application
> cache/application.go:239 task removed {"appID":
> "yunikorn-dev-03c96-autogen", "taskID":
> "6f3dd7fa-72b4-40cf-a700-43e51394a06b"}
> 2024-09-20T11:23:07.607Z DEBUG core.scheduler.application
> objects/application.go:352 Application state: auto progress
> {"applicationID": "yunikorn-dev-03c96-autogen", "state": "Completing"}
> 2024-09-20T11:23:07.607Z DEBUG core.scheduler.application
> objects/application.go:384 Application state timer cleared {"appID":
> "yunikorn-dev-03c96-autogen", "state": "Completing"}
> 2024-09-20T11:23:07.607Z DEBUG core.scheduler.application
> objects/application.go:336 Application state timer initiated {"appID":
> "yunikorn-dev-03c96-autogen", "state": "Completed", "timeout": "72h0m0s"}
> 2024-09-20T11:23:07.607Z INFO core.scheduler.fsm
> objects/application_state.go:147 Application state transition {"appID":
> "yunikorn-dev-03c96-autogen", "source": "Completing", "destination":
> "Completed", "event": "completeApplication"}
> 2024-09-20T11:23:07.607Z DEBUG core.rmproxy rmproxy/rmproxy.go:60
> enqueue event {"eventType": "*rmevent.RMApplicationUpdateEvent", "event":
> {"RmID":"mycluster","AcceptedApplications":[],"RejectedApplications":[],"UpdatedApplications":[{"applicationID":"yunikorn-dev-03c96-autogen","state":"Completed","stateTransitionTimestamp":1726831387607633721,"message":"completeApplication"}]},
> "currentQueueSize": 0}
> 2024-09-20T11:23:07.607Z DEBUG shim.rmcallback
> cache/scheduler_callback.go:108 UpdateApplication callback received
> {"UpdateApplicationResponse":
> "updated:{applicationID:\"yunikorn-dev-03c96-autogen\" state:\"Completed\"
> stateTransitionTimestamp:1726831387607633721
> message:\"completeApplication\"}"}
> 2024-09-20T11:23:07.607Z DEBUG shim.rmcallback
> cache/scheduler_callback.go:137 status update callback received
> {"appId": "yunikorn-dev-03c96-autogen", "new status": "Completed"}
> 2024-09-20T11:23:07.607Z INFO core.scheduler.queue
> objects/queue.go:830 Application completed and removed from queue
> {"queueName": "root.dev-03c96", "applicationID": "yunikorn-dev-03c96-autogen"}
> 2024-09-20T11:23:07.607Z INFO core.scheduler.partition
> scheduler/partition.go:1539 Removing terminated application from the
> application list {"appID": "yunikorn-dev-03c96-autogen", "app status":
> "Completed"}
> 2024-09-20T11:23:07.607Z INFO core.scheduler.application.usage
> objects/application_summary.go:60 YK_APP_SUMMARY: {ApplicationID:
> yunikorn-dev-03c96-autogen, SubmissionTime: 1726831345581, StartTime:
> 1726831347597, FinishTime: 1726831387607, User: kubernetes-admin, Queue:
> root.dev-03c96, State: Completed, RmID: mycluster, ResourceUsage:
> TrackedResource{UNKNOWN:pods=10}, PreemptedResource: TrackedResource{},
> PlaceholderResource: TrackedResource{}} {code}
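
The behaviour the description asks for can be sketched roughly as below. This is only an illustration, not the shim's code: the `Pod`, `Task`, `bindVolumes`, `failTask` and `postAllocate` names are hypothetical stand-ins, and using "TaskFailed" as the pod reason is an assumption borrowed from the event shown above. The point is simply that the same step that moves the task from Allocated to Failed also moves the pod out of Pending.

```go
package main

import (
	"errors"
	"fmt"
)

// PodPhase holds the two pod phases that matter for this sketch.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodFailed  PodPhase = "Failed"
)

// Pod is a hypothetical stand-in for the Kubernetes pod object.
type Pod struct {
	Name    string
	Phase   PodPhase
	Reason  string
	Message string
}

// Task is a hypothetical stand-in for the shim task. State uses the
// state names seen in the log ("Allocated", "Failed").
type Task struct {
	State string
	Pod   *Pod
}

// bindVolumes simulates the post-allocation step that timed out in the report.
func bindVolumes(p *Pod) error {
	return errors.New("binding volumes: context deadline exceeded")
}

// failTask moves the task to Failed and, unlike the reported behaviour,
// also records a terminal phase on a pod that is still Pending.
func failTask(t *Task, reason string) {
	t.State = "Failed"
	if t.Pod.Phase == PodPending {
		t.Pod.Phase = PodFailed
		t.Pod.Reason = "TaskFailed" // assumed reason, taken from the event above
		t.Pod.Message = reason
	}
}

// postAllocate runs the post-allocation step and fails the task on error.
func postAllocate(t *Task) {
	if err := bindVolumes(t.Pod); err != nil {
		failTask(t, fmt.Sprintf("bind volumes to pod failed, name: %s, %v", t.Pod.Name, err))
	}
}

func main() {
	pod := &Pod{Name: "dev-nnjxy/pod-btv0y", Phase: PodPending}
	task := &Task{State: "Allocated", Pod: pod}
	postAllocate(task)
	fmt.Println(task.State, pod.Phase, pod.Message)
}
```

In a real fix the pod update would presumably go through the Kubernetes API as a status update rather than a field assignment; that part is left out here on purpose.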