Wilfred Spiegelenburg resolved YUNIKORN-2141.
---------------------------------------------
Fix Version/s: 1.4.0
Resolution: Fixed
> Should not preempt placeholders which has been released
> --------------------------------------------------------
>
> Key: YUNIKORN-2141
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2141
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Qi Zhu
> Assignee: Qi Zhu
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.4.0
>
>
> The details about the bug:
> * The real pod is created and waits for scheduling after the placeholders are bound:
> {code:java}
> {"stream":"stdout","log":"2023-11-08T15:16:14.912Z\tINFO\tcache/task_state.go:380\tTask
> state transition\t{\"app\": \"spark-28105bdfe17b494887c0c443f8a3ab0f\",
> \"task\": \"8837de6e-d888-4549-9baf-254c8a807421\", \"taskAlias\":
> \"dex-app-q5nslqd5/ogautaealleventsdynamicu2klogfm2-50-eb8bde8baf814091-driver\",
> \"source\": \"New\", \"destination\": \"Pending\", \"event\":
> \"InitTask\"}"}{code}
> {code:java}
> {"stream":"stdout","log":"2023-11-08T15:16:14.912Z\tINFO\tcache/task_state.go:380\tTask
> state transition\t{\"app\": \"spark-28105bdfe17b494887c0c443f8a3ab0f\",
> \"task\": \"8837de6e-d888-4549-9baf-254c8a807421\", \"taskAlias\":
> \"dex-app-q5nslqd5/ogautaealleventsdynamicu2klogfm2-50-eb8bde8baf814091-driver\",
> \"source\": \"Pending\", \"destination\": \"Scheduling\", \"event\":
> \"SubmitTask\"}"}{code}
> * The scheduler processes the placeholder replacement and sends a release
> allocation request to the shim side:
> {code:java}
> {"stream":"stdout","log":"2023-11-08T15:16:14.912Z\tINFO\tscheduler/partition.go:828\tscheduler
> replace placeholder processed\t{\"appID\":
> \"spark-28105bdfe17b494887c0c443f8a3ab0f\", \"allocationKey\":
> \"8837de6e-d888-4549-9baf-254c8a807421\", \"uuid\":
> \"9508439d-60a2-404e-9c84-bd2c6783b5c7\", \"placeholder released uuid\":
> \"cc243ba1-7054-4b07-8344-6afb1424b1e0\"}"}{code}
> {code:java}
> {"stream":"stdout","log":"2023-11-08T15:16:14.913Z\tINFO\tcache/application.go:637\ttry
> to release pod from application\t{\"appID\":
> \"spark-28105bdfe17b494887c0c443f8a3ab0f\", \"allocationUUID\":
> \"cc243ba1-7054-4b07-8344-6afb1424b1e0\", \"terminationType\":
> \"PLACEHOLDER_REPLACED\"}"}{code}
> * At around the same time, the preemption task tries to preempt the
> placeholder whose release request has already been sent:
> {code:java}
> {"stream":"stdout","log":"2023-11-08T15:16:20.870Z\tINFO\tobjects/preemption.go:563\tPreempting
> task\t{\"applicationID\": \"spark-28105bdfe17b494887c0c443f8a3ab0f\",
> \"allocationKey\": \"e6e91651-7152-42f5-8504-355590fa0079\", \"nodeID\":
> \"ip-10-157-240-201.ec2.internal\", \"resources\": \"map[memory:3430940672
> pods:1 vcore:2100]\"}"}{code}
> {code:java}
> {"stream":"stdout","log":"2023-11-08T15:16:20.871Z\tINFO\tcache/application.go:637\ttry
> to release pod from application\t{\"appID\":
> \"spark-28105bdfe17b494887c0c443f8a3ab0f\", \"allocationUUID\":
> \"cc243ba1-7054-4b07-8344-6afb1424b1e0\", \"terminationType\":
> \"PREEMPTED_BY_SCHEDULER\"}"}{code}
> * The pod is deleted, which triggers CompleteTask, and the terminationType is
> PREEMPTED_BY_SCHEDULER:
> {code:java}
> {"stream":"stdout","log":"2023-11-08T15:16:45.489Z\tINFO\tgeneral/general.go:204\tdelete
> pod\t{\"appType\": \"general\", \"namespace\": \"dex-app-q5nslqd5\",
> \"podName\": \"tg-spark-driver-spark-28105bdfe17b494887c0c4-0\", \"podUID\":
> \"e6e91651-7152-42f5-8504-355590fa0079\"}"}{code}
> {code:java}
> {"stream":"stdout","log":"2023-11-08T15:16:45.489Z\tINFO\tcache/task_state.go:380\tTask
> state transition\t{\"app\": \"spark-28105bdfe17b494887c0c443f8a3ab0f\",
> \"task\": \"e6e91651-7152-42f5-8504-355590fa0079\", \"taskAlias\":
> \"dex-app-q5nslqd5/tg-spark-driver-spark-28105bdfe17b494887c0c4-0\",
> \"source\": \"Bound\", \"destination\": \"Completed\", \"event\":
> \"CompleteTask\"}"}{code}
> {code:java}
> {"stream":"stdout","log":"2023-11-08T15:16:45.489Z\tINFO\tscheduler/partition.go:1245\tremoving
> allocation from application\t{\"appID\":
> \"spark-28105bdfe17b494887c0c443f8a3ab0f\", \"allocationId\":
> \"cc243ba1-7054-4b07-8344-6afb1424b1e0\", \"terminationType\":
> \"PREEMPTED_BY_SCHEDULER\"}"}{code}
> * The real pod stays pending forever, because the core side never receives
> the release response for PLACEHOLDER_REPLACED:
> {code:java}
> // if we have an uuid the termination type is important
> if release.TerminationType == si.TerminationType_PLACEHOLDER_REPLACED {
>     log.Logger().Info("replacing placeholder allocation",
>         zap.String("appID", appID),
>         zap.String("allocationId", uuid))
>     if alloc := app.ReplaceAllocation(uuid); alloc != nil {
>         released = append(released, alloc)
>     }
> } else {
>     log.Logger().Info("removing allocation from application",
>         zap.String("appID", appID),
>         zap.String("allocationId", uuid),
>         zap.String("terminationType", release.TerminationType.String()))
>     if alloc := app.RemoveAllocation(uuid); alloc != nil {
>         released = append(released, alloc)
>     }
> }
> {code}
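
The sequence above suggests one possible guard: when collecting preemption victims, skip any placeholder whose release has already been sent. The sketch below only illustrates that idea and is not the change merged for 1.4.0. The `allocation` type, its `released` flag, and `filterVictims` are hypothetical names, not YuniKorn APIs.

```go
package main

import "fmt"

// allocation is a hypothetical, simplified stand-in for the scheduler's
// allocation object. The real YuniKorn type carries many more fields.
type allocation struct {
	uuid        string
	placeholder bool
	released    bool // assumed flag: a release request was already sent to the shim
}

// filterVictims returns the candidates that are still safe to preempt.
// A placeholder with a release in flight is skipped, so its pending
// PLACEHOLDER_REPLACED termination cannot be overwritten by
// PREEMPTED_BY_SCHEDULER.
func filterVictims(candidates []*allocation) []*allocation {
	victims := make([]*allocation, 0, len(candidates))
	for _, alloc := range candidates {
		if alloc.placeholder && alloc.released {
			continue
		}
		victims = append(victims, alloc)
	}
	return victims
}

func main() {
	candidates := []*allocation{
		// The placeholder from the logs above, already being replaced.
		{uuid: "cc243ba1-7054-4b07-8344-6afb1424b1e0", placeholder: true, released: true},
		// A made-up regular allocation that remains eligible.
		{uuid: "hypothetical-running-alloc"},
	}
	for _, v := range filterVictims(candidates) {
		fmt.Println("preemption candidate:", v.uuid)
	}
}
```

With this guard, the second release request carrying PREEMPTED_BY_SCHEDULER would never be issued for the placeholder, so the shim's response would reach the PLACEHOLDER_REPLACED branch shown above.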