[ https://issues.apache.org/jira/browse/YUNIKORN-3128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18043760#comment-18043760 ]

Jeff Gao edited comment on YUNIKORN-3128 at 12/9/25 6:50 AM:
-------------------------------------------------------------

While I'm aware of the idea of 'retrying the allocation' (YUNIKORN-2804), I 
wonder if it also makes sense to do it the other way around:

Upon a bind failure (if it is transient, we can retry a few times on the spot 
and see if that helps), we reverse the allocation decision made by the core and 
pick the pod up in the next scheduling cycle. At a high level, it looks like 
this (see the sketch after the list):
 * The shim calls the core to reverse the allocation by specifying the 
allocations to release.
 * The shim puts the task back to Pending, to be picked up in the next 
scheduling cycle (as a brand new allocation).
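
To make the idea concrete, here is a rough Go sketch of the shim-side flow. The 
{{CoreAPI}} and {{Task}} types and the method names below are placeholders for 
illustration only, not the actual shim or core interfaces:
{code:go}
package shim

import (
	"fmt"
	"time"
)

// CoreAPI is a hypothetical stand-in for the shim-to-core interface.
type CoreAPI interface {
	// ReleaseAllocation asks the core to reverse (release) a given allocation.
	ReleaseAllocation(appID, allocationKey string) error
}

// Task is a hypothetical stand-in for the shim-side task object.
type Task struct {
	AppID         string
	AllocationKey string
	Bind          func() error // performs the pods/binding API call
	MarkPending   func()       // puts the task back to Pending for the next cycle
}

// handleBindFailure retries the bind a few times on the spot; if it still
// fails, it asks the core to release the allocation and resets the task to
// Pending so it is picked up in the next scheduling cycle as a new allocation.
func handleBindFailure(core CoreAPI, t *Task, attempts int, wait time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = t.Bind(); err == nil {
			return nil // transient failure resolved on the spot
		}
		time.Sleep(wait)
	}
	// Bind kept failing: reverse the allocation decision made by the core.
	if relErr := core.ReleaseAllocation(t.AppID, t.AllocationKey); relErr != nil {
		return fmt.Errorf("bind failed (%v), release also failed: %w", err, relErr)
	}
	// Hand the pod back to the next scheduling cycle.
	t.MarkPending()
	return nil
}
{code}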

This approach seems simpler to me. What do you think? [~pbacsko]

> Yunikorn ignores pending pods after apiserver errors
> ----------------------------------------------------
>
>                 Key: YUNIKORN-3128
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-3128
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 1.7.0
>         Environment: EKS 1.31
>            Reporter: Ruiwen Zhao
>            Priority: Major
>
> We are running some load testing with Yunikorn, where pods are created at 
> 200/s and we monitor whether Yunikorn can schedule them at the same rate. 
>  
> One issue we saw is that Yunikorn ends up ignoring a number of pods (~2000) at 
> the end of the load test and completes the application. As shown below, there 
> are many pods still Pending, but Yunikorn completes the application they 
> belong to, and therefore those pods are stuck. All the pods have 
> "schedulerName: yunikorn".
>  
> {code:java}
> ❯ kc get pods -n spark8s-kube-burner-yunikorn | grep Pending  | head
> kube-burner-0-0-82077   0/1     Pending   0          16m
> kube-burner-0-0-82105   0/1     Pending   0          16m
> kube-burner-0-0-82129   0/1     Pending   0          16m
> kube-burner-0-0-82132   0/1     Pending   0          16m
> kube-burner-0-0-82140   0/1     Pending   0          16m
> kube-burner-0-0-82141   0/1     Pending   0          16m
> 2025-09-29T18:28:18.866Z    INFO    core.scheduler.fsm    
> objects/application_state.go:147    Application state transition    {"appID": 
> "yunikorn-spark8s-kube-burner-yunikorn-0", "source": "Completing", 
> "destination": "Completed", "event": "completeApplication"} {code}
> When looking at one of the Pending pods (kube-burner-0-0-82077), we can see 
> that Yunikorn was trying to schedule it but failed to do so because of etcd 
> errors. It retried once, failed again, and then submitted the task again, but 
> there is no log after that:
> {code:java}
> 2025-09-30T21:18:45.248Z INFO shim.fsm cache/task_state.go:381 Task state 
> transition {"app": "yunikorn-spark8s-kube-burner-yunikorn-0", "task": 
> "731dc815-9ee0-4767-a5a9-939219b94f6e", "taskAlias": 
> "spark8s-kube-burner-yunikorn/kube-burner-0-0-82077", "source": "New", 
> "destination": "Pending", "event": "InitTask"}
> 2025-09-30T21:18:45.260Z INFO shim.fsm cache/task_state.go:381 Task state 
> transition {"app": "yunikorn-spark8s-kube-burner-yunikorn-0", "task": 
> "731dc815-9ee0-4767-a5a9-939219b94f6e", "taskAlias": 
> "spark8s-kube-burner-yunikorn/kube-burner-0-0-82077", "source": "Pending", 
> "destination": "Scheduling", "event": "SubmitTask"} 
> 2025-09-30T21:18:59.464Z ERROR shim.client client/kubeclient.go:127 failed to 
> bind pod {"namespace": "spark8s-kube-burner-yunikorn", "podName": 
> "kube-burner-0-0-82077", "error": "Operation cannot be fulfilled on 
> pods/binding \"kube-burner-0-0-82077\": etcdserver: request timed out"}
> 2025-09-30T21:18:59.465Z ERROR shim.cache.task cache/task.go:464 task failed 
> {"appID": "yunikorn-spark8s-kube-burner-yunikorn-0", "taskID": 
> "731dc815-9ee0-4767-a5a9-939219b94f6e", "reason": "bind pod to node failed, 
> name: spark8s-kube-burner-yunikorn/kube-burner-0-0-82077, Operation cannot be 
> fulfilled on pods/binding \"kube-burner-0-0-82077\": etcdserver: request 
> timed out"} 
> 2025-09-30T21:18:59.465Z INFO shim.fsm cache/task_state.go:381 Task state 
> transition {"app": "yunikorn-spark8s-kube-burner-yunikorn-0", "task": 
> "731dc815-9ee0-4767-a5a9-939219b94f6e", "taskAlias": 
> "spark8s-kube-burner-yunikorn/kube-burner-0-0-82077", "source": "Allocated", 
> "destination": "Failed", "event": "TaskFail"} 
> 2025-09-30T21:18:59.464Z ERROR shim.cache.task cache/task.go:388 bind pod to 
> node failed {"taskID": "731dc815-9ee0-4767-a5a9-939219b94f6e", "error": 
> "Operation cannot be fulfilled on pods/binding \"kube-burner-0-0-82077\": 
> etcdserver: request timed out"}
> 2025-09-30T21:19:25.464Z INFO shim.fsm cache/task_state.go:381 Task state 
> transition {"app": "yunikorn-spark8s-kube-burner-yunikorn-0", "task": 
> "731dc815-9ee0-4767-a5a9-939219b94f6e", "taskAlias": 
> "spark8s-kube-burner-yunikorn/kube-burner-0-0-82077", "source": "New", 
> "destination": "Pending", "event": "InitTask"}
> 2025-09-30T21:19:25.464Z INFO shim.fsm cache/task_state.go:381 Task state 
> transition {"app": "yunikorn-spark8s-kube-burner-yunikorn-0", "task": 
> "731dc815-9ee0-4767-a5a9-939219b94f6e", "taskAlias": 
> "spark8s-kube-burner-yunikorn/kube-burner-0-0-82077", "source": "Pending", 
> "destination": "Scheduling", "event": "SubmitTask"} {code}
> The failure seems to be caused by an etcd timeout, which makes sense, but IMO 
> the expected behavior is that Yunikorn keeps trying to schedule the pods with 
> backoff, something along the lines of the sketch below. 
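> (A minimal sketch only; the {{bind}} argument is a placeholder for whatever 
> performs the pods/binding call, not the actual shim code.)
> {code:go}
> package retry
> 
> import "time"
> 
> // bindWithBackoff keeps retrying a transient bind failure with exponential
> // backoff instead of failing the task for good.
> func bindWithBackoff(bind func() error, maxAttempts int, base time.Duration) error {
> 	var err error
> 	delay := base
> 	for i := 0; i < maxAttempts; i++ {
> 		if err = bind(); err == nil {
> 			return nil
> 		}
> 		time.Sleep(delay)
> 		delay *= 2 // double the wait between attempts
> 	}
> 	return err
> }
> {code}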
>  
> Yunikorn version: 1.7.0
> Env: EKS 1.31


