[jira] [Commented] (YUNIKORN-2926) The Pod using gang scheduling is stuck in the Pending state

Wilfred Spiegelenburg (Jira) Mon, 14 Oct 2024 21:36:05 -0700


    [ 
https://issues.apache.org/jira/browse/YUNIKORN-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889430#comment-17889430
 ]


Wilfred Spiegelenburg commented on YUNIKORN-2926:
-------------------------------------------------

When the placeholders are released we do not immediately try to place the real 
pod (the larger one) on a node. We cannot do that as we need to track changes. 
The released placeholder must be processed before we look again.

If the real pod is larger than the placeholder, the larger resource requirement 
might not fit in the queue or on any node, and the pod might thus never get 
scheduled. So we process the real pods as normal allocations with all the 
checks. Depending on the difference, you might be able to accommodate all real 
pods or just a fraction of them.
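
To make the size difference concrete, here is a minimal Python sketch that 
compares the placeholder size (the task group minResource) with the request of 
the real pod, and counts how many real pods would fit into a given amount of 
free resources. This is not YuniKorn code: the function names are hypothetical, 
the quantity parser covers only a few suffixes, and the free headroom in the 
usage note is an invented number. The real scheduler also checks queue quota 
and per-node fit, which this sketch ignores.

```python
from fractions import Fraction

# Subset of the suffixes used by Kubernetes resource quantities.
_SUFFIXES = {
    "m": Fraction(1, 1000),
    "k": Fraction(10**3), "M": Fraction(10**6), "G": Fraction(10**9),
    "Ki": Fraction(2**10), "Mi": Fraction(2**20), "Gi": Fraction(2**30),
}


def parse_quantity(text):
    """Parse a quantity such as '200m' or '50M' into an exact Fraction."""
    # Check two-letter suffixes first so that 'Mi' is not read as 'i' + 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Fraction(text[: -len(suffix)]) * _SUFFIXES[suffix]
    return Fraction(text)


def oversized_resources(placeholder, real):
    """Return the resource names where the real pod asks for more than the placeholder."""
    return sorted(
        name for name, amount in real.items()
        if parse_quantity(amount) > parse_quantity(placeholder.get(name, "0"))
    )


def pods_that_fit(available, real, wanted):
    """Count how many real pods (at most `wanted`) fit into `available`."""
    per_resource = [
        parse_quantity(available.get(name, "0")) // parse_quantity(amount)
        for name, amount in real.items()
        if parse_quantity(amount) > 0
    ]
    return min([wanted] + [int(n) for n in per_resource])
```

With the values from the job in this issue, 
oversized_resources({"cpu": "100m", "memory": "50M"}, {"cpu": "200m", "memory": "50M"}) 
reports that only cpu is larger than the placeholder. With an assumed free 
headroom of {"cpu": "300m", "memory": "1G"}, pods_that_fit returns 1 for the 2 
pods the job wants, which is the "just a fraction" case.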

These scenarios mean that the pods can stay pending without anything being 
wrong. You need to do a proper analysis of why the pod stays pending. Nothing 
provided here shows that we have a problem.

> The Pod using gang scheduling is stuck in the Pending state
> -----------------------------------------------------------
>
>                 Key: YUNIKORN-2926
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2926
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: wangzhihui
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: image-2024-10-15-11-54-33-458.png, image.png
>
>
> Description:
>  The real allocation is larger than the placeholders, so all allocations are 
> released. As a result, all Pods are stuck in the Pending state.
> !image-2024-10-15-11-54-33-458.png!
> !image.png!
> {code:java}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: simple-gang-job
> spec:
>   completions: 2
>   parallelism: 2
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "simple-gang-job"
>         queue: root.default
>       annotations:
>         yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=30 gangSchedulingStyle=Hard"
>         yunikorn.apache.org/task-group-name: task-group-example
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "task-group-example",
>               "minMember": 1,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "50M"
>               },
>               "nodeSelector": {},
>               "tolerations": [],
>               "affinity": {},
>               "topologySpreadConstraints": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep30
>           image: "alpine:latest"
>           command: ["sleep", "99999999"]
>           resources:
>             requests:
>               cpu: "200m"
>               memory: "50M"
> {code}
> Solution:
> If the app is in Hard mode, it will transition to a Failing state. If it is 
> in Soft mode, it will transition to a Resuming state.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

