[
https://issues.apache.org/jira/browse/YUNIKORN-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wilfred Spiegelenburg updated YUNIKORN-2926:
--------------------------------------------
Target Version: 1.7.0, 1.6.1
We update the tracking data in multiple places, which makes tracing difficult.
We also do not check whether the tracking data is correct in the unit tests.
This could easily lead to the tracking data not being updated correctly in this
case. Stepping through the code, I think we have multiple edge cases that are
not doing the right thing.
I am opening a PR to fix what I have found.
> The Pod using gang scheduling is stuck in the Pending state
> -----------------------------------------------------------
>
> Key: YUNIKORN-2926
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2926
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: wangzhihui
> Assignee: Wilfred Spiegelenburg
> Priority: Minor
> Attachments: image-2024-10-15-11-54-33-458.png, image.png
>
>
> desc:
> The real allocation is larger than all of the placeholders, so all
> allocations are released. As a result, all Pods stay in the Pending state.
> !image-2024-10-15-11-54-33-458.png!
> !image.png!
> {code:yaml}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: simple-gang-job
> spec:
>   completions: 2
>   parallelism: 2
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "simple-gang-job"
>         queue: root.default
>       annotations:
>         yunikorn.apache.org/schedulingPolicyParameters:
>           "placeholderTimeoutInSeconds=30 gangSchedulingStyle=Hard"
>         yunikorn.apache.org/task-group-name: task-group-example
>         yunikorn.apache.org/task-groups: |-
>           [{
>             "name": "task-group-example",
>             "minMember": 2,
>             "minResource": {
>               "cpu": "100m",
>               "memory": "50M"
>             },
>             "nodeSelector": {},
>             "tolerations": [],
>             "affinity": {},
>             "topologySpreadConstraints": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep30
>           image: "alpine:latest"
>           command: ["sleep", "99999999"]
>           resources:
>             requests:
>               cpu: "200m"
>               memory: "50M"
> {code}
> solution:
> If the app is in Hard mode, it will transition to a Failing state. If it is
> in Soft mode, it will transition to a Resuming state.
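
The mismatch in the job spec above (placeholder minResource of 100m CPU versus a real request of 200m CPU) and the proposed state handling can be illustrated with a minimal Go sketch. Everything here is an assumption for illustration only: the Resource type and the fitsIn and nextState functions are hypothetical names and are not taken from the YuniKorn code base. Only the state names (Failing, Resuming) and the style names (Hard, Soft) come from the issue text.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified resource vector.
// It is NOT the resource type used by YuniKorn.
type Resource struct {
	MilliCPU int64 // CPU in millicores, e.g. 100 for "100m"
	MemoryMB int64 // memory in megabytes, e.g. 50 for "50M"
}

// fitsIn reports whether the real request fits inside a placeholder.
func fitsIn(request, placeholder Resource) bool {
	return request.MilliCPU <= placeholder.MilliCPU &&
		request.MemoryMB <= placeholder.MemoryMB
}

// nextState returns the application state proposed in the issue for the
// given gang scheduling style (hypothetical helper, illustration only).
func nextState(style string) string {
	switch style {
	case "Hard":
		return "Failing"
	case "Soft":
		return "Resuming"
	default:
		return "Unknown"
	}
}

func main() {
	// Values taken from the job spec in the issue description.
	placeholder := Resource{MilliCPU: 100, MemoryMB: 50}
	request := Resource{MilliCPU: 200, MemoryMB: 50}

	if !fitsIn(request, placeholder) {
		fmt.Println("real request exceeds placeholder; proposed state:",
			nextState("Hard"))
	}
}
```

With the values from the issue, the request does not fit in the placeholder, so under the proposal a Hard style application would move to Failing instead of leaving its Pods in Pending.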
--
This message was sent by Atlassian Jira
(v8.20.10#820010)