[
https://issues.apache.org/jira/browse/YUNIKORN-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wilfred Spiegelenburg resolved YUNIKORN-516.
--------------------------------------------
Resolution: Duplicate
> Yunikorn scheduler seems to be in deadlock state
> ------------------------------------------------
>
> Key: YUNIKORN-516
> URL: https://issues.apache.org/jira/browse/YUNIKORN-516
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Ayub Pathan
> Assignee: Wilfred Spiegelenburg
> Priority: Blocker
> Attachments: metrics, stack, yk.log
>
>
> Apply the job templates below to reproduce the issue.
> 1. First application with gang scheduling annotations
>
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-1
> spec:
>   completions: 2
>   parallelism: 2
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-1"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: tg1
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "tg1",
>               "minMember": 2,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "500M"
>               },
>               "nodeSelector": {},
>               "tolerations": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep300
>           image: "alpine:latest"
>           command: ["sleep", "300"]
>           resources:
>             requests:
>               cpu: "100m"
>               memory: "500M"
> {noformat}
>
> 2. Second application with the same task group
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-2
> spec:
>   completions: 4
>   parallelism: 4
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-2"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: tg1
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "tg1",
>               "minMember": 2,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "500M"
>               },
>               "nodeSelector": {},
>               "tolerations": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep300
>           image: "alpine:latest"
>           command: ["sleep", "300"]
>           resources:
>             requests:
>               cpu: "100m"
>               memory: "500M"
> {noformat}
>
> 3. Third application with the same task group
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-3
> spec:
>   completions: 10
>   parallelism: 10
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-3"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: tg1
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "tg1",
>               "minMember": 3,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "500M"
>               },
>               "nodeSelector": {},
>               "tolerations": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep300
>           image: "alpine:latest"
>           command: ["sleep", "300"]
>           resources:
>             requests:
>               cpu: "100m"
>               memory: "500M"
> {noformat}
> The pods of the third application remain in Pending state even though the
> placeholders were created and terminated.
> {noformat}
> NAME↑                    READY  STATUS     RS  CPU  MEM  %CPU/R  %MEM/R  %CPU/L  %MEM/L  IP               NODE                                             QOS  AGE
> batch-sleep-job-1-7lrd5  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.208  ip-10-192-143-108.ca-central-1.compute.internal  BU   18m
> batch-sleep-job-1-lw4t9  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.213  ip-10-192-136-201.ca-central-1.compute.internal  BU   18m
> batch-sleep-job-2-c95dg  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.210  ip-10-192-143-108.ca-central-1.compute.internal  BU   17m
> batch-sleep-job-2-vnfjb  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.211  ip-10-192-143-108.ca-central-1.compute.internal  BU   17m
> batch-sleep-job-2-x4mcz  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.216  ip-10-192-136-201.ca-central-1.compute.internal  BU   17m
> batch-sleep-job-2-ztnfq  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.217  ip-10-192-136-201.ca-central-1.compute.internal  BU   17m
> batch-sleep-job-3-7tp5t  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-59mnj  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-bm4fd  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-c4mxg  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-cljfj  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-gcvnp  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-gwgnn  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-kj88t  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-p8c7w  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> batch-sleep-job-3-td575  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
> {noformat}
> Attaching the [^stack] trace, [^yk.log], and the [^metrics] API response for reference.
> This is observed with the v0.10 build.
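
For anyone triaging this, the placeholder demand implied by the three task-group annotations in the report can be worked out with a short script. This is only a sketch for checking the arithmetic and is not YuniKorn code. The helper names (`build_annotation`, `parse_cpu_millis`, `parse_memory_mb`, `gang_demand`) are hypothetical, the parsing handles only the `m` CPU suffix and `M` memory suffix that appear in the templates, and it assumes each task group asks for `minMember` placeholders of `minResource` each.

```python
import json

# minMember values taken from the three Job templates in this report.
MIN_MEMBERS = {
    "batch-sleep-job-1": 2,
    "batch-sleep-job-2": 2,
    "batch-sleep-job-3": 3,
}


def build_annotation(min_member: int) -> str:
    """Rebuild the yunikorn.apache.org/task-groups value used in the report."""
    return json.dumps([{
        "name": "tg1",
        "minMember": min_member,
        "minResource": {"cpu": "100m", "memory": "500M"},
        "nodeSelector": {},
        "tolerations": [],
    }])


def parse_cpu_millis(value: str) -> int:
    """Convert a CPU quantity such as "100m" or "2" to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def parse_memory_mb(value: str) -> int:
    """Convert a memory quantity with the decimal "M" suffix to megabytes."""
    if not value.endswith("M"):
        raise ValueError(f"unsupported memory quantity in this sketch: {value}")
    return int(value[:-1])


def gang_demand(annotation: str) -> dict:
    """Sum minMember * minResource over all task groups in one annotation."""
    totals = {"placeholders": 0, "cpu_millis": 0, "memory_mb": 0}
    for group in json.loads(annotation):
        members = group["minMember"]
        resource = group["minResource"]
        totals["placeholders"] += members
        totals["cpu_millis"] += members * parse_cpu_millis(resource["cpu"])
        totals["memory_mb"] += members * parse_memory_mb(resource["memory"])
    return totals


if __name__ == "__main__":
    for app_id, min_member in MIN_MEMBERS.items():
        print(app_id, gang_demand(build_annotation(min_member)))
```

Under those assumptions the third application needs only three placeholders (300m CPU, 1500M memory in total), which is a small request compared with what the first two applications were granted on the same nodes.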