Ayub Pathan created YUNIKORN-516:
------------------------------------
Summary: Yunikorn scheduler seems to be in deadlock state
Key: YUNIKORN-516
URL: https://issues.apache.org/jira/browse/YUNIKORN-516
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Reporter: Ayub Pathan
Apply the job templates below to reproduce the issue.
1. First application with gang scheduling annotations
{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-1
spec:
  completions: 2
  parallelism: 2
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-1"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: tg1
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "tg1",
              "minMember": 2,
              "minResource": {
                "cpu": "100m",
                "memory": "500M"
              },
              "nodeSelector": {},
              "tolerations": []
          }]
    spec:
      schedulerName: yunikorn
      restartPolicy: Never
      containers:
        - name: sleep300
          image: "alpine:latest"
          command: ["sleep", "300"]
          resources:
            requests:
              cpu: "100m"
              memory: "500M"
{noformat}
2. Second application to the same task group
{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-2
spec:
  completions: 4
  parallelism: 4
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-2"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: tg1
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "tg1",
              "minMember": 2,
              "minResource": {
                "cpu": "100m",
                "memory": "500M"
              },
              "nodeSelector": {},
              "tolerations": []
          }]
    spec:
      schedulerName: yunikorn
      restartPolicy: Never
      containers:
        - name: sleep300
          image: "alpine:latest"
          command: ["sleep", "300"]
          resources:
            requests:
              cpu: "100m"
              memory: "500M"
{noformat}
3. Third application to the same task group
{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-3
spec:
  completions: 10
  parallelism: 10
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-3"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: tg1
        yunikorn.apache.org/task-groups: |-
          [{
              "name": "tg1",
              "minMember": 3,
              "minResource": {
                "cpu": "100m",
                "memory": "500M"
              },
              "nodeSelector": {},
              "tolerations": []
          }]
    spec:
      schedulerName: yunikorn
      restartPolicy: Never
      containers:
        - name: sleep300
          image: "alpine:latest"
          command: ["sleep", "300"]
          resources:
            requests:
              cpu: "100m"
              memory: "500M"
{noformat}
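For reference, the placeholder footprint each template requests can be worked out from its task-groups annotation. This is only a minimal sketch: it assumes that each of the minMember placeholders reserves one minResource, and the helper names (cpu_millicores, memory_megabytes, gang_minimum) are hypothetical and not part of YuniKorn. The annotation strings are copied from the three templates, with the empty nodeSelector and tolerations fields omitted.

```python
import json

# Task-group annotations from the three job templates in this report
# (empty "nodeSelector" and "tolerations" fields omitted for brevity).
TASK_GROUPS = {
    "batch-sleep-job-1": '[{"name": "tg1", "minMember": 2, '
                         '"minResource": {"cpu": "100m", "memory": "500M"}}]',
    "batch-sleep-job-2": '[{"name": "tg1", "minMember": 2, '
                         '"minResource": {"cpu": "100m", "memory": "500M"}}]',
    "batch-sleep-job-3": '[{"name": "tg1", "minMember": 3, '
                         '"minResource": {"cpu": "100m", "memory": "500M"}}]',
}


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "100m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_megabytes(quantity: str) -> int:
    """Convert a memory quantity to megabytes.

    Only the decimal "M" suffix used in this report is handled.
    """
    if not quantity.endswith("M"):
        raise ValueError(f"unsupported memory quantity: {quantity}")
    return int(quantity[:-1])


def gang_minimum(annotation: str) -> dict:
    """Return members and total minimum resources per task group.

    Assumption: the total is minMember multiplied by minResource.
    """
    totals = {}
    for group in json.loads(annotation):
        members = group["minMember"]
        resource = group["minResource"]
        totals[group["name"]] = {
            "members": members,
            "cpu_m": members * cpu_millicores(resource["cpu"]),
            "memory_M": members * memory_megabytes(resource["memory"]),
        }
    return totals


if __name__ == "__main__":
    for job, annotation in TASK_GROUPS.items():
        print(job, gang_minimum(annotation))
```

Under that assumption, the third application needs three placeholders totalling 300m CPU and 1500M memory before its ten pods can start.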
The pods of the third application remain in the Pending state even though
the placeholders were created and terminated.
{noformat}
NAME                     READY  STATUS     RS  CPU  MEM  %CPU/R  %MEM/R  %CPU/L  %MEM/L  IP               NODE                                             QOS  AGE
batch-sleep-job-1-7lrd5  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.208  ip-10-192-143-108.ca-central-1.compute.internal  BU   18m
batch-sleep-job-1-lw4t9  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.213  ip-10-192-136-201.ca-central-1.compute.internal  BU   18m
batch-sleep-job-2-c95dg  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.210  ip-10-192-143-108.ca-central-1.compute.internal  BU   17m
batch-sleep-job-2-vnfjb  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.142.211  ip-10-192-143-108.ca-central-1.compute.internal  BU   17m
batch-sleep-job-2-x4mcz  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.216  ip-10-192-136-201.ca-central-1.compute.internal  BU   17m
batch-sleep-job-2-ztnfq  0/1    Completed  0   n/a  n/a  n/a     n/a     n/a     n/a     100.100.134.217  ip-10-192-136-201.ca-central-1.compute.internal  BU   17m
batch-sleep-job-3-7tp5t  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-59mnj  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-bm4fd  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-c4mxg  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-cljfj  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-gcvnp  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-gwgnn  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-kj88t  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-p8c7w  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
batch-sleep-job-3-td575  0/0    Pending    0   n/a  n/a  n/a     n/a     n/a     n/a     n/a              n/a                                              BU   16m
{noformat}
Attaching the [^stack] trace, [^yk.log], and the [^metrics] API response for reference.
This was observed with the v0.10 build.