[
https://issues.apache.org/jira/browse/YUNIKORN-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Bacsko updated YUNIKORN-1161:
-----------------------------------
Description:
If we create pods where the name of the task group does not match the
{{task-group-name}} annotation, then the real pods will not transition to
the {{Running}} state when the placeholder pods expire.
For example, modify the sleep batch job like this:
{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-9
spec:
  completions: 5
  parallelism: 5
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-9"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: sleep-groupxxx
        yunikorn.apache.org/task-groups: |-
          [{
            "name": "sleep-group",
            "minMember": 5,
            "minResource": {
              "cpu": "100m",
              "memory": "2000M"
            },
            "nodeSelector": {},
            "tolerations": []
          }]
...
{noformat}
Submit the job and restart Yunikorn when the placeholders are already running.
This results in "batch-sleep-job-9-nnnnn" pods that never transition to
{{Running}} and have to be terminated manually.
{noformat}
$ kubectl get pods -A | grep -E "(batch-sleep-job-9|yunikorn)"
default   batch-sleep-job-9-hgxxl                          0/1   Pending   0   20m
default   batch-sleep-job-9-j6twt                          0/1   Pending   0   20m
default   batch-sleep-job-9-l4jhm                          0/1   Pending   0   20m
default   batch-sleep-job-9-swlm4                          0/1   Pending   0   20m
default   batch-sleep-job-9-z6wqx                          0/1   Pending   0   20m
default   yunikorn-admission-controller-78c775cfd9-6pp8d   1/1   Running   4   3d22h
default   yunikorn-scheduler-77dd7c665b-f8kkn              2/2   Running   0   18m
{noformat}
Note that without a YuniKorn restart, they are deallocated and removed properly.
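The mismatch in the job above can be detected mechanically: the {{task-group-name}} annotation value ({{sleep-groupxxx}}) is not declared in the {{task-groups}} annotation (which only defines {{sleep-group}}). A minimal Go sketch of such a check — a hypothetical helper for illustration, not actual shim code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taskGroup mirrors only the fields of the task-groups annotation
// needed for this check (hypothetical type, not the shim's own).
type taskGroup struct {
	Name      string `json:"name"`
	MinMember int32  `json:"minMember"`
}

// taskGroupNameMatches reports whether taskGroupName refers to a group
// declared in the task-groups annotation JSON.
func taskGroupNameMatches(taskGroupName, taskGroupsJSON string) (bool, error) {
	var groups []taskGroup
	if err := json.Unmarshal([]byte(taskGroupsJSON), &groups); err != nil {
		return false, err
	}
	for _, g := range groups {
		if g.Name == taskGroupName {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	groups := `[{"name": "sleep-group", "minMember": 5}]`
	ok, _ := taskGroupNameMatches("sleep-groupxxx", groups)
	fmt.Println(ok) // prints "false" for the mismatched job above
}
```

Running this check at admission time (or on pod creation in the shim) would reject the misconfigured job up front instead of leaving the real pods stuck after a restart.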
> Pods not linked to placeholders are stuck in Running state if YK is restarted
> -----------------------------------------------------------------------------
>
> Key: YUNIKORN-1161
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1161
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: shim - kubernetes
> Reporter: Peter Bacsko
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]