[
https://issues.apache.org/jira/browse/YUNIKORN-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Peter Bacsko updated YUNIKORN-1161:
-----------------------------------
Description:
If we create pods where the name of the task group does not match the
{{task-group-name}} annotation, then the real pods will not transition to
the {{Running}} state when the placeholder pods expire.
For example, modify the sleep batch job like this:
{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-sleep-job-9
spec:
  completions: 5
  parallelism: 5
  template:
    metadata:
      labels:
        app: sleep
        applicationId: "batch-sleep-job-9"
        queue: root.sandbox
      annotations:
        yunikorn.apache.org/task-group-name: sleep-groupxxx
        yunikorn.apache.org/task-groups: |-
          [{
            "name": "sleep-group",
            "minMember": 5,
            "minResource": {
              "cpu": "100m",
              "memory": "2000M"
            },
            "nodeSelector": {},
            "tolerations": []
          }]
...
{noformat}
Submit the job and restart Yunikorn when the placeholders are already running.
This results in "batch-sleep-job-9-nnnnn" pods that never transition to
{{Running}} and have to be terminated manually.
{noformat}
$ kubectl get pods -A | grep -E "(batch-sleep-job-9|yunikorn)"
default   batch-sleep-job-9-hgxxl                          0/1   Pending   0   20m
default   batch-sleep-job-9-j6twt                          0/1   Pending   0   20m
default   batch-sleep-job-9-l4jhm                          0/1   Pending   0   20m
default   batch-sleep-job-9-swlm4                          0/1   Pending   0   20m
default   batch-sleep-job-9-z6wqx                          0/1   Pending   0   20m
default   yunikorn-admission-controller-78c775cfd9-6pp8d   1/1   Running   4   3d22h
default   yunikorn-scheduler-77dd7c665b-f8kkn              2/2   Running   0   18m
{noformat}
Note that without a YuniKorn restart, they are deallocated and removed properly.
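The mismatch in the job above can be detected mechanically: the {{task-group-name}} annotation value ({{sleep-groupxxx}}) is not declared in the {{task-groups}} annotation (which only defines {{sleep-group}}). A minimal Go sketch of such a check — a hypothetical helper for illustration, not actual shim code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// taskGroup mirrors only the fields of the task-groups annotation
// needed for this check (hypothetical type, not the shim's own).
type taskGroup struct {
	Name      string `json:"name"`
	MinMember int32  `json:"minMember"`
}

// taskGroupNameMatches reports whether taskGroupName refers to a group
// declared in the task-groups annotation JSON.
func taskGroupNameMatches(taskGroupName, taskGroupsJSON string) (bool, error) {
	var groups []taskGroup
	if err := json.Unmarshal([]byte(taskGroupsJSON), &groups); err != nil {
		return false, err
	}
	for _, g := range groups {
		if g.Name == taskGroupName {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	groups := `[{"name": "sleep-group", "minMember": 5}]`
	ok, _ := taskGroupNameMatches("sleep-groupxxx", groups)
	fmt.Println(ok) // prints "false" for the mismatched job above
}
```

Running this check at admission time (or on pod creation in the shim) would reject the misconfigured job up front instead of leaving the real pods stuck after a restart.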
> Pods not linked to placeholders are stuck in Running state if YK is restarted
> -----------------------------------------------------------------------------
>
> Key: YUNIKORN-1161
> URL: https://issues.apache.org/jira/browse/YUNIKORN-1161
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: shim - kubernetes
> Reporter: Peter Bacsko
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]