[ https://issues.apache.org/jira/browse/YUNIKORN-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512406#comment-17512406 ]

Peter Bacsko commented on YUNIKORN-560:
---------------------------------------

I tried to repro this problem with Minikube version 1.22.0.

I tried restarting YK both with {{kubectl scale deployment}} and with {{kubectl delete pod}}, but I haven't seen the reported behaviour: no placeholder pods were deleted.
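
For reference, the restart was done roughly like this (deployment/pod names are taken from the listing below; adjust them to the actual install):
{noformat}
# restart by scaling the scheduler deployment down and back up
kubectl scale deployment yunikorn-scheduler --replicas=0
kubectl scale deployment yunikorn-scheduler --replicas=1

# or by deleting the scheduler pod and letting the deployment recreate it
kubectl delete pod yunikorn-scheduler-77dd7c665b-jcmt6
{noformat}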

What I did see, though, is that after the restart, placeholders do not time out and just keep running:
{noformat}
default                batch-sleep-job-3-5drfs                          0/1     Pending     0          37m
default                batch-sleep-job-3-cfl7c                          0/1     Pending     0          37m
default                batch-sleep-job-3-fvddw                          0/1     Pending     0          37m
default                batch-sleep-job-3-jqhnb                          0/1     Pending     0          37m
default                batch-sleep-job-3-v5qz4                          0/1     Pending     0          37m
default                tg-sleep-group-batch-sleep-job-3-0               1/1     Running     0          37m
default                tg-sleep-group-batch-sleep-job-3-1               1/1     Running     0          37m
default                tg-sleep-group-batch-sleep-job-3-2               1/1     Running     0          37m
default                tg-sleep-group-batch-sleep-job-3-3               1/1     Running     0          37m
default                tg-sleep-group-batch-sleep-job-3-4               1/1     Running     0          37m
default                tg-sleep-group-batch-sleep-job-3-5               1/1     Running     0          37m
default                yunikorn-admission-controller-78c775cfd9-6pp8d   1/1     Running     2          158m
default                yunikorn-scheduler-77dd7c665b-jcmt6              2/2     Running     0          11m
{noformat}

Eventually, the {{tg-sleep}} placeholder pods disappeared, but the {{batch-sleep-job}} pods did not transition into the {{Running}} state:
{noformat}
default                batch-sleep-job-3-5drfs                          0/1     Pending     0          57m
default                batch-sleep-job-3-cfl7c                          0/1     Pending     0          57m
default                batch-sleep-job-3-fvddw                          0/1     Pending     0          57m
default                batch-sleep-job-3-jqhnb                          0/1     Pending     0          57m
default                batch-sleep-job-3-v5qz4                          0/1     Pending     0          57m
default                yunikorn-admission-controller-78c775cfd9-6pp8d   1/1     Running     2          178m
default                yunikorn-scheduler-77dd7c665b-jcmt6              2/2     Running     0          31m
{noformat}

As a "bonus", we still have the total container count problem:
{noformat}
2022-03-25T13:59:31.633Z        WARN    metrics/metrics_collector.go:85 Could not calculate the totalContainersRunning. {"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T13:59:35.018Z        INFO    shim/scheduler.go:356   No outstanding apps found for a while   {"timeout": "2m0s"}
2022-03-25T14:00:31.633Z        WARN    metrics/metrics_collector.go:85 Could not calculate the totalContainersRunning. {"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T14:01:31.634Z        WARN    metrics/metrics_collector.go:85 Could not calculate the totalContainersRunning. {"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T14:01:35.021Z        INFO    shim/scheduler.go:356   No outstanding apps found for a while   {"timeout": "2m0s"}
2022-03-25T14:02:31.638Z        WARN    metrics/metrics_collector.go:85 Could not calculate the totalContainersRunning. {"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T14:03:31.634Z        WARN    metrics/metrics_collector.go:85 Could not calculate the totalContainersRunning. {"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T14:03:35.022Z        INFO    shim/scheduler.go:356   No outstanding apps found for a while   {"timeout": "2m0s"}
{noformat}
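
The warning itself is consistent with a simple underflow: the running count appears to be derived as allocated minus released, which goes negative when the allocation counters restart from zero while six releases are still recorded. A minimal sketch of that kind of check (names are illustrative, not the actual code in {{metrics/metrics_collector.go}}):
{code:go}
package main

import "log"

// totalContainersRunning mirrors the kind of arithmetic that seems to be
// behind the warning above: the running count is derived as allocated minus
// released, and the subtraction is rejected when it would go negative.
// Illustrative names only; not the real collector implementation.
func totalContainersRunning(allocated, released int) (int, bool) {
	running := allocated - released
	if running < 0 {
		log.Printf("Could not calculate the totalContainersRunning. "+
			"{\"allocatedContainers\": %d, \"releasedContainers\": %d}",
			allocated, released)
		return 0, false
	}
	return running, true
}

func main() {
	// State after the scheduler restart in the log above:
	totalContainersRunning(0, 6) // 0 - 6 would be negative, so it warns
}
{code}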

To me it looks like we have three problems:
1) Placeholder timers are re-armed with the full, fixed timeout after recovery; we don't account for the time the pod has already spent in the {{Running}} state (see the sketch after this list).
2) The real application pods remain stuck in {{Pending}} even after the placeholders are gone.
3) The allocation counters are reset to zero on restart, so the metrics are no longer set properly.
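
For problem 1, the likely direction is to deduct the time a recovered placeholder has already spent running from the configured timeout before re-arming the timer. A rough sketch of that idea, using a hypothetical helper (this is not the actual shim code):
{code:go}
package main

import (
	"fmt"
	"time"
)

// remainingPlaceholderTimeout returns how much of the configured placeholder
// timeout is left for a pod recovered in Running state. startTime would come
// from the pod status reported by Kubernetes. Hypothetical helper for
// illustration, not the actual shim implementation.
func remainingPlaceholderTimeout(timeout time.Duration, startTime time.Time) time.Duration {
	elapsed := time.Since(startTime)
	if elapsed >= timeout {
		// Already overdue: the placeholder should be released immediately
		// instead of getting a fresh full-length timer.
		return 0
	}
	return timeout - elapsed
}

func main() {
	// A placeholder that has been Running for 37m against a 15m timeout
	// must not receive a new 15m timer after recovery.
	started := time.Now().Add(-37 * time.Minute)
	fmt.Println(remainingPlaceholderTimeout(15*time.Minute, started)) // 0s
}
{code}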

> Yunikorn recovery deletes existing placeholders
> -----------------------------------------------
>
>                 Key: YUNIKORN-560
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-560
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>            Reporter: Kinga Marton
>            Assignee: Peter Bacsko
>            Priority: Major
>              Labels: recovery
>
> On recovery, Yunikorn may intermittently delete placeholder pods. To 
> reproduce, submit a gang job with minMembers > job parallelism (to guarantee 
> that there are some placeholders running) and then delete yunikorn scheduler 
> pod. 
> After recovery, there may not be any placeholder pods remaining.


