[ https://issues.apache.org/jira/browse/YUNIKORN-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272534#comment-17272534 ]
Ayub Pathan commented on YUNIKORN-518:
--------------------------------------
In one of the instances, the scheduler was restarted 32 times.
{noformat}
NAMESPACE     NAME                                                              READY   STATUS             RESTARTS   AGE
default       batch-sleep-job-1-7lrd5                                           0/1     Completed          0          3h55m
default       batch-sleep-job-1-lw4t9                                           0/1     Completed          0          3h55m
default       batch-sleep-job-2-c95dg                                           0/1     Completed          0          3h54m
default       batch-sleep-job-2-vnfjb                                           0/1     Completed          0          3h54m
default       batch-sleep-job-2-x4mcz                                           0/1     Completed          0          3h54m
default       batch-sleep-job-2-ztnfq                                           0/1     Completed          0          3h54m
default       batch-sleep-job-3-59mnj                                           0/1     Pending            0          3h53m
default       batch-sleep-job-3-7tp5t                                           0/1     Pending            0          3h53m
default       batch-sleep-job-3-bm4fd                                           0/1     Pending            0          3h53m
default       batch-sleep-job-3-c4mxg                                           0/1     Pending            0          3h53m
default       batch-sleep-job-3-cljfj                                           0/1     Pending            0          3h53m
default       batch-sleep-job-3-gcvnp                                           0/1     Pending            0          3h53m
default       batch-sleep-job-3-gwgnn                                           0/1     Pending            0          3h53m
default       batch-sleep-job-3-kj88t                                           0/1     Pending            0          3h53m
default       batch-sleep-job-3-p8c7w                                           0/1     Pending            0          3h53m
default       batch-sleep-job-3-td575                                           0/1     Pending            0          3h53m
kube-system   calico-kube-controllers-db689dff4-jns7j                           1/1     Running            0          12d
kube-system   calico-node-5k8mv                                                 1/1     Running            0          12d
kube-system   calico-node-8trmk                                                 1/1     Running            0          12d
kube-system   calico-node-jxgsg                                                 1/1     Running            0          12d
kube-system   calico-node-qfhxd                                                 1/1     Running            0          12d
kube-system   calico-typha-7d9d44d8db-c6nvk                                     1/1     Running            0          12d
kube-system   calico-typha-7d9d44d8db-rq6kw                                     1/1     Running            0          12d
kube-system   cluster-autoscaler-f7499f8dc-cqpsl                                1/1     Running            1          12d
kube-system   coredns-58b7b9bd9d-bx2hz                                          1/1     Running            0          12d
kube-system   coredns-58b7b9bd9d-zbct9                                          1/1     Running            0          12d
kube-system   heapster-67c8c5b767-lhvp4                                         1/1     Running            0          12d
kube-system   kube-proxy-2fd7m                                                  1/1     Running            0          12d
kube-system   kube-proxy-d8bj9                                                  1/1     Running            0          12d
kube-system   kube-proxy-lx95n                                                  1/1     Running            0          12d
kube-system   kube-proxy-xktfn                                                  1/1     Running            0          12d
kube-system   kubernetes-dashboard-d88d55985-jnv98                              1/1     Running            0          12d
kube-system   monitoring-influxdb-6574bdd4bf-xl8hz                              1/1     Running            1          12d
kube-system   spot-instance-termination-handler-aws-node-termination-han2jm6w   1/1     Running            0          12d
kube-system   spot-instance-termination-handler-aws-node-termination-hanjr9sb   1/1     Running            0          12d
kube-system   spot-instance-termination-handler-aws-node-termination-hanjwpns   1/1     Running            0          12d
kube-system   spot-instance-termination-handler-aws-node-termination-hannhgr9   1/1     Running            0          12d
tiller        tiller-deploy-f897b44dd-8tcrk                                     1/1     Running            0          12d
yunikorn      yunikorn-scheduler-6577f789d8-pl76n                               1/2     CrashLoopBackOff   32         142m
{noformat}
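A listing like the one above can be scanned for crash-looping pods without reading every row. The following is only a minimal sketch, assuming the plain text output of `kubectl get pods --all-namespaces` with one pod per line and exactly six whitespace-separated columns. The function name `pods_over_restart_limit`, the default threshold, and the two-row sample are hypothetical and not part of YuniKorn or kubectl. Some kubectl versions append a last-restart time to the RESTARTS column, which this sketch does not handle.

```python
def pods_over_restart_limit(listing, limit=5):
    """Return (namespace, name, restarts) for pods restarted more than `limit` times."""
    flagged = []
    for line in listing.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a six-column pod row.
        if len(fields) != 6 or fields[0] == "NAMESPACE":
            continue
        namespace, name, _ready, _status, restarts, _age = fields
        if restarts.isdigit() and int(restarts) > limit:
            flagged.append((namespace, name, int(restarts)))
    return flagged


# Two rows copied from the listing above, used purely as sample input.
SAMPLE = """
NAMESPACE     NAME                                  READY   STATUS             RESTARTS   AGE
kube-system   cluster-autoscaler-f7499f8dc-cqpsl    1/1     Running            1          12d
yunikorn      yunikorn-scheduler-6577f789d8-pl76n   1/2     CrashLoopBackOff   32         142m
"""

if __name__ == "__main__":
    for namespace, name, restarts in pods_over_restart_limit(SAMPLE):
        print(f"{namespace}/{name}: {restarts} restarts")
```

With the default threshold only the scheduler pod is reported; lowering `limit` to 0 would also include pods that restarted once.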
> Scheduler restarts observed due to admission controller "FailedPostStartHook"
> -----------------------------------------------------------------------------
>
> Key: YUNIKORN-518
> URL: https://issues.apache.org/jira/browse/YUNIKORN-518
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Affects Versions: 0.10
> Reporter: Ayub Pathan
> Priority: Major
>
> {noformat}
> Name: yunikorn-scheduler-6577f789d8-vc5cc
> Namespace: yunikorn
> Priority: 0
> Node: ip-10-192-153-109.ca-central-1.compute.internal/10.192.153.109
> Start Time: Tue, 26 Jan 2021 19:17:12 -0800
> Labels: app=yunikorn
> component=yunikorn-scheduler
> pod-template-hash=6577f789d8
> release=yunikorn
> Annotations: cni.projectcalico.org/podIP: 100.100.166.78/32
> cni.projectcalico.org/podIPs: 100.100.166.78/32
> kubernetes.io/psp: eks.privileged
> Status: Running
> IP: 100.100.166.78
> IPs:
> IP: 100.100.166.78
> Controlled By: ReplicaSet/yunikorn-scheduler-6577f789d8
> Containers:
> yunikorn-scheduler-k8s:
> Container ID: docker://759f2b2f14ba37f46a42cdc59a5c51ed19d442ed717b81ee98d30177b7a184e6
> Image: container-dev.repo.cloudera.com/cloudera/yunikorn-scheduler:0.10.0-b9
> Image ID: docker-pullable://container-dev.repo.cloudera.com/cloudera/yunikorn-scheduler@sha256:878300a91cfd3b9d6dc515948afbfab23572a475b0df7006f06480ee06d1aceb
> Port: 9080/TCP
> Host Port: 0/TCP
> State: Running
> Started: Tue, 26 Jan 2021 19:18:01 -0800
> Last State: Terminated
> Reason: Error
> Exit Code: 1
> Started: Tue, 26 Jan 2021 19:17:33 -0800
> Finished: Tue, 26 Jan 2021 19:17:33 -0800
> Ready: True
> Restart Count: 3
> Limits:
> cpu: 4
> memory: 2Gi
> Requests:
> cpu: 200m
> memory: 1Gi
> Environment:
> NAMESPACE: yunikorn (v1:metadata.namespace)
> ADMISSION_CONTROLLER_IMAGE_REGISTRY: container-dev.repo.cloudera.com/cloudera/yunikorn-admission
> ADMISSION_CONTROLLER_IMAGE_TAG: 0.10.0-b9
> ADMISSION_CONTROLLER_IMAGE_PULL_POLICY: Always
> ADMISSION_CONTROLLER_IMAGE_PULL_SECRETS: [dockercreds]
> Mounts:
> /etc/yunikorn/ from config-volume (rw)
> /var/run/secrets/kubernetes.io/serviceaccount from yunikorn-admin-token-dnq4h (ro)
> yunikorn-scheduler-web:
> Container ID: docker://0b8205bb8292f193765bbc563ea10010106fd316257e523c3446c5685ee0d5bf
> Image: container-dev.repo.cloudera.com/cloudera/yunikorn-web:0.10.0-b9
> Image ID: docker-pullable://container-dev.repo.cloudera.com/cloudera/yunikorn-web@sha256:a64b986df2dc737958701838f41f9fae7f2e4a353a497949ba6b9e75b4b44b66
> Port: 9889/TCP
> Host Port: 0/TCP
> State: Running
> Started: Tue, 26 Jan 2021 19:17:17 -0800
> Ready: True
> Restart Count: 0
> Limits:
> cpu: 200m
> memory: 500Mi
> Requests:
> cpu: 100m
> memory: 100Mi
> Environment: <none>
> Mounts:
> /var/run/secrets/kubernetes.io/serviceaccount from yunikorn-admin-token-dnq4h (ro)
> Conditions:
> Type Status
> Initialized True
> Ready True
> ContainersReady True
> PodScheduled True
> Volumes:
> config-volume:
> Type: ConfigMap (a volume populated by a ConfigMap)
> Name: yunikorn-configs
> Optional: false
> yunikorn-admin-token-dnq4h:
> Type: Secret (a volume populated by a Secret)
> SecretName: yunikorn-admin-token-dnq4h
> Optional: false
> QoS Class: Burstable
> Node-Selectors: role.node.kubernetes.io/liftie-infra=true
> Tolerations: CriticalAddonsOnly op=Exists
> node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
> node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
> role.node.kubernetes.io/liftie-infra=true:NoSchedule
> Events:
> Type  Reason  Age  From  Message
> ----  ------  ----  ----  -------
> Normal  Scheduled  61s  default-scheduler  Successfully assigned yunikorn/yunikorn-scheduler-6577f789d8-vc5cc to ip-10-192-153-109.ca-central-1.compute.internal
> Normal  Pulling  57s  kubelet  Pulling image "container-dev.repo.cloudera.com/cloudera/yunikorn-web:0.10.0-b9"
> Normal  Started  56s  kubelet  Started container yunikorn-scheduler-web
> Normal  Created  56s  kubelet  Created container yunikorn-scheduler-web
> Normal  Pulled  56s  kubelet  Successfully pulled image "container-dev.repo.cloudera.com/cloudera/yunikorn-web:0.10.0-b9"
> Warning  FailedPreStopHook  55s (x2 over 58s)  kubelet  Exec lifecycle hook ([/bin/sh /admission_util.sh delete]) for Container "yunikorn-scheduler-k8s" in Pod "yunikorn-scheduler-6577f789d8-vc5cc_yunikorn(082e1cc7-8765-4aa3-baac-48e3b048cfc6)" failed - error: command '/bin/sh /admission_util.sh delete' exited with 126: , message: "cannot exec in a stopped state: unknown\r\n"
> Normal  Killing  55s (x2 over 58s)  kubelet  FailedPostStartHook
> Warning  BackOff  53s (x2 over 54s)  kubelet  Back-off restarting failed container
> Normal  Pulling  41s (x3 over 60s)  kubelet  Pulling image "container-dev.repo.cloudera.com/cloudera/yunikorn-scheduler:0.10.0-b9"
> Warning  FailedPostStartHook  40s (x3 over 58s)  kubelet  Exec lifecycle hook ([/bin/sh /admission_util.sh create]) for Container "yunikorn-scheduler-k8s" in Pod "yunikorn-scheduler-6577f789d8-vc5cc_yunikorn(082e1cc7-8765-4aa3-baac-48e3b048cfc6)" failed - error: command '/bin/sh /admission_util.sh create' exited with 137: , message: ""
> Normal  Started  40s (x3 over 58s)  kubelet  Started container yunikorn-scheduler-k8s
> Normal  Created  40s (x3 over 58s)  kubelet  Created container yunikorn-scheduler-k8s
> Normal  Pulled  40s (x3 over 58s)  kubelet  Successfully pulled image "container-dev.repo.cloudera.com/cloudera/yunikorn-scheduler:0.10.0-b9"
> {noformat}
> This is not a blocker, but the scheduler was restarted multiple (3) times,
> hence this report. This could be due to an issue in the admission controller
> start script.
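
The two hook failures in the events report exit codes 126 and 137. By the usual shell and container runtime convention, 126 means the command could not be invoked (here the runtime reported "cannot exec in a stopped state"), and a code above 128 means the process was ended by signal number code minus 128, so 137 corresponds to signal 9 (SIGKILL). That suggests the `create` hook was killed rather than failing on its own, although the report does not confirm why. A minimal sketch of that decoding follows; the function name `describe_exit_code` is hypothetical and the wording of the descriptions is mine.

```python
import signal


def describe_exit_code(code):
    """Explain a process exit code using the usual POSIX shell conventions."""
    if code == 0:
        return "success"
    if code == 126:
        return "command found but could not be executed"
    if code == 127:
        return "command not found"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number unknown on this platform; fall back to the raw number.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return "failure reported by the program itself"


if __name__ == "__main__":
    # The two codes seen in the FailedPreStopHook and FailedPostStartHook events.
    for code in (126, 137):
        print(code, "->", describe_exit_code(code))
```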