Our environment: k8s v1.7.10 + CoreOS 1198.5.0 + bare metal + Calico
1.2.0, and I'm running the kubelet with rkt.
At 01:02:50 on Jan 3, 2018, one of my workers entered NotReady status
unexpectedly. Can anybody help me analyze this? I have had several workers
enter NotReady status unexpectedly before; this time I want to find the
real culprit.
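For reference, this is how I confirm what the cluster itself reports for the
node (the Conditions section of the describe output shows the Ready condition
and the reason it flipped):
$ kubectl get nodes
$ kubectl describe node ae01.pek.prod.com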
The kubelet log of the worker at that time:
$ systemctl status kubelet
* kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2018-01-01 18:36:42 CST; 1 day 17h ago
     Docs: https://github.com/kubernetes/kubernetes
  Process: 29130 ExecStartPre=/usr/bin/mkdir -p /opt/cni/bin (code=exited, status=0/SUCCESS)
  Process: 29117 ExecStartPre=/usr/bin/mkdir -p /etc/cni/net.d (code=exited, status=0/SUCCESS)
 Main PID: 29137 (kubelet)
    Tasks: 72
   Memory: 253.5M
      CPU: 5h 41min 45.736s
   CGroup: /system.slice/kubelet.service
           |-29137 /kubelet --address=0.0.0.0 --allow-privileged=true --cluster-dns=192.168.192.10 --cluster-domain=cluster.local --cloud-provider= --port=10250 --lock-file=/var/run/lock/kubelet.lock --exit-on-lock-contention --node-labels=worker=true --pod-manifest-path=/etc/kubernetes/manifests --kubeconfig=/etc/kubernetes/kubeconfig.yaml --require-kubeconfig=true --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin --logtostderr=true
           `-29355 journalctl -k -f
Jan 03 01:00:30 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:00:30.988485 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:00:41 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:00:41.094537 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:00:51 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:00:51.193735 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:01:01 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:01:01.314097 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:01:11 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:01:11.406249 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:01:21 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:01:21.510539 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:01:31 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:01:31.607629 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:01:41 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:01:41.707686 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:01:51 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:01:51.806283 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:02:01 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:02:01.905585 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Then the kubelet logs stopped here.
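As far as I can tell, those allocatableNodeFs.available warnings just mean the
eviction manager has no stats for that signal; they repeat every ~10s right up
to the point where the logs stop, so I doubt they are the trigger. To
cross-check what the kubelet last reported as allocatable and capacity:
$ kubectl get node ae01.pek.prod.com -o jsonpath='{.status.allocatable}'
$ kubectl get node ae01.pek.prod.com -o jsonpath='{.status.capacity}'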
The kube-controller-manager logs at that time:
I0103 01:02:50.475290 1 controller_utils.go:285] Recording status change NodeNotReady event message for node ae01.pek.prod.com
I0103 01:02:50.475329 1 controller_utils.go:203] Update ready status of pods on node [ae01.pek.prod.com]
I0103 01:02:50.475405 1 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ae01.pek.prod.com", UID:"f739dd60-49d7-11e7-b36b-1866dae7138c", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node ae01.pek.prod.com status is now: NodeNotReady
I0103 01:02:50.582487 1 controller_utils.go:220] Updating ready status of pod ceph-admin to false
I0103 01:02:50.784024 1 controller_utils.go:220] Updating ready status of pod ceph-mon-xtc5j to false
I0103 01:02:50.993792 1 controller_utils.go:220] Updating ready status of pod ceph-osdaa-fgpt6 to false
I0103 01:02:51.152119 1 controller_utils.go:220] Updating ready status of pod ceph-osdab-6bw67 to false
I0103 01:02:51.319113 1 controller_utils.go:220] Updating ready status of pod ceph-osdac-0h3nj to false
I0103 01:02:51.502803 1 controller_utils.go:220] Updating ready status of pod ceph-osdad-p5cl7 to false
I0103 01:02:59.024278 1 controller_utils.go:220] Updating ready status of pod kube-proxy-ae01.pek.prod.com to false
I0103 01:02:59.299461 1 controller_utils.go:220] Updating ready status of pod node-exporter-9tww1 to false
W0103 01:05:01.654148 1 reflector.go:323] k8s.io/kubernetes/pkg/client/informers/informers_generated/externalversions/factory.go:72: watch of *v1beta1.ThirdPartyResource ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [711850627/711848663]) [711851626]
I0103 01:07:14.592488 1 nodecontroller.go:644] Node is unresponsive. Adding Pods on Node ae01.pek.prod.com to eviction queues: 2018-01-03 01:07:14.592477202 +0800 CST is later than 2018-01-03 01:02:50.475279459 +0800 CST + 4m20s
I0103 01:07:14.759150 1 controller_utils.go:274] Recording Deleting all Pods from Node ae01.pek.prod.com. event message for node ae01.pek.prod.com
I0103 01:07:14.759246 1 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ae01.pek.prod.com", UID:"f739dd60-49d7-11e7-b36b-1866dae7138c", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'DeletingAllPods' Node ae01.pek.prod.com event: Deleting all Pods from Node ae01.pek.prod.com.
I0103 01:07:15.047285 1 controller_utils.go:89] Starting deletion of pod ceph-sas/ceph-admin
I0103 01:07:15.047426 1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ceph-sas", Name:"ceph-admin", UID:"273810cc-c558-11e7-987d-1866dae7138c", APIVersion:"v1", ResourceVersion:"711850606", FieldPath:""}): type: 'Normal' reason: 'NodeControllerEviction' Marking for deletion Pod ceph-admin from Node ae01.pek.prod.com
I0103 01:07:23.419960 1 controller_utils.go:89] Starting deletion of pod default/kube-proxy-ae01.pek.prod.com
I0103 01:07:23.420062 1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"kube-proxy-ae01.pek.prod.com", UID:"e302d011-d4da-11e7-a89d-1866dae7138c", APIVersion:"v1", ResourceVersion:"711850741", FieldPath:""}): type: 'Normal' reason: 'NodeControllerEviction' Marking for deletion Pod kube-proxy-ae01.pek.prod.com from Node ae01.pek.prod.com
I0103 01:07:24.104385 1 nodecontroller.go:431] Pods awaiting deletion due to NodeController eviction
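If I read the nodecontroller line correctly, the 4m20s is just the default
--pod-eviction-timeout (5m0s) minus the default --node-monitor-grace-period
(40s):
5m0s - 40s = 4m20s
01:02:50 (NotReady transition) + 4m20s = 01:07:10 < 01:07:14 (eviction starts)
So the controller-manager side looks like normal eviction behavior, and the
real problem is the kubelet going silent.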
At the same time, accessing ports 10250 and 10255 on the node got no response:
# wget https://localhost:10250/metrics --no-check-certificate
--2018-01-03 10:53:36-- https://localhost:10250/metrics
Resolving localhost... 127.0.0.1
Connecting to localhost|127.0.0.1|:10250... connected.
Unable to establish SSL connection.
# wget http://localhost:10255/healthz
--2018-01-03 15:10:09-- http://localhost:10255/healthz
Resolving localhost... 127.0.0.1
Connecting to localhost|127.0.0.1|:10255... connected.
HTTP request sent, awaiting response... ^C
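For anyone reproducing this, a bounded variant of the same checks (5-second
client timeout via curl's -m flag) gives up after 5 seconds instead of
hanging indefinitely:
# curl -m 5 -k https://localhost:10250/healthz
# curl -m 5 http://localhost:10255/healthz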
And when I straced the kubelet process, it was just hanging there:
# strace -p 29137
Process 29137 attached
futex(0x92873b0, FUTEX_WAIT, 0, NULL^CProcess 29137 detached
<detached ...>
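Since the kubelet is a Go binary, my plan for the next occurrence is to send
it SIGQUIT: by default the Go runtime then dumps all goroutine stacks to
stderr (which systemd should capture in the journal) before exiting, which
ought to show where it is stuck. This assumes the kubelet does not install
its own SIGQUIT handler; it does kill the process, but that seems acceptable
once the node is already NotReady:
# kill -QUIT 29137
# journalctl -u kubelet.service -e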
Any help would be appreciated.
likun