Our environment: Kubernetes v1.7.10 + CoreOS 1198.5.0 + bare metal + Calico 1.2.0, and I'm running the kubelet with rkt (via kubelet-wrapper).
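(For reference, the versions each node reports can be read straight from the API; this is just the check I'd run, not output captured during the incident:)

# Kubelet version, container runtime, OS image and kernel as the API server sees them
kubectl describe node ae01.pek.prod.com | grep -A 10 'System Info'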
At 01:02:50 on 3 Jan 2018, one of my workers entered NotReady status unexpectedly. Can anybody help me analyze this? Several workers have entered NotReady status unexpectedly before; this time I want to find the real culprit.

The kubelet log of the worker at that time:

$ systemctl status kubelet
* kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2018-01-01 18:36:42 CST; 1 day 17h ago
     Docs: https://github.com/kubernetes/kubernetes
  Process: 29130 ExecStartPre=/usr/bin/mkdir -p /opt/cni/bin (code=exited, status=0/SUCCESS)
  Process: 29117 ExecStartPre=/usr/bin/mkdir -p /etc/cni/net.d (code=exited, status=0/SUCCESS)
 Main PID: 29137 (kubelet)
    Tasks: 72
   Memory: 253.5M
      CPU: 5h 41min 45.736s
   CGroup: /system.slice/kubelet.service
           |-29137 /kubelet --address=0.0.0.0 --allow-privileged=true --cluster-dns=192.168.192.10 --cluster-domain=cluster.local --cloud-provider= --port=10250 --lock-file=/var/run/lock/kubelet.lock --exit-on-lock-contention --node-labels=worker=true --pod-manifest-path=/etc/kubernetes/manifests --kubeconfig=/etc/kubernetes/kubeconfig.yaml --require-kubeconfig=true --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin --logtostderr=true
           `-29355 journalctl -k -f

Jan 03 01:00:30 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:00:30.988485 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:00:41 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:00:41.094537 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:00:51 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:00:51.193735 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:01:01 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:01:01.314097 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:01:11 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:01:11.406249 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:01:21 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:01:21.510539 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:01:31 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:01:31.607629 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:01:41 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:01:41.707686 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:01:51 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:01:51.806283 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available
Jan 03 01:02:01 ae01.pek.prod.com kubelet-wrapper[29137]: W0102 09:02:01.905585 29137 helpers.go:793] eviction manager: no observation found for eviction signal allocatableNodeFs.available

Then the kubelet logs stopped there.
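For completeness, this is roughly how I pull the node's Ready condition (including lastHeartbeatTime) from the API afterwards; just the commands, not output captured during the incident:

# All node conditions (Ready, OutOfDisk, MemoryPressure, DiskPressure, ...)
kubectl describe node ae01.pek.prod.com

# Or only the Ready condition, via a jsonpath filter
kubectl get node ae01.pek.prod.com \
  -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'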
The kube-controller-manager logs at that time:

I0103 01:02:50.475290 1 controller_utils.go:285] Recording status change NodeNotReady event message for node ae01.pek.prod.com
I0103 01:02:50.475329 1 controller_utils.go:203] Update ready status of pods on node [ae01.pek.prod.com]
I0103 01:02:50.475405 1 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ae01.pek.prod.com", UID:"f739dd60-49d7-11e7-b36b-1866dae7138c", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node ae01.pek.prod.com status is now: NodeNotReady
I0103 01:02:50.582487 1 controller_utils.go:220] Updating ready status of pod ceph-admin to false
I0103 01:02:50.784024 1 controller_utils.go:220] Updating ready status of pod ceph-mon-xtc5j to false
I0103 01:02:50.993792 1 controller_utils.go:220] Updating ready status of pod ceph-osdaa-fgpt6 to false
I0103 01:02:51.152119 1 controller_utils.go:220] Updating ready status of pod ceph-osdab-6bw67 to false
I0103 01:02:51.319113 1 controller_utils.go:220] Updating ready status of pod ceph-osdac-0h3nj to false
I0103 01:02:51.502803 1 controller_utils.go:220] Updating ready status of pod ceph-osdad-p5cl7 to false
I0103 01:02:59.024278 1 controller_utils.go:220] Updating ready status of pod kube-proxy-ae01.pek.prod.com to false
I0103 01:02:59.299461 1 controller_utils.go:220] Updating ready status of pod node-exporter-9tww1 to false
W0103 01:05:01.654148 1 reflector.go:323] k8s.io/kubernetes/pkg/client/informers/informers_generated/externalversions/factory.go:72: watch of *v1beta1.ThirdPartyResource ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [711850627/711848663]) [711851626]
I0103 01:07:14.592488 1 nodecontroller.go:644] Node is unresponsive. Adding Pods on Node ae01.pek.prod.com to eviction queues: 2018-01-03 01:07:14.592477202 +0800 CST is later than 2018-01-03 01:02:50.475279459 +0800 CST + 4m20s
I0103 01:07:14.759150 1 controller_utils.go:274] Recording Deleting all Pods from Node ae01.pek.prod.com. event message for node ae01.pek.prod.com
I0103 01:07:14.759246 1 event.go:218] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"ae01.pek.prod.com", UID:"f739dd60-49d7-11e7-b36b-1866dae7138c", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'DeletingAllPods' Node ae01.pek.prod.com event: Deleting all Pods from Node ae01.pek.prod.com.
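If I read this right, the 01:02:50 NotReady transition and the eviction decision at 01:07:14 line up with the node controller's timing flags (the defaults are 40s of missed kubelet heartbeats before NotReady and 5m of NotReady before pod eviction). A quick way to confirm what our controller-manager actually runs with, assuming it shows up in the master's process list:

# On a master: show the node-monitor / pod-eviction flags the running
# kube-controller-manager was started with (1.7 defaults:
# --node-monitor-grace-period=40s, --pod-eviction-timeout=5m0s)
ps -ef | grep '[k]ube-controller-manager' | tr ' ' '\n' | grep -E 'node-monitor|pod-eviction'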
I0103 01:07:15.047285 1 controller_utils.go:89] Starting deletion of pod ceph-sas/ceph-admin
I0103 01:07:15.047426 1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"ceph-sas", Name:"ceph-admin", UID:"273810cc-c558-11e7-987d-1866dae7138c", APIVersion:"v1", ResourceVersion:"711850606", FieldPath:""}): type: 'Normal' reason: 'NodeControllerEviction' Marking for deletion Pod ceph-admin from Node ae01.pek.prod.com
I0103 01:07:23.419960 1 controller_utils.go:89] Starting deletion of pod default/kube-proxy-ae01.pek.prod.com
I0103 01:07:23.420062 1 event.go:218] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"kube-proxy-ae01.pek.prod.com", UID:"e302d011-d4da-11e7-a89d-1866dae7138c", APIVersion:"v1", ResourceVersion:"711850741", FieldPath:""}): type: 'Normal' reason: 'NodeControllerEviction' Marking for deletion Pod kube-proxy-ae01.pek.prod.com from Node ae01.pek.prod.ucarinc.com
I0103 01:07:24.104385 1 nodecontroller.go:431] Pods awaiting deletion due to NodeController eviction

Meanwhile, accessing ports 10250 and 10255 on the node got no response:

# wget https://localhost:10250/metrics --no-check-certificate
--2018-01-03 10:53:36-- https://localhost:10250/metrics
Resolving localhost... 127.0.0.1
Connecting to localhost|127.0.0.1|:10250... connected.
Unable to establish SSL connection.

# wget http://localhost:10255/healthz
--2018-01-03 15:10:09-- http://localhost:10255/healthz
Resolving localhost... 127.0.0.1
Connecting to localhost|127.0.0.1|:10255... connected.
HTTP request sent, awaiting response... ^C

And stracing the kubelet process shows it is just hanging there:

# strace -p 29137
Process 29137 attached
futex(0x92873b0, FUTEX_WAIT, 0, NULL^CProcess 29137 detached
 <detached ...>

Any help will be appreciated.

likun
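P.S. Next time a kubelet wedges like this, I plan to grab a goroutine dump before restarting it. This is only a sketch, assuming default Go runtime signal handling and the PID/unit name from the systemctl output above:

# The kubelet is a Go binary: on SIGQUIT the Go runtime should dump every
# goroutine's stack to stderr (which ends up in the journal) and then exit.
# Destructive -- the kubelet dies (systemd restarts it if Restart= is set) --
# but the dump should show exactly which call is stuck behind that futex.
kill -QUIT 29137   # 29137 = kubelet Main PID from systemctl status above

# Collect the dump from the journal afterwards
journalctl -u kubelet.service --since "10 minutes ago" > /tmp/kubelet-goroutines.txt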