[JIRA] (JENKINS-54540) Pods stuck in error state

[email protected] (JIRA) Thu, 08 Nov 2018 04:11:34 -0800

Daniel Watrous created an issue

Issue Type:	Bug
Assignee:	Carlos Sanchez
Attachments:	build-job-console-output.txt, jnlp-container-log-error.txt, jnlp-container-log-healthy.txt
Components:	kubernetes-plugin
Created:	2018-11-08 12:10
Environment:	I am running Jenkins in Kubernetes with the kubernetes plugin. Versions as follows:

~ # kubectl version
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:17:28Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.4", GitCommit:"5ca598b4ba5abb89bb773071ce452e33fb66339d", GitTreeState:"clean", BuildDate:"2018-06-06T08:00:59Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Jenkins ver. 2.138.2 from https://hub.docker.com/r/jenkins/jenkins/

[centos@k8s-master-0 ~]$ sudo docker version
Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:21:36 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.03.2-ce
 API version:  1.27 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:21:36 2017
 OS/Arch:      linux/amd64
 Experimental: false
Labels:	jenkins kuberenetes-plugin kuberentes jnlp-slave jnlp plugin
Priority:	Minor
Reporter:	Daniel Watrous

The majority of my builds run as expected and we run many builds per day. The podTemplate for my builds is:

podTemplate(cloud: 'k8s-houston', label: 'api-build', yaml: """
apiVersion: v1
kind: Pod
metadata:
  name: maven
spec:
  containers:
  - name: maven
    image: maven:3-jdk-8-alpine
    volumeMounts:
      - name: volume-0
        mountPath: /mvn/.m2nrepo
    command:
    - cat
    tty: true
    resources:
      requests:
        memory: "512Mi"
        cpu: "500m"
    securityContext:
      runAsUser: 10000
      fsGroup: 10000
""",
  containers: [
    containerTemplate(name: 'jnlp', image: 'jenkins/jnlp-slave:3.23-1-alpine', args: '${computer.jnlpmac} ${computer.name}', resourceRequestCpu: '250m', resourceRequestMemory: '512Mi'),
    containerTemplate(name: 'pmd', image: 'stash.trinet-devops.com:8443/pmd:pmd-bin-5.5.4', alwaysPullImage: false, ttyEnabled: true, command: 'cat'),
    containerTemplate(name: 'owasp-zap', image: 'stash.trinet-devops.com:8443/owasp-zap:2.7.0', ttyEnabled: true, command: 'cat'),
    containerTemplate(name: 'kubectl', image: 'lachlanevenson/k8s-kubectl:v1.8.7', ttyEnabled: true, command: 'cat'),
    containerTemplate(name: 'dind', image: 'docker:18.01.0-ce-dind', privileged: true, resourceRequestCpu: '20m', resourceRequestMemory: '512Mi',),
    containerTemplate(name: 'docker-cmds', image: 'docker:18.01.0-ce', ttyEnabled: true, command: 'cat', envVars: [envVar(key: 'DOCKER_HOST', value: 'tcp://localhost:2375')]),
  ],
  volumes: [
    persistentVolumeClaim(claimName: 'jenkins-pv-claim', mountPath: '/mvn/.m2nrepo'),
    emptyDirVolume(mountPath: '/var/lib/docker', memory: false)
  ]
)

However, a build Pod sometimes gets stuck in the Error state in Kubernetes:

~ # kubectl get pod -o wide
NAME                                  READY     STATUS    RESTARTS   AGE       IP               NODE
jenkins-deployment-7849487c9b-nlhln   2/2       Running   4          12d       10.233.92.12     k8s-node-hm-3
jenkins-slave-7tj0d-ckwbs             11/11     Running   0          31s       10.233.69.176    k8s-node-1
jenkins-slave-7tj0d-qn3s6             11/11     Running   0          2m        10.233.77.230    k8s-node-hm-2
jenkins-slave-gz4pw-2dnn5             6/7       Error     0          2d        10.233.123.239   k8s-node-hm-1
jenkins-slave-m825p-1hjt7             5/5       Running   0          1m        10.233.123.196   k8s-node-hm-1
jenkins-slave-r59w1-qs283             6/7       Error     0          6d        10.233.76.104    k8s-node-2

You can see from the above listing of current pods that one Pod has been sitting around in Error state for 6 days. I have never seen a Pod in this state recover or get cleaned up. Manual intervention is always necessary.
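
Until such Pods are cleaned up automatically, the manual step can at least be narrowed to the stuck agents. The following is a minimal sketch, not part of the plugin: it assumes the default "kubectl get pod" column layout shown above (NAME, READY, STATUS, ...) and the jenkins-slave- name prefix from that listing. The function name error_agent_pods is hypothetical.

```shell
# Reads `kubectl get pod` output on stdin and prints the names of agent
# pods whose STATUS column is "Error".
# $1 (optional): pod name prefix; defaults to the prefix seen in this report.
error_agent_pods() {
  awk -v prefix="${1:-jenkins-slave-}" '
    # $1 is NAME and $3 is STATUS in the default and -o wide layouts.
    index($1, prefix) == 1 && $3 == "Error" { print $1 }
  '
}
```

For example, "kubectl get pod | error_agent_pods" prints one stuck Pod name per line for review, and those names can then be passed to "kubectl delete pod" once the logs have been saved.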

When I describe the pod, I see that the "jnlp" container is in a bad state (snippet provided):

~ # kubectl describe pod jenkins-slave-r59w1-qs283
Name:         jenkins-slave-r59w1-qs283
Namespace:    jenkins
Node:         k8s-node-2/10.0.40.9
Start Time:   Thu, 01 Nov 2018 12:20:06 +0000
Labels:       jenkins=slave
              jenkins/api-build=true
Annotations:  kubernetes.io/limit-ranger=LimitRanger plugin set: cpu request for container owasp-zap; cpu limit for container owasp-zap; cpu limit for container dind; cpu limit for container maven; cpu request for ...
Status:       Running
IP:           10.233.76.104
Containers:
  ...
  jnlp:
    Container ID:  docker://a08af23511d01c5f9a249c7f8f8383040a5cc70c25a0680fb0bec4c80439ec7e
    Image:         jenkins/jnlp-slave:3.23-1-alpine
    Image ID:      docker-pullable://jenkins/jnlp-slave@sha256:3cffe807013fece5182124b1e09e742f96b084ae832406a287283a258e79391c
    Port:          <none>
    Host Port:     <none>
    Args:
      b39461cef6e0c9a0ab970bf7f6ff664b463d119e8ddc4c8e966f8a77c2dc055f
      jenkins-slave-r59w1-qs283
    State:          Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Thu, 01 Nov 2018 12:20:12 +0000
      Finished:     Thu, 01 Nov 2018 12:21:01 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:     250m
      memory:  512Mi
    Environment:
      JENKINS_SECRET:      b39461cef6e0c9a0ab970bf7f6ff664b463d119e8ddc4c8e966f8a77c2dc055f
      JENKINS_TUNNEL:      jenkins-service:50000
      JENKINS_AGENT_NAME:  jenkins-slave-r59w1-qs283
      JENKINS_NAME:        jenkins-slave-r59w1-qs283
      JENKINS_URL:         http://jenkins-service:8080/
      HOME:                /home/jenkins
    Mounts:
      /home/jenkins from workspace-volume (rw)
      /mvn/.m2nrepo from volume-0 (rw)
      /var/lib/docker from volume-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kmrnj (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  volume-0:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  jenkins-pv-claim
    ReadOnly:   false
  volume-1:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  workspace-volume:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  default-token-kmrnj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-kmrnj
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

The jnlp container is in the Terminated state with reason Error and exit code 255.
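
The same fields can be pulled out of the describe output without reading it by eye. This is a minimal sketch that assumes the indentation used in the snippet above (container names at two spaces under "Containers:", with State, Reason and Exit Code beneath them). The function name terminated_containers is hypothetical, and the describe text format is not a stable interface, so treat the parsing as illustrative only.

```shell
# Reads `kubectl describe pod` output on stdin and prints
# "<container> <reason> <exit code>" for each terminated container.
terminated_containers() {
  awk '
    /^Containers:/ { in_containers = 1; next }
    /^[^ ]/        { in_containers = 0 }   # any other top-level section ends the block
    # A container name is a two-space-indented line ending in ":".
    in_containers && /^  [^ ].*:$/ { name = $1; sub(/:$/, "", name); next }
    # Only the current "State:" is tracked; "Last State:" has a different first field.
    in_containers && $1 == "State:" { state = $2; next }
    in_containers && state == "Terminated" && $1 == "Reason:" { reason = $2; next }
    in_containers && state == "Terminated" && $1 == "Exit" && $2 == "Code:" {
      print name, reason, $3
      state = ""
    }
  '
}
```

Piping the describe output for the Pod above through terminated_containers should report the jnlp container with reason Error and exit code 255.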

When I look at the logs for the failed container above (see attached) and compare them with those of a healthy container, they look the same up until the failed container shows this message:

Nov 01, 2018 12:20:49 PM hudson.remoting.jnlp.Main$CuiListener status
INFO: Terminated
Nov 01, 2018 12:20:59 PM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$FindEffectiveRestarters$1 onReconnect
INFO: Restarting agent via jenkins.slaves.restarter.UnixSlaveRestarter@53d577ce

It then seems to repeat the first attempt before printing a stacktrace, at which point the container enters the state described above.

I have also attached the Console Output from the build job associated with this pod. The build job spent "7 hr 41 min waiting" and ended up in a failed state.

It would be nice to fix this so that the Error state is never reached, but the bug I am reporting here is that the Pod is not cleaned up once it enters the Error state. Shouldn't the Jenkins kubernetes plugin keep track of this and clean up Pods that end up in this state?
