sergey-safarov opened a new issue #2102: kubernetes: cluster cannot find peer nodes after statefulset recreation
URL: https://github.com/apache/couchdb/issues/2102

## Description
In a Kubernetes environment, the DNS names of StatefulSet pods are created dynamically. When one CouchDB daemon starts, the others may not be available yet, so the DNS lookup fails. Some time later, DNS records are created for all started CouchDB daemons, but the cluster is still unable to join all nodes.

## Steps to Reproduce
I have configured a CouchDB cluster in a Kubernetes environment using these Service and StatefulSet YAML files.

*Service*
```yaml
# file contains database headless service
# creates kubernetes dns records for database daemons
# required for database nodes discovery
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: db
```

*StatefulSet*
```yaml
# file contains database daemons
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
  labels:
    app: db
spec:
  podManagementPolicy: Parallel
  serviceName: db
  replicas: 5
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      restartPolicy: Always
      containers:
      - name: node
        image: couchdb:2.3.1
        imagePullPolicy: IfNotPresent
        env:
        - name: NODE_NETBIOS_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NODENAME
          value: $(NODE_NETBIOS_NAME).db
        - name: COUCHDB_SECRET
          value: monster
        - name: ERL_FLAGS
          value: "-name couchdb"
        - name: ERL_FLAGS
          value: "-setcookie monster"
        volumeMounts:
        - name: pvc
          mountPath: /opt/couchdb/data
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 5984
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /_up
            port: 5984
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
  volumeClaimTemplates:
  - metadata:
      name: pvc
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 128Gi
      volumeName: db
```

I checked the cluster membership and found all nodes online:
```sh
[safarov@safarov-dell EKS]$ kubectl exec -it db-0 -- curl http://db-0.db:5984/_membership
{"all_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"],"cluster_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"]}
```

Then I deleted the StatefulSet and checked that all pods were deleted:
```sh
[safarov@safarov-dell yaml]$ kubectl delete -f 02-db.yaml
statefulset.apps "db" deleted
[safarov@safarov-dell yaml]$ kubectl get pods -l "app=db"
No resources found.
```

Then I created the StatefulSet again and checked that all pods were ready:
```sh
[safarov@safarov-dell yaml]$ kubectl create -f 02-db.yaml
statefulset.apps/db created
[safarov@safarov-dell yaml]$ kubectl get pods -l "app=db"
NAME   READY   STATUS    RESTARTS   AGE
db-0   1/1     Running   0          40s
db-1   1/1     Running   0          40s
db-2   1/1     Running   0          40s
db-3   1/1     Running   0          40s
db-4   1/1     Running   0          40s
```

And then I checked the cluster membership again:
```sh
[safarov@safarov-dell yaml]$ kubectl exec -it db-0 -- curl http://127.0.0.1:5984/_membership
{"all_nodes":["[email protected]","[email protected]"],"cluster_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"]}
```

As you can see, the `db-0` pod does not see pods `db-1`, `db-3` and `db-4`. However, the `db-0` pod can still query the membership of the other nodes by using their DNS names in the URL.
```sh
[safarov@safarov-dell yaml]$ kubectl exec -it db-0 -- /bin/bash
root@db-0:/# curl http://db-1.db:5984/_membership
{"all_nodes":["[email protected]"],"cluster_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"]}
root@db-0:/# curl http://db-2.db:5984/_membership
{"all_nodes":["[email protected]","[email protected]","[email protected]"],"cluster_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"]}
root@db-0:/# curl http://db-3.db:5984/_membership
{"all_nodes":["[email protected]","[email protected]"],"cluster_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"]}
root@db-0:/# curl http://db-4.db:5984/_membership
{"all_nodes":["[email protected]"],"cluster_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"]}
```

As you can see, the CouchDB cluster is broken. If I delete the pods one by one, the StatefulSet creates each pod again, and the new pod is able to resolve the DNS names of the other nodes.
```sh
[safarov@safarov-dell yaml]$ kubectl delete pod db-0
pod "db-0" deleted
[safarov@safarov-dell yaml]$ kubectl delete pod db-1
pod "db-1" deleted
[safarov@safarov-dell yaml]$ kubectl delete pod db-2
pod "db-2" deleted
[safarov@safarov-dell yaml]$ kubectl delete pod db-3
pod "db-3" deleted
[safarov@safarov-dell yaml]$ kubectl delete pod db-4
pod "db-4" deleted
[safarov@safarov-dell yaml]$ kubectl get pods -l "app=db"
NAME   READY   STATUS              RESTARTS   AGE
db-0   1/1     Running             0          54s
db-1   1/1     Running             0          48s
db-2   1/1     Running             0          33s
db-3   0/1     Running             0          14s
db-4   0/1     ContainerCreating   0          7s
```

And now all nodes are joined properly:
```sh
[safarov@safarov-dell yaml]$ kubectl exec -it db-0 -- /bin/bash
root@db-0:/# curl http://db-0.db:5984/_membership
{"all_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"],"cluster_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"]}
root@db-0:/# curl http://db-1.db:5984/_membership
{"all_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"],"cluster_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"]}
root@db-0:/# curl http://db-2.db:5984/_membership
{"all_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"],"cluster_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"]}
root@db-0:/# curl http://db-3.db:5984/_membership
{"all_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"],"cluster_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"]}
root@db-0:/# curl http://db-4.db:5984/_membership
{"all_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"],"cluster_nodes":["[email protected]","[email protected]","[email protected]","[email protected]","[email protected]"]}
```

## Expected Behaviour
1. If a peer's DNS record is created after the CouchDB daemon has started, the daemon should retry connecting to the peer nodes.
2. All cluster nodes should be able to connect to each other after StatefulSet recreation.

## Your Environment
Kubernetes 1.13, Amazon DockerHub couchdb:2.3.1 image.
```sh
root@db-0:/# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```

## Additional context
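
The membership comparison done by eye in the steps above can be automated. The following is a minimal Python sketch, not part of CouchDB itself. It assumes that `cluster_nodes` lists the nodes expected in the cluster and `all_nodes` lists the nodes the queried node currently sees, which is how the outputs in this report are read. The helper name `missing_peers` is hypothetical.

```python
import json
from typing import List


def missing_peers(membership_json: str) -> List[str]:
    """Return nodes present in cluster_nodes but absent from all_nodes.

    membership_json is the raw JSON body returned by a node's
    /_membership endpoint. Assumption: a node listed in cluster_nodes
    but not in all_nodes is one the queried node has not connected to.
    """
    membership = json.loads(membership_json)
    seen = set(membership["all_nodes"])
    return sorted(node for node in membership["cluster_nodes"] if node not in seen)
```

An empty result for every pod corresponds to the healthy state shown after the one-by-one restart; a non-empty result on any pod corresponds to the broken state after the StatefulSet was recreated.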
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services
