Hi all,

We are seeing a sporadic issue with our Gluster mounts that is affecting 
several of our Kubernetes environments. We are having trouble understanding 
what is causing it, and we could use some guidance from the pros!

Scenario
We have an environment running a single-node Kubernetes with Heketi and several 
pods using Gluster mounts. The environment runs fine and the mounts appear to 
be healthy for up to several days. Then suddenly one or more (sometimes all) 
of the Gluster mounts develop a problem and their bricks shut down. The 
affected containers enter a crash loop that continues indefinitely until 
someone intervenes. To work around the crash loop, a user has to trigger the 
bricks to start again, either by starting them manually, restarting the 
Gluster pod, or restarting the entire node.
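For reference, the manual recovery boils down to commands along these lines. The namespace, pod name, and the dry-run helper itself are placeholders for our environment; "gluster volume start <volume> force" is the CLI call that restarts downed bricks without stopping the volume:

```shell
# Sketch of the two lighter-weight recovery paths described above.
# This helper only prints the commands (dry run); the names passed in
# are placeholders, not real objects from our cluster.
print_recovery_commands() {
    ns="$1"; pod="$2"; vol="$3"
    # Option 1: restart the downed bricks in place
    echo "kubectl exec -n $ns $pod -- gluster volume start $vol force"
    # Option 2: restart the whole Gluster pod (the DaemonSet recreates it)
    echo "kubectl delete pod -n $ns $pod"
}

print_recovery_commands storage glusterfs-abcde heketidbstorage
```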

Diagnostics
The tell-tale sign is the following error message when describing a pod that 
is in a crash loop:

Message:      error while creating mount source path 
'/var/lib/kubelet/pods/4a2574bb-6fa4-11e9-a315-005056b83c80/volumes/kubernetes.io~glusterfs/db':
 mkdir 
/var/lib/kubelet/pods/4a2574bb-6fa4-11e9-a315-005056b83c80/volumes/kubernetes.io~glusterfs/db:
 file exists

We always see that "file exists" message when this error occurs.
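As a diagnostic, something like the following can list the leftover mount-point directories that trigger that "file exists" error. The helper name and the /var/lib/kubelet default root are assumptions for this sketch; adjust for your kubelet configuration:

```shell
# find_stale_gluster_dirs: list glusterfs mount-point directories left
# behind under a kubelet root (default /var/lib/kubelet). Pass another
# root for testing. These are the paths kubelet fails to mkdir over.
find_stale_gluster_dirs() {
    root="${1:-/var/lib/kubelet}"
    find "$root/pods" -type d \
        -path '*/volumes/kubernetes.io~glusterfs/*' 2>/dev/null
}

# Example: find_stale_gluster_dirs      # inspect the real kubelet root
```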

Looking at the glusterd.log file: nothing had been logged for over a day, and 
then suddenly, at the time the crash loop started, this appeared:

[2019-05-08 13:49:04.733147] I [MSGID: 106143] 
[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick 
/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_a3cef78a5914a2808da0b5736e3daec7/brick
 on port 49168
[2019-05-08 13:49:04.733374] I [MSGID: 106143] 
[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick 
/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_7614e5014a0e402630a0e1fd776acf0a/brick
 on port 49167
[2019-05-08 13:49:05.003848] W [socket.c:599:__socket_rwv] 0-management: readv 
on /var/run/gluster/fe4ac75011a4de0e.socket failed (No data available)
[2019-05-08 13:49:05.065420] W [socket.c:599:__socket_rwv] 0-management: readv 
on /var/run/gluster/85e9fb223aa121f2.socket failed (No data available)
[2019-05-08 13:49:05.066479] W [socket.c:599:__socket_rwv] 0-management: readv 
on /var/run/gluster/e2a66e8cd8f5f606.socket failed (No data available)
[2019-05-08 13:49:05.067444] W [socket.c:599:__socket_rwv] 0-management: readv 
on /var/run/gluster/a0625e5b78d69bb8.socket failed (No data available)
[2019-05-08 13:49:05.068471] W [socket.c:599:__socket_rwv] 0-management: readv 
on /var/run/gluster/770bc294526d0360.socket failed (No data available)
[2019-05-08 13:49:05.074278] W [socket.c:599:__socket_rwv] 0-management: readv 
on /var/run/gluster/adbd37fe3e1eed36.socket failed (No data available)
[2019-05-08 13:49:05.075497] W [socket.c:599:__socket_rwv] 0-management: readv 
on /var/run/gluster/17712138f3370e53.socket failed (No data available)
[2019-05-08 13:49:05.076545] W [socket.c:599:__socket_rwv] 0-management: readv 
on /var/run/gluster/a6cf1aca8b23f394.socket failed (No data available)
[2019-05-08 13:49:05.077511] W [socket.c:599:__socket_rwv] 0-management: readv 
on /var/run/gluster/d0f83b191213e877.socket failed (No data available)
[2019-05-08 13:49:05.078447] W [socket.c:599:__socket_rwv] 0-management: readv 
on /var/run/gluster/d5dd08945d4f7f6d.socket failed (No data available)
[2019-05-08 13:49:05.079424] W [socket.c:599:__socket_rwv] 0-management: readv 
on /var/run/gluster/c8d7b10108758e2f.socket failed (No data available)
[2019-05-08 13:49:14.778619] I [MSGID: 106143] 
[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick 
/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_0ed4f7f941de388cda678fe273e9ceb4/brick
 on port 49166
... (and more of the same)

Nothing further has been printed to the gluster log since. The bricks do not 
come back on their own.
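In case it is useful for comparing incidents, the pmap_registry_remove entries above can be summarized by pulling the port and brick path out of each line. A small sketch, assuming each entry is a single line in the actual glusterd.log (unlike the wrapped text quoted above):

```shell
# removed_bricks: read glusterd.log on stdin and print "PORT PATH" for
# every "removing brick ... on port N" entry.
removed_bricks() {
    sed -n 's/.*removing brick \(.*\) on port \([0-9]*\)$/\2 \1/p'
}

# Example: removed_bricks < /var/log/glusterfs/glusterd.log
```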
The version of Gluster we are using (running in a container, from the 
gluster/gluster-centos image on Docker Hub):

# rpm -qa | grep gluster
glusterfs-rdma-4.1.7-1.el7.x86_64
gluster-block-0.3-2.el7.x86_64
python2-gluster-4.1.7-1.el7.x86_64
centos-release-gluster41-1.0-3.el7.centos.noarch
glusterfs-4.1.7-1.el7.x86_64
glusterfs-api-4.1.7-1.el7.x86_64
glusterfs-cli-4.1.7-1.el7.x86_64
glusterfs-geo-replication-4.1.7-1.el7.x86_64
glusterfs-libs-4.1.7-1.el7.x86_64
glusterfs-client-xlators-4.1.7-1.el7.x86_64
glusterfs-fuse-4.1.7-1.el7.x86_64
glusterfs-server-4.1.7-1.el7.x86_64

The version of glusterfs running on our Kubernetes node (a CentOS system):

$ rpm -qa | grep gluster
glusterfs-libs-3.12.2-18.el7.x86_64
glusterfs-3.12.2-18.el7.x86_64
glusterfs-fuse-3.12.2-18.el7.x86_64
glusterfs-client-xlators-3.12.2-18.el7.x86_64

The Kubernetes version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
Platform:"linux/amd64"}

Our gluster settings/volume options:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gluster-heketi
  selfLink: /apis/storage.k8s.io/v1/storageclasses/gluster-heketi
parameters:
  gidMax: "50000"
  gidMin: "2000"
  resturl: http://10.233.35.158:8080
  restuser: "null"
  restuserkey: "null"
  volumetype: "none"
  volumeoptions: cluster.post-op-delay-secs 0, performance.client-io-threads 
off, performance.open-behind off, performance.readdir-ahead off, 
performance.read-ahead off, performance.stat-prefetch off, 
performance.write-behind off, performance.io-cache off, 
cluster.consistent-metadata on, performance.quick-read off, 
performance.strict-o-direct on
provisioner: kubernetes.io/glusterfs
reclaimPolicy: Delete

Volume info for the heketi volume:


gluster> volume info heketidbstorage

Volume Name: heketidbstorage
Type: Distribute
Volume ID: 34b897d0-0953-4f8f-9c5c-54e043e55d92
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 10.10.168.25:/var/lib/heketi/mounts/vg_c197878af606e71a874ad28e3bd7e4e1/brick_a16f9f0374fe5db948a60a017a3f5e60/brick
Options Reconfigured:
user.heketi.id: 1d2400626dac780fce12e45a07494853
transport.address-family: inet
nfs.disable: on

Full Gluster logs are available if needed; just let me know how best to 
provide them.

Thanks in advance for any help or suggestions on this!

Best,

Jeff Bischoff
Turbonomic

_______________________________________________
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
