On Thu, Sep 14, 2017 at 12:58 AM, Ben Werthmann <[email protected]> wrote:
> I ran into something like this in 3.10.4 and filed two bugs for it:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1491059
> https://bugzilla.redhat.com/show_bug.cgi?id=1491060
>
> Please see the above bugs for full detail.
>
> In summary, my issue was related to glusterd's handling of pid files when it
> starts the self-heal daemon and bricks. The issues are:
>
> a. The brick pid file leaves a stale pid, and the brick fails to start when
> glusterd is started. Pid files are stored in `/var/lib/glusterd`, which
> persists across reboots. When glusterd is started (or restarted, or the host
> is rebooted) and the pid in the brick pid file matches any running process,
> the brick fails to start.
>
> b. The self-heal daemon pid file leaves a stale pid, and glusterd
> indiscriminately kills that pid when it is started. Pid files are stored in
> `/var/lib/glusterd`, which persists across reboots. When glusterd is started
> (or restarted, or the host is rebooted), whatever process currently holds the
> pid recorded in the shd pid file is killed.
>
> Due to the nature of these bugs, sometimes bricks/shd will start and
> sometimes they will not; restart success may be intermittent. This bug is
> most likely to occur when services were running with a low pid and the host
> is then rebooted, since reboots tend to densely group pids in lower pid
> numbers. You might also see it if you have high pid churn due to short-lived
> processes.
>
> In the case of the self-heal daemon, you may also see other processes
> "randomly" being terminated.
>
> Resulting in:
>
> 1a. The pid file /var/lib/glusterd/glustershd/run/glustershd.pid remains
> after shd is stopped.
> 2a. glusterd kills whatever process holds the pid recorded in the stale shd
> pid file.
> 1b. Brick pid file(s) remain after the brick is stopped.
> 2b. glusterd fails to start a brick when the pid in its pid file matches any
> running process.
>
> Workaround:
>
> In our automation, when we stop all gluster processes (reboot, upgrade,
> etc.), we ensure all processes are stopped and then clean up the pid files
> with:
> 'find /var/lib/glusterd/ -name '*pid' -delete'

I've added comments to both bugs. Good news is that this is already fixed
in 3.12.0.

> This is not a complete solution, but it works in our most critical times. We
> may develop something more complete if the bug is not addressed promptly.
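For anyone who has to stay on a pre-3.12 release for a while, a slightly safer
variant of the cleanup quoted above is to remove only the pid files whose
recorded pid is gone or no longer belongs to a gluster process. This is just a
shell sketch, not what glusterd itself does; it assumes the daemons show up in
ps as glusterd/glusterfs/glusterfsd and that the pid files live under
/var/lib/glusterd as in the workaround above. Adjust names and paths for your
install:

#!/bin/sh
# Remove stale gluster pid files under /var/lib/glusterd. A pid file is
# treated as stale when the recorded pid is not running, or is running
# but is not a gluster* process (i.e. the pid was reused after a reboot).
find /var/lib/glusterd/ -name '*pid' | while read -r pidfile; do
    pid=$(cat "$pidfile" 2>/dev/null)
    case "$pid" in
        ''|*[!0-9]*) rm -f "$pidfile"; continue ;;   # empty or garbage pid
    esac
    comm=$(ps -p "$pid" -o comm= 2>/dev/null)
    case "$comm" in
        gluster*) : ;;                # glusterd/glusterfs/glusterfsd still owns this pid
        *)        rm -f "$pidfile" ;; # process gone or pid reused: drop the stale file
    esac
done

Run after all gluster services have been stopped (or from something like a
systemd ExecStartPre=), this behaves like the plain 'find ... -delete', but it
will not throw away a pid file that still belongs to a live gluster process.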
> On Sat, Aug 5, 2017 at 11:54 PM, Leonid Isaev <[email protected]> wrote:
>
>> Hi,
>>
>> I have a distributed volume which runs on Fedora 26 systems with
>> glusterfs 3.11.2 from gluster.org repos:
>> ----------
>> [root@taupo ~]# glusterd --version
>> glusterfs 3.11.2
>>
>> gluster> volume info gv2
>> Volume Name: gv2
>> Type: Distribute
>> Volume ID: 6b468f43-3857-4506-917c-7eaaaef9b6ee
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 6
>> Transport-type: tcp
>> Bricks:
>> Brick1: kiwi:/srv/gluster/gv2/brick1/gvol
>> Brick2: kiwi:/srv/gluster/gv2/brick2/gvol
>> Brick3: taupo:/srv/gluster/gv2/brick1/gvol
>> Brick4: fox:/srv/gluster/gv2/brick1/gvol
>> Brick5: fox:/srv/gluster/gv2/brick2/gvol
>> Brick6: logan:/srv/gluster/gv2/brick1/gvol
>> Options Reconfigured:
>> performance.readdir-ahead: on
>> nfs.disable: on
>>
>> gluster> volume status gv2
>> Status of volume: gv2
>> Gluster process                            TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick kiwi:/srv/gluster/gv2/brick1/gvol    49152     0          Y       1128
>> Brick kiwi:/srv/gluster/gv2/brick2/gvol    49153     0          Y       1134
>> Brick taupo:/srv/gluster/gv2/brick1/gvol   N/A       N/A        N       N/A
>> Brick fox:/srv/gluster/gv2/brick1/gvol     49152     0          Y       1169
>> Brick fox:/srv/gluster/gv2/brick2/gvol     49153     0          Y       1175
>> Brick logan:/srv/gluster/gv2/brick1/gvol   49152     0          Y       1003
>> ----------
>>
>> The machine in question is TAUPO, which has one brick that refuses to
>> connect to the cluster. All installations were migrated from glusterfs
>> 3.8.14 on Fedora 24: I simply rsync'ed /var/lib/glusterd to the new
>> systems. On all other machines glusterd starts fine and all bricks come
>> up. Hence I suspect a race condition somewhere. The glusterd.log file
>> (attached) shows that the brick connects, and then suddenly disconnects
>> from the cluster:
>> ----------
>> [2017-08-06 03:12:38.536409] I [glusterd-utils.c:5468:glusterd_brick_start]
>> 0-management: discovered already-running brick /srv/gluster/gv2/brick1/gvol
>> [2017-08-06 03:12:38.536414] I [MSGID: 106143]
>> [glusterd-pmap.c:279:pmap_registry_bind] 0-pmap: adding brick
>> /srv/gluster/gv2/brick1/gvol on port 49153
>> [2017-08-06 03:12:38.536427] I [rpc-clnt.c:1059:rpc_clnt_connection_init]
>> 0-management: setting frame-timeout to 600
>> [2017-08-06 03:12:38.536500] I [rpc-clnt.c:1059:rpc_clnt_connection_init]
>> 0-snapd: setting frame-timeout to 600
>> [2017-08-06 03:12:38.536556] I [rpc-clnt.c:1059:rpc_clnt_connection_init]
>> 0-snapd: setting frame-timeout to 600
>> [2017-08-06 03:12:38.536616] I [MSGID: 106492]
>> [glusterd-handler.c:2717:__glusterd_handle_friend_update] 0-glusterd:
>> Received friend update from uuid: d5a487e3-4c9b-4e5a-91ff-b8d85fd51da9
>> [2017-08-06 03:12:38.584598] I [MSGID: 106502]
>> [glusterd-handler.c:2762:__glusterd_handle_friend_update] 0-management:
>> Received my uuid as Friend
>> [2017-08-06 03:12:38.599340] I [socket.c:2474:socket_event_handler]
>> 0-transport: EPOLLERR - disconnecting now
>> [2017-08-06 03:12:38.613745] I [MSGID: 106005]
>> [glusterd-handler.c:5846:__glusterd_brick_rpc_notify] 0-management:
>> Brick taupo:/srv/gluster/gv2/brick1/gvol has disconnected from glusterd.
>> ----------
>>
>> I checked that cluster.brick-multiplex is off. How can I debug this
>> further?
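For the taupo brick specifically, it is worth checking what glusterd
"discovered" there: the log above shows it picking up an already-running
brick, so the pid file for /srv/gluster/gv2/brick1/gvol may be pointing at an
unrelated process. A rough check (shell; the exact pid file name and location
depend on the glusterfs version, so this just searches under /var/lib/glusterd
rather than assuming a path):

# run on taupo: list any pid files that look like they belong to volume gv2
# and see which process, if any, currently holds each recorded pid
for pidfile in $(find /var/lib/glusterd -path '*gv2*' -name '*pid'); do
    echo "== $pidfile"
    cat "$pidfile"
    # a healthy brick should show a glusterfsd for .../gv2/brick1/gvol here;
    # anything else (or no process at all) means the pid file is stale/reused
    ps -p "$(cat "$pidfile")" -o pid,comm,args 2>/dev/null \
        || echo "(no process with that pid)"
done

If the recorded pid belongs to something other than a glusterfsd serving that
brick, you are most likely hitting the same stale pid file problem described
above; removing the file while all gluster processes are stopped and then
starting glusterd again should bring the brick back.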
>> Thanks in advance,
>> --
>> Leonid Isaev
_______________________________________________
Gluster-users mailing list
[email protected]
http://lists.gluster.org/mailman/listinfo/gluster-users
