On Mon, May 17, 2021 at 4:22 PM Marco Fais wrote:
> Hi,
>
> I am having significant issues with glustershd with releases 8.4 and 9.1.
>
> My oVirt clusters are using gluster storage backends, and were running
> fine with Gluster 7.x (shipped with earlier versions of oVirt Node 4.4.x).
> Recently the oVirt project moved to Gluster 8.4 for the nodes, and hence I
> have moved to this release when upgrading my clusters.
>
> Since then I am having issues whenever one of the nodes is brought down;
> when the nodes come back up online the bricks are typically back up and
> working, but some (random) glustershd processes in the various nodes seem
> to have issues connecting to some of them.
>
>
When the issue happens, can you check whether the TCP port number of the brick
(glusterfsd) processes displayed in `gluster volume status` matches the
actual port number observed (i.e. the --brick-port argument) when you run
`ps aux | grep glusterfsd`? If they don't match, then glusterd has incorrect
brick port information in its memory and is serving it to glustershd.
Restarting glusterd (instead of killing the bricks + `volume start force`)
should fix it, although we would still need to find out why glusterd is
serving incorrect port numbers.
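To illustrate the comparison, here is a minimal sketch run against fabricated sample output (the brick path, PID, and port 49152 below are made up; on a live node you would pipe the real `gluster volume status` and `ps` output instead, and the awk field positions may need adjusting for your versions):

```shell
# Port reported by glusterd (2nd column of the Brick line in `gluster volume status`).
# Sample line stands in for real output here.
status_port=$(echo "node1:/bricks/b1    49152   0   Y   12345" | awk '{print $2}')

# Port the brick process is actually using (from its --brick-port argument).
# Sample command line stands in for real `ps aux | grep glusterfsd` output.
actual_port=$(echo "/usr/sbin/glusterfsd -s node1 --brick-port 49152 ..." \
  | grep -o -- '--brick-port [0-9]*' | awk '{print $2}')

if [ "$status_port" = "$actual_port" ]; then
    echo "ports match: $status_port"
else
    echo "MISMATCH: glusterd reports $status_port, brick listens on $actual_port"
fi
```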
If they do match, then can you take a statedump of glustershd to check
whether it is indeed disconnected from the bricks? Each client connection
in the statedump should show 'connected=1'; a 'connected=0' entry means
glustershd has lost its connection to that brick. See the "Self-heal is
stuck/ not getting completed." section in
https://docs.gluster.org/en/latest/Troubleshooting/troubleshooting-afr/.
A statedump can be taken with `kill -SIGUSR1 $pid-of-glustershd`. It will be
generated in the /var/run/gluster/ directory.
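The grep stage of that check can be sketched as follows; the statedump fragment here is fabricated for illustration (volume name `testvol` and the client section names are assumptions), since a real dump is written under /var/run/gluster/ after the SIGUSR1:

```shell
# Fabricated statedump fragment; a real glustershd dump has one
# [xlator.protocol.client.<volume>-client-N] section per brick connection.
cat > /tmp/glusterdump.sample <<'EOF'
[xlator.protocol.client.testvol-client-0]
connected=1
[xlator.protocol.client.testvol-client-1]
connected=0
EOF

# Print any disconnected brick connections together with their section header.
grep -B1 'connected=0' /tmp/glusterdump.sample
```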
Regards,
Ravi
Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users