Jeff,
The main question, as you say, is why we're losing connectivity to these servers.
Could there be a hardware issue? I have replaced the network cables for
the two servers but I don't really know what else to check. The network
switch hasn't recorded any errors for those two ports. There isn't
anything sinister in /var/log/messages.
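In case it is useful, here are a couple of additional NIC-level checks that could be run on each server (a rough sketch only; "eth0" is a placeholder for the actual interface name):

  # Per-interface error and drop counters as seen by the kernel ("eth0" is a placeholder).
  ip -s link show eth0
  # Driver-level counters (CRC errors, carrier losses, resets), where the NIC exposes them.
  ethtool -S eth0 | grep -Ei 'err|drop|crc|carrier'
  # Any link flaps or NIC resets logged by the kernel.
  dmesg | grep -i eth0

Non-zero error counters or repeated link-up/link-down messages would point back at the cabling, the switch ports, or the NICs themselves.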
It seems a bit of a coincidence that both servers lost connection at
exactly the same time. The only thing the users have started doing
differently recently is processing a large number of small text files.
There is one particular application they are running that processes this
data, but the load on the GlusterFS servers doesn't go up while it is
running.
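One way to test that correlation might be to sample load and TCP retransmission counters on the servers while the application runs, and line the timestamps up with the disconnects afterwards (a minimal sketch; the log path is arbitrary):

  # Sample load average and TCP retransmission counters once a minute.
  while true; do
      { date; cat /proc/loadavg; netstat -s | grep -i retrans; echo; } >> /var/tmp/net-watch.log
      sleep 60
  done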
-Dan
On 02/23/2012 02:41 PM, Jeff Darcy wrote:
On 02/23/2012 08:58 AM, Dan Bretherton wrote:
It is reassuring to know that these errors are self-repairing. That does
appear to be happening, but only when I run "find -print0 | xargs --null stat
> /dev/null" in affected directories.
Hm. Then maybe the xattrs weren't *set* on that brick.
I will run that self-heal on the whole
volume as well, but I have had to start with specific directories that people
want to work in today. Does repeating the fix-layout operation have any
effect, or are the xattr repairs all done by the self-heal mechanism?
AFAICT the DHT self-heal mechanism (not to be confused with the better-known
AFR self-heal mechanism) will take care of this. Running fix-layout would be
redundant for those directories, but not harmful.
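To confirm whether the layout xattrs were actually set on the suspect brick, they can be read directly on the server (a minimal sketch; "/export/brick1" and the directory path below are placeholders for the real brick path):

  # Read the DHT layout xattr for a directory directly on the brick.
  getfattr -n trusted.glusterfs.dht -e hex /export/brick1/some/directory
  # List any AFR changelog xattrs on the same directory.
  getfattr -d -m trusted.afr -e hex /export/brick1/some/directory

A directory that is missing trusted.glusterfs.dht on one brick, compared with its neighbours, would be consistent with the "anomalies"/"holes" messages further down.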
I have found the cause of the transient brick failure; it happened again this
morning on a replicated pair of bricks. Suddenly the
etc-glusterfs-glusterd.vol.log file was flooded with the following message
every few seconds:
E [socket.c:2080:socket_connect] 0-management: connection attempt failed
(Connection refused)
One of the clients then reported errors like the following.
[2012-02-23 11:19:22.922785] E [afr-common.c:3164:afr_notify]
2-atmos-replicate-3: All subvolumes are down. Going offline until atleast one
of them comes back up.
[2012-02-23 11:19:22.923682] I [dht-layout.c:581:dht_layout_normalize]
0-atmos-dht: found anomalies in /. holes=1 overlaps=0
Bingo. This is exactly how DHT subvolume #3 could "miss out" on a directory
being created or updated, as seems to have happened.
[2012-02-23 11:19:22.923714] I [dht-selfheal.c:569:dht_selfheal_directory]
0-atmos-dht: 1 subvolumes down -- not fixing
[2012-02-23 11:19:22.941468] W [socket.c:1494:__socket_proto_state_machine]
1-atmos-client-7: reading from socket failed. Error (Transport endpoint is not
connected), peer (192.171.166.89:24019)
[2012-02-23 11:19:22.972307] I [client.c:1883:client_rpc_notify]
1-atmos-client-7: disconnected
[2012-02-23 11:19:22.972352] E [afr-common.c:3164:afr_notify]
1-atmos-replicate-3: All subvolumes are down. Going offline until atleast one
of them comes back up.
The servers causing trouble were still showing as Connected in "gluster peer
status" and nothing appeared to be wrong except for glusterd misbehaving.
Restarting glusterd solved the problem, but given that this has happened twice
this week already, I am worried that it could happen again at any time. Do you
know what might be causing glusterd to stop responding like this?
The glusterd failures and the brick failures are likely to share a common
cause, as opposed to one causing the other. The main question is therefore why
we're losing connectivity to these servers. Secondarily, there might be a bug
to do with the failure being seen in the I/O path but not in the peer path, but
that's not likely to be the *essential* problem.
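To catch the next occurrence in the act, one option might be to probe the glusterd management port (24007) on the suspect servers from a client or another peer and log any failures (a rough sketch; the hostnames and log path are placeholders):

  # Probe glusterd's management port on each suspect server once a minute.
  while true; do
      for h in server1 server2; do
          nc -z -w 5 "$h" 24007 || echo "$(date): glusterd on $h unreachable" >> /var/tmp/glusterd-probe.log
      done
      sleep 60
  done

Timestamps in that log could then be lined up against the brick disconnects reported in the client logs.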