Thanks Jeff, that's interesting.
It is reassuring to know that these errors are self-repairing. That
does appear to be happening, but only when I run "find -print0 | xargs
--null stat >/dev/null" in the affected directories. I will run that
self-heal pass over the whole volume as well, but I have had to start
with the specific directories people want to work in today. Does
repeating the fix-layout operation have any effect, or are the xattr
repairs all done by the self-heal mechanism?
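For reference, this is roughly what I have in mind for the whole-volume
pass, run from one of the client mounts (the mount point is just an
example, and I am assuming the volume is called "atmos" based on the
log prefixes):

    # walk the whole volume to trigger self-heal on every directory
    cd /mnt/atmos && find . -print0 | xargs --null stat >/dev/null

    # or repeat the layout fix, if that is what rewrites the xattrs
    gluster volume rebalance atmos fix-layout start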
I have found the cause of the transient brick failure; it happened again
this morning on a replicated pair of bricks. Suddenly the
etc-glusterfs-glusterd.vol.log file was flooded with these messages
every few seconds.
E [socket.c:2080:socket_connect] 0-management: connection attempt failed
(Connection refused)
One of the clients then reported errors like the following.
[2012-02-23 11:19:22.922785] E [afr-common.c:3164:afr_notify]
2-atmos-replicate-3: All subvolumes are down. Going offline until
atleast one of them comes back up.
[2012-02-23 11:19:22.923682] I [dht-layout.c:581:dht_layout_normalize]
0-atmos-dht: found anomalies in /. holes=1 overlaps=0
[2012-02-23 11:19:22.923714] I
[dht-selfheal.c:569:dht_selfheal_directory] 0-atmos-dht: 1 subvolumes
down -- not fixing
[2012-02-23 11:19:22.941468] W
[socket.c:1494:__socket_proto_state_machine] 1-atmos-client-7: reading
from socket failed. Error (Transport endpoint is not connected), peer
(192.171.166.89:24019)
[2012-02-23 11:19:22.972307] I [client.c:1883:client_rpc_notify]
1-atmos-client-7: disconnected
[2012-02-23 11:19:22.972352] E [afr-common.c:3164:afr_notify]
1-atmos-replicate-3: All subvolumes are down. Going offline until
atleast one of them comes back up.
The servers causing trouble were still showing as Connected in "gluster
peer status", and nothing else appeared to be wrong apart from glusterd
misbehaving. Restarting glusterd solved the problem, but given that
this has happened twice this week already, I am worried that it could
happen again at any time. Do you know what might be causing glusterd to
stop responding like this?
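Next time it happens I will try to capture a bit more state before
restarting. Something along these lines, assuming the stock init script
from the RPM install (the service name may differ elsewhere):

    # is glusterd still running and listening on the management port (24007)?
    service glusterd status
    netstat -tlnp | grep 24007

    # peer view from the affected server
    gluster peer status

    # the restart that cleared the errors this morning
    service glusterd restart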
Regards
Dan.
On 02/22/2012 08:00 PM, [email protected] wrote:
Date: Wed, 22 Feb 2012 10:32:31 -0500
From: Jeff Darcy <[email protected]>
Subject: Re: [Gluster-users] "mismatching layouts" errors after expanding volume
To: [email protected]
Following up on the previous reply...
On 02/22/2012 02:52 AM, Dan Bretherton wrote:
> [2012-02-16 22:59:42.504907] I
> [dht-layout.c:682:dht_layout_dir_mismatch] 0-atmos-dht: subvol:
> atmos-replicate-0; inode layout - 0 - 0; disk layout - 920350134 - 1227133511
> [2012-02-16 22:59:42.534399] I [dht-common.c:524:dht_revalidate_cbk]
> 0-atmos-dht: mismatching layouts for /users/rle/TRACKTEMP/TRACKS
On 02/22/2012 09:19 AM, Jeff Darcy wrote:
> OTOH, the log entries below do seem to indicate that there's something going on
> that I don't understand. I'll dig a bit, and let you know if I find anything
> to change my mind wrt the safety of restoring write access.
The two messages above are paired, in the sense that the second is inevitable
after the first. The "disk layout" range shown in the first is exactly what I
would expect for subvolume 3 out of 0-13. That means the trusted.glusterfs.dht
value on disk seems reasonable. The corresponding in-memory "inode layout"
entry has the less reasonable value of all zero. That probably means we failed
to fetch the xattr at some point in the past. There might be something earlier
in your logs - perhaps a message about "holes" and/or one specifically
mentioning that subvolume - to explain why.
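As a quick sanity check, assuming the layout code simply splits the
32-bit hash range evenly by dividing 0xffffffff by the subvolume count,
slot 3 of 14 works out to exactly the on-disk values above:

    # expected hash range for slot 3 when 0xffffffff is split 14 ways
    echo $(( 0xffffffff / 14 ))            # chunk size:  306783378
    echo $(( 3 * (0xffffffff / 14) ))      # range start: 920350134
    echo $(( 4 * (0xffffffff / 14) - 1 ))  # range end:   1227133511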
The good news is that this should be self-repairing. Once we get these
messages, we try to re-fetch the layout information from all subvolumes. If
*that* failed, we'd see more messages than those above. Since the on-disk
values seem OK and revalidation seems to be succeeding, I would say these
messages probably represent successful attempts to recover from a transient
brick failure, and that does*not* change what I said previously.
_______________________________________________
Gluster-users mailing list
[email protected]
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users