Remi,
Sorry, I think you want to keep web02 as the source and web01 as the sink,
so the commands need to be executed on web01:
1) sudo setfattr -n trusted.afr.shared-application-data-client-1 -v 0sAAAAAAAAAAAAAAAA <file-name>
2) then do a find on the <file-name>.
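For example, for the first file from your listing, run on web01 (I am assuming your FUSE client mount is at /mnt/shared, so substitute your actual mount point):

    cd /var/glusterfs/bricks/shared
    # 1) clear web01's pending-change record against web02 (client-1) on the brick copy
    sudo setfattr -n trusted.afr.shared-application-data-client-1 \
         -v 0sAAAAAAAAAAAAAAAA agc/production/log/809223185/contact.log
    # 2) look the file up through a client mount so replicate re-examines it and self-heals
    find /mnt/shared/agc/production/log/809223185/contact.log | xargs stat > /dev/null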
Thanks
Pranith
----- Original Message -----
From: "Pranith Kumar. Karampuri" <[email protected]>
To: "Remi Broemeling" <[email protected]>
Cc: [email protected]
Sent: Thursday, May 19, 2011 2:14:52 PM
Subject: Re: [Gluster-users] Rebuild Distributed/Replicated Setup
hi Remi,
This is a classic case of split-brain. See if the md5sum of the files in
question matches on both web01 and web02. If yes, you can safely reset the xattr of
the file on one of the replicas to trigger self-heal. If the md5sums don't
match, you will have to select the machine you want to keep as the source (in
your case it is web01), go to the other machine (in your case it is web02) and
execute the following commands:
1) sudo setfattr -n trusted.afr.shared-application-data-client-0 -v 0sAAAAAAAAAAAAAAAA <file-name>
2) then do a find on the <file-name>.
That will trigger self-heal and bring both copies back in sync.
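To check whether the two copies actually differ in the first place, you can compare checksums directly on the brick paths, for example (assuming you can ssh to both servers from wherever you run this; shown for one of the affected files):

    for h in web01 web02; do
        ssh "$h" md5sum /var/glusterfs/bricks/shared/agc/production/log/809223185/contact.log
    done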
Self-heal can cause a performance hit if you trigger it for all of the files at
once and they are big files, so in that case trigger them one after the other,
each after the previous one has completed.
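If you want to script that, something like the following (run from a client mount, which I am assuming is at /mnt/shared; adjust the mount point and the file list) touches the affected files one at a time; confirm that each heal has actually finished, for example by re-comparing md5sums on the bricks, before relying on the file:

    for f in \
        agc/production/log/809223185/contact.log \
        agc/production/log/809223185/event.log \
        agc/production/log/809223635/contact.log \
        agc/production/log/809224061/contact.log \
        agc/production/log/809224321/contact.log \
        agc/production/log/809215319/event.log
    do
        # looking the file up through the mount triggers its self-heal
        stat "/mnt/shared/$f" > /dev/null
        # crude pacing so the heals do not all run at once; verify each heal
        # completed (e.g. md5sums match on both bricks) before trusting the file
        sleep 60
    done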
Let me know if you need any more help with this. Removing the whole of the web02
data and triggering a total self-heal is a very expensive operation; I wouldn't do
that.
Pranith.
----- Original Message -----
From: "Remi Broemeling" <[email protected]>
To: "Pranith Kumar. Karampuri" <[email protected]>
Cc: [email protected]
Sent: Wednesday, May 18, 2011 8:21:33 PM
Subject: Re: [Gluster-users] Rebuild Distributed/Replicated Setup
Sure,
These files are just a sampling -- a lot of other files are showing the same
"split-brain" behaviour.
[14:42:45][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223185/contact.log
# file: agc/production/log/809223185/contact.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAABQAAAAAAAAAA
[14:45:15][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223185/contact.log
# file: agc/production/log/809223185/contact.log
trusted.afr.shared-application-data-client-0=0sAAACOwAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA
[14:42:53][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223185/event.log
# file: agc/production/log/809223185/event.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAADgAAAAAAAAAA
[14:45:24][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223185/event.log
# file: agc/production/log/809223185/event.log
trusted.afr.shared-application-data-client-0=0sAAAGXQAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA
[14:43:02][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223635/contact.log
# file: agc/production/log/809223635/contact.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAACgAAAAAAAAAA
[14:45:28][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223635/contact.log
# file: agc/production/log/809223635/contact.log
trusted.afr.shared-application-data-client-0=0sAAAELQAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA
[14:43:39][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809224061/contact.log
# file: agc/production/log/809224061/contact.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAACQAAAAAAAAAA
[14:45:32][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809224061/contact.log
# file: agc/production/log/809224061/contact.log
trusted.afr.shared-application-data-client-0=0sAAAD+AAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA
[14:43:42][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809224321/contact.log
# file: agc/production/log/809224321/contact.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAACAAAAAAAAAAA
[14:45:37][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809224321/contact.log
# file: agc/production/log/809224321/contact.log
trusted.afr.shared-application-data-client-0=0sAAAERAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA
[14:43:45][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809215319/event.log
# file: agc/production/log/809215319/event.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAABwAAAAAAAAAA
[14:45:45][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809215319/event.log
# file: agc/production/log/809215319/event.log
trusted.afr.shared-application-data-client-0=0sAAAC/QAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA
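For what it's worth, if I decode one value from each side (I believe each trusted.afr value is three big-endian 32-bit counters: data, metadata and entry pending operations):

    $ echo AAAABQAAAAAAAAAA | base64 -d | od -An -tx1
     00 00 00 05 00 00 00 00 00 00 00 00
    $ echo AAACOwAAAAAAAAAA | base64 -d | od -An -tx1
     00 00 02 3b 00 00 00 00 00 00 00 00

then for that first contact.log web01 is recording 5 pending data operations against client-1 (web02), and web02 is recording 0x23b against client-0 (web01), i.e. each side is accusing the other.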
On Wed, May 18, 2011 at 01:31, Pranith Kumar. Karampuri <[email protected]> wrote:
hi Remi,
It seems the split-brain is detected on the following files:
/agc/production/log/809223185/contact.log
/agc/production/log/809223185/event.log
/agc/production/log/809223635/contact.log
/agc/production/log/809224061/contact.log
/agc/production/log/809224321/contact.log
/agc/production/log/809215319/event.log
Could you give the output of the following command for each file above on both
the bricks in the replica pair?
getfattr -d -m "trusted.afr*" <filepath>
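For example, on each of web01 and web02, from the brick directory (/var/glusterfs/bricks/shared, going by your volfile), something like:

    cd /var/glusterfs/bricks/shared
    for f in \
        agc/production/log/809223185/contact.log \
        agc/production/log/809223185/event.log \
        agc/production/log/809223635/contact.log \
        agc/production/log/809224061/contact.log \
        agc/production/log/809224321/contact.log \
        agc/production/log/809215319/event.log
    do
        getfattr -d -m "trusted.afr*" "$f"
    done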
Thanks
Pranith
----- Original Message -----
From: "Remi Broemeling" < [email protected] >
To: [email protected]
Sent: Tuesday, May 17, 2011 9:02:44 PM
Subject: Re: [Gluster-users] Rebuild Distributed/Replicated Setup
Hi Pranith. Sure, here is a pastebin sampling of logs from one of the hosts:
http://pastebin.com/1U1ziwjC
On Mon, May 16, 2011 at 20:48, Pranith Kumar. Karampuri <[email protected]> wrote:
hi Remi,
Would it be possible to post the logs from the client, so that we can see what
issue you are running into?
Pranith
----- Original Message -----
From: "Remi Broemeling" < [email protected] >
To: [email protected]
Sent: Monday, May 16, 2011 10:47:33 PM
Subject: [Gluster-users] Rebuild Distributed/Replicated Setup
Hi,
I've got a distributed/replicated GlusterFS v3.1.2 (installed via RPM) setup
across two servers (web01 and web02) with the following vol config:
volume shared-application-data-client-0
    type protocol/client
    option remote-host web01
    option remote-subvolume /var/glusterfs/bricks/shared
    option transport-type tcp
    option ping-timeout 5
end-volume

volume shared-application-data-client-1
    type protocol/client
    option remote-host web02
    option remote-subvolume /var/glusterfs/bricks/shared
    option transport-type tcp
    option ping-timeout 5
end-volume

volume shared-application-data-replicate-0
    type cluster/replicate
    subvolumes shared-application-data-client-0 shared-application-data-client-1
end-volume

volume shared-application-data-write-behind
    type performance/write-behind
    subvolumes shared-application-data-replicate-0
end-volume

volume shared-application-data-read-ahead
    type performance/read-ahead
    subvolumes shared-application-data-write-behind
end-volume

volume shared-application-data-io-cache
    type performance/io-cache
    subvolumes shared-application-data-read-ahead
end-volume

volume shared-application-data-quick-read
    type performance/quick-read
    subvolumes shared-application-data-io-cache
end-volume

volume shared-application-data-stat-prefetch
    type performance/stat-prefetch
    subvolumes shared-application-data-quick-read
end-volume

volume shared-application-data
    type debug/io-stats
    subvolumes shared-application-data-stat-prefetch
end-volume
In total, four servers mount this via the GlusterFS FUSE client. For whatever reason
(I'm really not sure why), the GlusterFS filesystem has run into a bit of a
split-brain nightmare (although to my knowledge an actual split-brain situation
has never occurred in this environment), and I have been getting persistent
corruption issues across the filesystem as well as complaints that the
filesystem cannot be self-healed.
What I would like to do is completely empty one of the two servers (here I am
trying to empty server web01), making the other one (in this case web02) the
authoritative source for the data, and then have web01 completely rebuild its
mirror directly from web02.
What's the easiest/safest way to do this? Is there a command that I can run
that will force web01 to re-initialize its mirror directly from web02 (and
thus completely eradicate all of the split-brain errors and data
inconsistencies)?
Thanks!
--
Remi Broemeling
System Administrator
Clio - Practice Management Simplified
1-888-858-2546 x(2^5) | [email protected]
www.goclio.com | blog | twitter | facebook
_______________________________________________
Gluster-users mailing list
[email protected]
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users