Re: [Gluster-users] Manual rsync before self-heal to prevent repaired server hanging

Dan Bretherton Fri, 30 Mar 2012 17:36:13 -0700


On 07/10/11 12:15, Dan Bretherton wrote:

Hello All,
I have replicated-distributed volumes (created with the CLI) spread over several servers. One of the servers in the cluster has been down for two weeks due to hardware problems and I am now ready to put it back into service. The problem is that the files on it are now very different to the files on its GlusterFS replica; a lot of data has been added to the GlusterFS volumes in the past two weeks, and several users have deleted or modified a lot of files as well. Therefore, I am wondering if it would be better to manually synchronise the files on the off-line server with the files on the live server before attempting a GlusterFS self heal on the volumes. I know how to synchronise xattrs using rsync, but I would like to find out if this procedure is safe before going ahead. My main worry is that GlusterFS replication might rely on there being differences between the xattrs on replicated pairs in normal operation, and that making the xattrs the same would break replication. Can anyone tell me if it is safe to manually rsync a pair of replicated servers while one of them is offline?
There is another side to this story that may or may not be relevant. The hardware vendor doesn't think there is anything wrong with the server that keeps hanging. Instead, they think that GlusterFS causes the server to hang when a lot of file synchronisation by GlusterFS self healing is going on. I'm not sure whether or not to believe this, but the suspicion has come about because the server hangs every time it comes back into service following the replacement of a piece of hardware (and there is not much left of the original server inside now). The live (and supposedly non-faulty) server has also hung on a few occasions during a large GlusterFS self heal operation (i.e. one involving a lot of files), and the vendor is understandably unhappy about the prospect of taking that one apart as well. Both servers produce ext4 related kernel errors just before they hang. They have both been upgraded to CentOS 5.7 since the trouble began, and GlusterFS on all servers has been upgraded from 3.2.3 to 3.2.4. The vendor suggested manually synchronising the two servers with rsync before starting glusterd on the server that has been repaired. I have been trying to break the server with rsync and various stress testing utilities without success for the past couple of days, so the vendor's view is that rsync is safe, but a large amount of continuous GlusterFS file synchronisation is not. I would be happy to use the rsync approach if it keeps the servers running, as long as it doesn't ruin my xattrs.
Any comments or suggestions would be much appreciated.
Regards
Dan Bretherton.


Dear All-

I managed to re-synchronise the two servers using rsync -X on the advice of Gluster, and finished off by triggering a GlusterFS self-heal after switching on glusterd on the server that had been down for two weeks. This avoided the load spikes that can be caused by GlusterFS self-heal, which may have contributed to the problems I was having with servers hanging. I discovered that two of the bricks involved contained about 15 million files in total, many of which were created on one server while the other one was down. It doesn't surprise me that GlusterFS had problems trying to self-heal all those, especially when most of them needed to be copied to the server that had been off line.
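
For anyone wanting to script the same kind of pre-sync, here is a minimal sketch of how the rsync invocation might be assembled. Only the -X option (copy extended attributes) comes from the procedure described above; the -a, -H, --delete and --dry-run options, the brick paths and the host name are assumptions chosen for illustration, and --delete in particular should be checked against a dry run first because it removes files from the destination. The function only builds the command; it does not run it.

```python
import shlex


def build_rsync_command(src_brick, dest_host, dest_brick, dry_run=True):
    """Build an rsync argv that copies one brick directory to another server.

    -X preserves extended attributes (the option mentioned in the post).
    -a, -H and --delete are assumptions: archive mode, hard links, and
    removal of files that no longer exist on the source.
    """
    cmd = ["rsync", "-a", "-H", "-X", "--delete"]
    if dry_run:
        # Assumption: preview the changes before touching the destination.
        cmd.append("--dry-run")
    # A trailing slash makes rsync copy the directory contents, not the
    # directory itself.
    cmd.append(src_brick.rstrip("/") + "/")
    cmd.append(f"{dest_host}:{dest_brick.rstrip('/')}/")
    return cmd


if __name__ == "__main__":
    # Hypothetical brick path and host name, for illustration only.
    argv = build_rsync_command("/data/brick1", "server2", "/data/brick1")
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list can be passed to subprocess.run() on the live server as root, since the xattrs GlusterFS uses are normally only readable and writable by root. Whether a flag set like this suits a particular GlusterFS version is something to confirm before running it without --dry-run.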

I also discovered two ext4 filesystems without journals on the servers that had been hanging. I don't know if the journals had been damaged by the repeated hard resets, or if they had never been present since the filesystems were created. All the ext4 filesystems were created in the same way so I don't know why two of them ended up without journals. I was using CentOS 5.5 at the time, and I am worried about the fact that ext4 was not officially supported in CentOS until 5.6. Those ext4 related kernel errors on the screen after the servers hung certainly looked scary. I have now upgraded most of the servers to CentOS 5.7 to avoid any further problems with ext4.
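
One way to check many filesystems for a missing journal is to look for the has_journal flag in the "Filesystem features:" line printed by tune2fs -l (or dumpe2fs -h). The sketch below only parses text that has already been captured from one of those commands; the function name and the sample feature lists are made up for illustration and are not taken from the servers discussed here.

```python
def has_journal(tune2fs_output):
    """Return True if the 'Filesystem features:' line lists has_journal.

    Expects the text printed by `tune2fs -l <device>` or `dumpe2fs -h <device>`.
    """
    for line in tune2fs_output.splitlines():
        if line.startswith("Filesystem features:"):
            features = line.split(":", 1)[1].split()
            return "has_journal" in features
    raise ValueError("no 'Filesystem features:' line found")


if __name__ == "__main__":
    # Illustrative sample lines, not real output from the affected servers.
    with_journal = "Filesystem features:      has_journal ext_attr extent\n"
    without_journal = "Filesystem features:      ext_attr extent\n"
    print(has_journal(with_journal))
    print(has_journal(without_journal))
```

The captured text could come from subprocess.run(["tune2fs", "-l", device], capture_output=True, text=True).stdout, run as root, with one call per brick device.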


-Dan
_______________________________________________
Gluster-users mailing list
[email protected]
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
