On 07/10/11 12:15, Dan Bretherton wrote:
Hello All,
I have replicated-distributed volumes (created with the CLI) spread
over several servers. One of the servers in the cluster has been down
for two weeks due to hardware problems and I am now ready to put it
back into service. The problem is that the files on it are now very
different to the files on its GlusterFS replica; a lot of data has
been added to the GlusterFS volumes in the past two weeks, and several
users have deleted or modified a lot of files as well. Therefore, I
am wondering if it would be better to manually synchronise the files
on the offline server with the files on the live server before
attempting a GlusterFS self heal on the volumes. I know how to
synchronise xattrs using rsync, but I would like to find out if this
procedure is safe before going ahead. My main worry is that GlusterFS
replication might rely on there being differences between the xattrs
on replicated pairs in normal operation, and that making the xattrs
the same would break replication. Can anyone tell me if it is safe to
manually rsync a pair of replicated servers while one of them is offline?
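
For anyone not familiar with the xattrs in question, they can be
inspected with getfattr directly on the bricks. As I understand it, the
trusted.afr.* changelog attributes are what the replication translator
uses to work out which copy of a file is stale. A quick sketch (the
brick path here is made up for illustration):

    # show all trusted.* xattrs on a file inside a brick directory
    # (run on the brick itself, not on the client mount point)
    getfattr -d -m . -e hex /export/brick1/path/to/some/file
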
There is another side to this story that may or may not be relevant.
The hardware vendor doesn't think there is anything wrong with the
server that keeps hanging. Instead, they think that GlusterFS causes
the server to hang when its self-healing is doing a lot of file
synchronisation. I'm not sure whether or not to believe
this, but the suspicion has come about because the server hangs every
time it comes back into service following the replacement of a piece
of hardware (and there is not much left of the original server inside
now). The live (and supposedly non-faulty) server has also hung on a
few occasions during a large GlusterFS self-heal operation (i.e. one
involving a lot of files), and the vendor is understandably unhappy
about the prospect of taking that one apart as well. Both servers
produce ext4 related kernel errors just before they hang. They have
both been upgraded to CentOS 5.7 since the trouble began, and
GlusterFS on all servers has been upgraded from 3.2.3 to 3.2.4. The
vendor suggested manually synchronising the two servers with rsync
before starting glusterd on the server that has been repaired. I have
been trying, without success, to break the server with rsync and
various stress-testing utilities for the past couple of days, so the vendor's
view is that rsync is safe, but a large amount of continuous GlusterFS
file synchronisation is not. I would be happy to use the rsync
approach if it keeps the servers running, as long as it doesn't ruin
my xattrs.
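
To be concrete, the sort of pre-sync the vendor is suggesting would
look something like this, run on the repaired server with glusterd
stopped (the hostname and brick path are made up for illustration, and
the -X option needs rsync 3.x on both ends):

    # pull the brick contents from the live server, preserving xattrs (-X)
    # and deleting files that users removed while this server was down
    rsync -aX --delete server1:/export/brick1/ /export/brick1/
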
Any comments or suggestions would be much appreciated.
Regards
Dan Bretherton.
Dear All-
I managed to re-synchronise the two servers using rsync -X on the advice
of Gluster, and finished off by triggering a GlusterFS self-heal after
starting glusterd on the server that had been down for two weeks.
This avoided the load spikes that can be caused by GlusterFS self-heal,
which may have contributed to the problems I was having with servers
hanging. I discovered that two of the bricks involved contained about
15 million files in total, many of which were created on one server
while the other one was down. It doesn't surprise me that GlusterFS had
problems trying to self-heal all those, especially when most of them
needed to be copied to the server that had been offline.
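
In case it helps anyone doing the same, I triggered the final
self-heal in the usual 3.2 manner, by walking the volume from a client
mount so that every file gets looked up (the mount point below is made
up for illustration):

    # stat every file via the client mount to make AFR self-heal it
    find /mnt/myvol -noleaf -print0 | xargs --null stat >/dev/null
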
I also discovered two ext4 filesystems without journals on the servers
that had been hanging. I don't know if the journals had been damaged by
the repeated hard resets, or if they had never been present since the
filesystems were created. All the ext4 filesystems were created in the
same way so I don't know why two of them ended up without journals. I
was using CentOS 5.5 at the time, and I am worried about the fact that
ext4 was not officially supported in CentOS until 5.6. Those ext4
related kernel errors on the screen after the servers hung certainly
looked scary. I have now upgraded most of the servers to CentOS 5.7 to
avoid any further problems with ext4.
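
For the record, the missing journals showed up with tune2fs; on a
healthy ext4 filesystem the has_journal feature should appear in the
feature list (the device name below is made up):

    # check whether the filesystem has a journal
    tune2fs -l /dev/sdb1 | grep -i features
    # add a journal to a filesystem that lacks one (unmount it first)
    tune2fs -j /dev/sdb1
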
-Dan