Re: [Gluster-users] Not real confident in 3.3

Sean Fulton Sat, 16 Jun 2012 13:48:07 -0700

A little more information on this, which makes it more puzzling:

1) The split-brain message is strange because there are only two server nodes and 1 client node which has mounted the volume via NFS on a floating IP. This was done to guarantee that only one node gets written to at any point in time, so there is zero chance that two nodes were updated simultaneously.

2) I re-ran the cpio command to put a file tree on the gluster volume, then made a tar of the file tree. The tree was not being written to by anyone, and yet every 5 to 10 files tar would report "file changed as I read it." At first I thought there was some sort of healing operation going on, but since I was only writing to one node at a time, then making the backup, I don't see how this was possible.
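
One way to narrow down the tar warning (a rough sketch, not a diagnosis): as far as I know, GNU tar prints "file changed as we read it" when a file's metadata differs between the start and the end of reading it. If the two replicas reported slightly different timestamps or sizes for the same file, repeated stat calls through the mount could disagree even though nothing is writing. The Python below walks a tree and flags files whose size, mtime, or ctime differ across a few consecutive lstat calls. The function name `unstable_files` and its parameters are made up for illustration, and the idea that the replicas disagree is only an assumption to test.

```python
import os
import time


def unstable_files(root, repeats=3, delay=0.0):
    """Return {path: set of (size, mtime_ns, ctime_ns)} for files whose
    metadata differs across `repeats` consecutive lstat calls.

    Hypothetical helper: it only observes what the mount reports and
    does not say which replica answered each call.
    """
    flagged = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            seen = set()
            for _ in range(repeats):
                try:
                    st = os.lstat(path)
                except OSError:
                    # File vanished or is unreadable; stop probing it.
                    break
                seen.add((st.st_size, st.st_mtime_ns, st.st_ctime_ns))
                if delay:
                    time.sleep(delay)
            # More than one distinct tuple means the metadata moved
            # between calls even though nobody was writing.
            if len(seen) > 1:
                flagged[path] = seen
    return flagged
```

Calling `unstable_files("/pub2/test")` on the client would return an empty dict if every file reports stable metadata; any path that shows up is a candidate for the tar warning.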

I've checked the network, resources, etc., and there are no issues there, no packet loss, all machines share the same time via NTP, etc. The OS is SL 6.1.


So this is all very strange behavior.

sean

On 06/16/2012 01:48 PM, Sean Fulton wrote:

I do not mean to be argumentative, but I have to admit a little frustration with Gluster. I know an enormous amount of effort has gone into this product, and I just can't believe that with all the effort behind it and so many people using it, it could be so fragile.
So here goes. Perhaps someone here can point to the error of my ways. I really want this to work because it would be ideal for our environment, but ...
Please note that all of the nodes below are OpenVZ nodes with nfs/nfsd/fuse modules loaded on the hosts.
After spending months trying to get 3.2.5 and 3.2.6 working in a production environment, I gave up on Gluster and went with a Linux-HA/NFS cluster which just works. The problems I had with gluster were strange lock-ups, split brains, and too many instances where the whole cluster was off-line until I reloaded the data.
So with the release of 3.3, I decided to give it another try. I created one replicated volume on my two NFS servers.
I then mounted the volume on a client as follows:
10.10.10.7:/pub2 /pub2 nfs rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0 0
I threw some data at it (find / -mount -print | cpio -pvdum /pub2/test)
Within 10 seconds it locked up solid. No error messages on any of the servers, the client was unresponsive and load on the client was 15+. I restarted glusterd on both of my NFS servers, and the client remained locked. Finally I killed the cpio process on the client. When I started another cpio, it ran further than before, but now the logs on my NFS/Gluster server say:
[2012-06-16 13:37:35.242754] I [afr-self-heal-common.c:1318:afr_sh_missing_entries_lookup_done] 0-pub2-replicate-0: No sources for dir of <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure, in missing entry self-heal, continuing with the rest of the self-heals
[2012-06-16 13:37:35.243315] I [afr-self-heal-common.c:994:afr_sh_missing_entries_done] 0-pub2-replicate-0: split brain found, aborting selfheal of <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
[2012-06-16 13:37:35.243350] E [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-pub2-replicate-0: background data gfid self-heal failed on <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
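
To make entries like these easier to sift, here is a small Python sketch that splits a log entry into its fields and picks out split-brain or error-level entries. The field layout (timestamp, one-letter level, source file:line:function, translator name, message) is inferred only from the log excerpt above and expects one entry per line; the names `parse_line` and `interesting` are hypothetical.

```python
import re

# Layout inferred from the excerpt above, not from any Gluster spec:
# [timestamp] LEVEL [file:line:function] translator: message
LOG_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]\s*"
    r"(?P<level>[A-Z])\s*"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s*"
    r"(?P<xlator>\S+?):\s*"
    r"(?P<msg>.*)"
)


def parse_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["line"] = int(entry["line"])
    return entry


def interesting(lines):
    """Keep entries that are error-level or mention a split brain."""
    hits = []
    for raw in lines:
        entry = parse_line(raw)
        if entry is None:
            continue
        if entry["level"] == "E" or "split brain" in entry["msg"]:
            hits.append(entry)
    return hits
```

Feeding it the lines of the server log, for example `interesting(open(path))` with `path` pointing at whichever log file holds these entries, would return just the split-brain and error entries with their source function and affected path.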
This still seems to be an INCREDIBLY fragile system. Why would it lock solid while copying a large file? Why no errors in the logs?
Am I the only one seeing this kind of behavior?

sean


--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203



