Re: [Gluster-users] Not real confident in 3.3

Sean Fulton Sun, 17 Jun 2012 05:21:33 -0700

This was a Linux-HA cluster with a floating IP that the clients would mount off of whichever server is active. So I set up a two-node replicated cluster, with the floating IP and heartbeat, and the client mounted the drive over the floating IP. I'm using the NFS server built into gluster. So rpcbind and nfslock are running on the server, but not nfs. The client writes to the one server with the floating IP, and gluster takes care of keeping the volume in sync between the two servers. I thought that was the way to do it.



sean


On 06/16/2012 07:17 PM, Frank Sonntag wrote:

Hi Sean,

What kernel are you using? I had similar trouble (hanging processes) with an
installed CentOS 6.2 system. All the 2.6.32 kernels available in the CentOS
repository gave me these hangs, and only after upgrading to v3 of their kernel
(using the EL repositories) could I fix the problem. I even had these hangs
when using just kernel-based NFS, so my problem could be different of course.
Gluster 3.2.6 now works fine for me (using one replicated and one distributed
volume anyway; I still have trouble getting nufa to work).
BTW I just realized you are using NFS to mount a replicated volume on the
client. Is that right? I don't think this will work since the gluster client is
the component doing the replication (i.e. sending files to both servers).

Frank





On 17/06/2012, at 9:12 AM, Sean Fulton wrote:

Let me re-iterate, I really, really want to see Gluster work for our 
environment. I am hopeful this is something I did or something that can be 
easily fixed.

Yes, there was an error on the client server:

[586898.273283] INFO: task flush-0:45:633954 blocked for more than 120 seconds.
[586898.273290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[586898.273295] flush-0:45    D ffff8806037592d0     0 633954      2    0 0x00000000
[586898.273304]  ffff88000d1ebbe0 0000000000000046 ffff88000d1ebd6c 0000000000000000
[586898.273312]  ffff88000d1ebce0 ffffffff81054444 ffff88000d1ebc80 ffff88000d1ebbf0
[586898.273319]  ffff8806050ac5f8 ffff880603759888 ffff88000d1ebfd8 ffff88000d1ebfd8
[586898.273326] Call Trace:
[586898.273335]  [<ffffffff81054444>] ? find_busiest_group+0x244/0xb20
[586898.273343]  [<ffffffff811ab050>] ? inode_wait+0x0/0x20
[586898.273349]  [<ffffffff811ab05e>] inode_wait+0xe/0x20
[586898.273357]  [<ffffffff814e752f>] __wait_on_bit+0x5f/0x90
[586898.273365]  [<ffffffff811bbd6c>] ? writeback_sb_inodes+0x13c/0x210
[586898.273370]  [<ffffffff811bab28>] inode_wait_for_writeback+0x98/0xc0
[586898.273377]  [<ffffffff81095550>] ? wake_bit_function+0x0/0x50
[586898.273382]  [<ffffffff811bc1f8>] wb_writeback+0x218/0x420
[586898.273389]  [<ffffffff814e637e>] ? thread_return+0x4e/0x7d0
[586898.273394]  [<ffffffff811bc5a9>] wb_do_writeback+0x1a9/0x250
[586898.273402]  [<ffffffff8107e2e0>] ? process_timeout+0x0/0x10
[586898.273407]  [<ffffffff811bc6b3>] bdi_writeback_task+0x63/0x1b0
[586898.273412]  [<ffffffff810953e7>] ? bit_waitqueue+0x17/0xc0
[586898.273419]  [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100
[586898.273424]  [<ffffffff8114cf06>] bdi_start_fn+0x86/0x100
[586898.273429]  [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100
[586898.273434]  [<ffffffff81094f36>] kthread+0x96/0xa0
[586898.273440]  [<ffffffff8100c20a>] child_rip+0xa/0x20
[586898.273445]  [<ffffffff81094ea0>] ? kthread+0x0/0xa0
[586898.273450]  [<ffffffff8100c200>] ? child_rip+0x0/0x20
[root@server-10 ~]#
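
For anyone triaging a similar hang, here is a minimal Python sketch that pulls the hung-task warnings out of saved dmesg output. It assumes the warning has the exact wording shown in the trace above; the function name `find_hung_tasks` and the regular expression are illustrative, not part of any kernel or Gluster tooling.

```python
import re

# Matches the kernel's hung-task warning as it appears in the trace above.
# The task name may itself contain colons (e.g. "flush-0:45"), so the name
# group is greedy and the PID is taken from the last ":<digits>" before
# " blocked".
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(dmesg_text):
    """Return a list of (task_name, pid, seconds) for each warning found."""
    hits = []
    for line in dmesg_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match.group("name"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits


sample = (
    "[586898.273283] INFO: task flush-0:45:633954 "
    "blocked for more than 120 seconds."
)
print(find_hung_tasks(sample))  # [('flush-0:45', 633954, 120)]
```

The task here is a `flush-*` writeback thread, which fits a client stuck waiting for dirty pages to be written out to the mount.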



Here are the file sizes. Secure was big, but was hung for quite a long time:

-rw------- 1 root root         0 Dec 20 10:17 boot.log
-rw------- 1 root utmp 281079168 Jun 15 21:53 btmp
-rw------- 1 root root    337661 Jun 16 16:36 cron
-rw-r--r-- 1 root root         0 Jun  9 18:33 dmesg
-rw-r--r-- 1 root root         0 Jun  9 16:19 dmesg.old
-rw-r--r-- 1 root root     98585 Dec 21 14:32 dracut.log
drwxr-xr-x 5 root root      4096 Dec 21 16:53 glusterfs
drwx------ 2 root root      4096 Mar  1 16:11 httpd
-rw-r--r-- 1 root root    146000 Jun 16 13:36 lastlog
drwxr-xr-x 2 root root      4096 Dec 20 10:35 mail
-rw------- 1 root root   1072902 Jun  9 18:33 maillog
-rw------- 1 root root     50638 Jun 16 12:13 messages
drwxr-xr-x 2 root root      4096 Dec 30 16:14 nginx
drwx------ 3 root root      4096 Dec 20 10:35 samba
-rw------- 1 root root 222214339 Jun 16 13:37 secure
-rw------- 1 root root         0 Sep 13  2011 spooler
-rw------- 1 root root         0 Sep 13  2011 tallylog
-rw-rw-r-- 1 root utmp    114432 Jun 16 13:37 wtmp
-rw------- 1 root root      7015 Jun 16 12:13 yum.log

On 06/16/2012 05:04 PM, Anand Avati wrote:

Was there anything in dmesg on the servers? If you are able to reproduce the hang, can you
get the output of 'gluster volume status <name> callpool' and 'gluster volume status
<name> nfs callpool' ?

How big is the 'log/secure' file? Is it so large that the client was just busy
writing it for a very long time? Are there any signs of disconnections or ping
timeouts in the logs?

Avati

On Sat, Jun 16, 2012 at 10:48 AM, Sean Fulton <[email protected]> wrote:
I do not mean to be argumentative, but I have to admit a little frustration
with Gluster. I know an enormous amount of effort has gone into this product,
and I just can't believe that with all the effort behind it and so many people
using it, it could be so fragile.

So here goes. Perhaps someone here can point to the error of my ways. I really
want this to work because it would be ideal for our environment, but ...

Please note that all of the nodes below are OpenVZ nodes with nfs/nfsd/fuse
modules loaded on the hosts.

After spending months trying to get 3.2.5 and 3.2.6 working in a production
environment, I gave up on Gluster and went with a Linux-HA/NFS cluster which
just works. The problems I had with gluster were strange lock-ups, split
brains, and too many instances where the whole cluster was off-line until I
reloaded the data.

So with the release of 3.3, I decided to give it another try. I created one
replicated volume on my two NFS servers.

I then mounted the volume on a client as follows:
10.10.10.7:/pub2 /pub2 nfs rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0 0
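
To make the mount options in that fstab entry easier to read, here is a small Python sketch that splits the entry into its fields. The helper `parse_fstab_entry` is hypothetical and only handles a single well-formed line; it shows that the mount requests NFSv3 over TCP with `soft` semantics, where requests fail after retries instead of retrying indefinitely.

```python
def parse_fstab_entry(entry):
    """Split one fstab line into device, mount point, type and options."""
    fields = entry.split()
    device, mountpoint, fstype, options = fields[:4]
    opts = {}
    for opt in options.split(","):
        # "key=value" options keep their value; bare flags map to True.
        key, _, value = opt.partition("=")
        opts[key] = value or True
    return {
        "device": device,
        "mountpoint": mountpoint,
        "fstype": fstype,
        "options": opts,
    }


entry = (
    "10.10.10.7:/pub2 /pub2 nfs "
    "rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0 0"
)
parsed = parse_fstab_entry(entry)
print(parsed["fstype"], parsed["options"]["vers"], parsed["options"]["proto"])
print("soft" in parsed["options"])
```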

I threw some data at it (find / -mount -print | cpio -pvdum /pub2/test)

Within 10 seconds it locked up solid. No error messages on any of the servers,
the client was unresponsive and load on the client was 15+. I restarted
glusterd on both of my NFS servers, and the client remained locked. Finally I
killed the cpio process on the client. When I started another cpio, it ran
further than before, but now the logs on my NFS/Gluster server say:

[2012-06-16 13:37:35.242754] I
[afr-self-heal-common.c:1318:afr_sh_missing_entries_lookup_done] 0-pub2-replicate-0:
No sources for dir of <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure,
in missing entry self-heal, continuing with the rest of the self-heals
[2012-06-16 13:37:35.243315] I
[afr-self-heal-common.c:994:afr_sh_missing_entries_done] 0-pub2-replicate-0: split
brain found, aborting selfheal of
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
[2012-06-16 13:37:35.243350] E
[afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-pub2-replicate-0:
background data gfid self-heal failed on
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
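
The entries above follow a regular layout: timestamp, one-letter level, `[file:line:function]`, translator name, then the message. A rough Python sketch for filtering such entries follows, assuming that layout holds and that each entry may be wrapped across several lines as in this mail. The names `ENTRY` and `parse_entries` are made up for illustration and are not part of Gluster.

```python
import re

# One log entry, after line wrapping has been collapsed to single spaces.
ENTRY = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\] "
    r"(?P<xlator>\S+): "
    # The message runs until the next timestamp or the end of the text.
    r"(?P<msg>.*?)(?= \[\d{4}-\d{2}-\d{2} |$)"
)


def parse_entries(text):
    """Return one dict per log entry, tolerating mail-wrapped lines."""
    flat = " ".join(text.split())
    return [m.groupdict() for m in ENTRY.finditer(flat)]


sample = """\
[2012-06-16 13:37:35.243315] I
[afr-self-heal-common.c:994:afr_sh_missing_entries_done] 0-pub2-replicate-0: split
brain found, aborting selfheal of
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
[2012-06-16 13:37:35.243350] E
[afr-self-heal-common.c:2156:afr_self_heal_completion_cbk] 0-pub2-replicate-0:
background data gfid self-heal failed on
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
"""

entries = parse_entries(sample)
# Keep error-level entries and anything mentioning split brain.
for e in entries:
    if e["level"] == "E" or "split brain" in e["msg"]:
        print(e["ts"], e["level"], e["func"], "->", e["msg"])
```

Note that the split-brain message in this excerpt is logged at level `I`, so filtering on `E` alone would miss it.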

This still seems to be an INCREDIBLY fragile system. Why would it lock solid
while copying a large file? Why no errors in the logs?

Am I the only one seeing this kind of behavior?

sean

--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203

_______________________________________________
Gluster-users mailing list
[email protected]
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
