Let me reiterate: I really, really want to see Gluster work for our
environment. I am hopeful this is something I did or something that can
be easily fixed.
Yes, there was an error on the client server:
[586898.273283] INFO: task flush-0:45:633954 blocked for more than 120 seconds.
[586898.273290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[586898.273295] flush-0:45 D ffff8806037592d0 0 633954 2 0 0x00000000
[586898.273304] ffff88000d1ebbe0 0000000000000046 ffff88000d1ebd6c 0000000000000000
[586898.273312] ffff88000d1ebce0 ffffffff81054444 ffff88000d1ebc80 ffff88000d1ebbf0
[586898.273319] ffff8806050ac5f8 ffff880603759888 ffff88000d1ebfd8 ffff88000d1ebfd8
[586898.273326] Call Trace:
[586898.273335] [<ffffffff81054444>] ? find_busiest_group+0x244/0xb20
[586898.273343] [<ffffffff811ab050>] ? inode_wait+0x0/0x20
[586898.273349] [<ffffffff811ab05e>] inode_wait+0xe/0x20
[586898.273357] [<ffffffff814e752f>] __wait_on_bit+0x5f/0x90
[586898.273365] [<ffffffff811bbd6c>] ? writeback_sb_inodes+0x13c/0x210
[586898.273370] [<ffffffff811bab28>] inode_wait_for_writeback+0x98/0xc0
[586898.273377] [<ffffffff81095550>] ? wake_bit_function+0x0/0x50
[586898.273382] [<ffffffff811bc1f8>] wb_writeback+0x218/0x420
[586898.273389] [<ffffffff814e637e>] ? thread_return+0x4e/0x7d0
[586898.273394] [<ffffffff811bc5a9>] wb_do_writeback+0x1a9/0x250
[586898.273402] [<ffffffff8107e2e0>] ? process_timeout+0x0/0x10
[586898.273407] [<ffffffff811bc6b3>] bdi_writeback_task+0x63/0x1b0
[586898.273412] [<ffffffff810953e7>] ? bit_waitqueue+0x17/0xc0
[586898.273419] [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100
[586898.273424] [<ffffffff8114cf06>] bdi_start_fn+0x86/0x100
[586898.273429] [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100
[586898.273434] [<ffffffff81094f36>] kthread+0x96/0xa0
[586898.273440] [<ffffffff8100c20a>] child_rip+0xa/0x20
[586898.273445] [<ffffffff81094ea0>] ? kthread+0x0/0xa0
[586898.273450] [<ffffffff8100c200>] ? child_rip+0x0/0x20
[root@server-10 ~]#
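For anyone wanting to pull the hung-task reports out of a longer dmesg capture, here is a minimal Python sketch. It assumes the message format seen in the trace above ("INFO: task <name>:<pid> blocked for more than <N> seconds") and that the task name contains no spaces; the function name find_hung_tasks is only illustrative.

```python
import re

# Format assumed from the trace above. The task name may itself contain
# colons (e.g. "flush-0:45"), so the PID is the last number before " blocked".
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def find_hung_tasks(dmesg_text):
    """Return a list of (task_name, pid, seconds) tuples found in dmesg text."""
    # Collapse whitespace so reports wrapped by a mail client still match.
    flat = " ".join(dmesg_text.split())
    return [
        (m.group("name"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(flat)
    ]

sample = """[586898.273283] INFO: task flush-0:45:633954 blocked for more than 120
seconds."""
print(find_hung_tasks(sample))  # [('flush-0:45', 633954, 120)]
```

Run against a saved copy of the dmesg output, this gives a quick list of which kernel threads stalled and for how long.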
Here are the file sizes. The secure log was big, but the client was hung for quite a long time:
-rw------- 1 root root 0 Dec 20 10:17 boot.log
-rw------- 1 root utmp 281079168 Jun 15 21:53 btmp
-rw------- 1 root root 337661 Jun 16 16:36 cron
-rw-r--r-- 1 root root 0 Jun 9 18:33 dmesg
-rw-r--r-- 1 root root 0 Jun 9 16:19 dmesg.old
-rw-r--r-- 1 root root 98585 Dec 21 14:32 dracut.log
drwxr-xr-x 5 root root 4096 Dec 21 16:53 glusterfs
drwx------ 2 root root 4096 Mar 1 16:11 httpd
-rw-r--r-- 1 root root 146000 Jun 16 13:36 lastlog
drwxr-xr-x 2 root root 4096 Dec 20 10:35 mail
-rw------- 1 root root 1072902 Jun 9 18:33 maillog
-rw------- 1 root root 50638 Jun 16 12:13 messages
drwxr-xr-x 2 root root 4096 Dec 30 16:14 nginx
drwx------ 3 root root 4096 Dec 20 10:35 samba
-rw------- 1 root root 222214339 Jun 16 13:37 secure
-rw------- 1 root root 0 Sep 13 2011 spooler
-rw------- 1 root root 0 Sep 13 2011 tallylog
-rw-rw-r-- 1 root utmp 114432 Jun 16 13:37 wtmp
-rw------- 1 root root 7015 Jun 16 12:13 yum.log
On 06/16/2012 05:04 PM, Anand Avati wrote:
Was there anything in dmesg on the servers? If you are able to
reproduce the hang, can you get the output of 'gluster volume status
<name> callpool' and 'gluster volume status <name> nfs callpool' ?
How big is the 'log/secure' file? Is it so large that the client was
just busy writing it for a very long time? Are there any signs of
disconnections or ping timeouts in the logs?
Avati
On Sat, Jun 16, 2012 at 10:48 AM, Sean Fulton <[email protected]> wrote:
I do not mean to be argumentative, but I have to admit a little
frustration with Gluster. I know an enormous amount of effort has
gone into this product, and I just can't believe that with all the
effort behind it and so many people using it, it could be so fragile.
So here goes. Perhaps someone here can point to the error of my
ways. I really want this to work because it would be ideal for our
environment, but ...
Please note that all of the nodes below are OpenVZ nodes with
nfs/nfsd/fuse modules loaded on the hosts.
After spending months trying to get 3.2.5 and 3.2.6 working in a
production environment, I gave up on Gluster and went with a
Linux-HA/NFS cluster which just works. The problems I had with
gluster were strange lock-ups, split brains, and too many
instances where the whole cluster was off-line until I reloaded
the data.
So with the release of 3.3, I decided to give it another try. I
created one replicated volume on my two NFS servers.
I then mounted the volume on a client as follows:
10.10.10.7:/pub2 /pub2 nfs rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0 0
I threw some data at it (find / -mount -print | cpio -pvdum /pub2/test)
Within 10 seconds it locked up solid. No error messages on any of
the servers, the client was unresponsive and load on the client
was 15+. I restarted glusterd on both of my NFS servers, and the
client remained locked. Finally I killed the cpio process on the
client. When I started another cpio, it ran further than before,
but now the logs on my NFS/Gluster server say:
[2012-06-16 13:37:35.242754] I
[afr-self-heal-common.c:1318:afr_sh_missing_entries_lookup_done]
0-pub2-replicate-0: No sources for dir of
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure, in missing
entry self-heal, continuing with the rest of the self-heals
[2012-06-16 13:37:35.243315] I
[afr-self-heal-common.c:994:afr_sh_missing_entries_done]
0-pub2-replicate-0: split brain found, aborting selfheal of
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
[2012-06-16 13:37:35.243350] E
[afr-self-heal-common.c:2156:afr_self_heal_completion_cbk]
0-pub2-replicate-0: background data gfid self-heal failed on
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
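To list which files the server log flags as split brain, a minimal Python sketch follows. It assumes the message wording shown in the excerpt above ("split brain found, aborting selfheal of <path>"); other GlusterFS versions may word this differently, and the function name split_brain_paths is only illustrative.

```python
import re

# Message text assumed from the log excerpt above.
SPLIT_BRAIN_RE = re.compile(r"split brain found, aborting selfheal of (\S+)")

def split_brain_paths(log_text):
    """Return the unique paths reported as split brain, in first-seen order."""
    # Collapse whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    seen = []
    for path in SPLIT_BRAIN_RE.findall(flat):
        if path not in seen:
            seen.append(path)
    return seen

sample = """[2012-06-16 13:37:35.243315] I
[afr-self-heal-common.c:994:afr_sh_missing_entries_done]
0-pub2-replicate-0: split brain found, aborting selfheal of
<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure"""
print(split_brain_paths(sample))
# ['<gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure']
```

Feeding it the whole log file would show whether log/secure is the only affected entry or one of many.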
This still seems to be an INCREDIBLY fragile system. Why would it
lock solid while copying a large file? Why no errors in the logs?
Am I the only one seeing this kind of behavior?
sean
--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media
Companies
http://www.gcnpublishing.com
(203) 665-6211, x203
_______________________________________________
Gluster-users mailing list
[email protected]
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users