Dear All,
we are facing a problem in our computer room. We have 6 servers that act as 
bricks for GlusterFS. The servers are configured as follows:

OS: Centos 6.2 x86_64
Kernel: 2.6.32-220.4.2.el6.x86_64

Gluster RPM packages:
glusterfs-core-3.2.5-2.el6.x86_64
glusterfs-rdma-3.2.5-2.el6.x86_64
glusterfs-geo-replication-3.2.5-2.el6.x86_64
glusterfs-fuse-3.2.5-2.el6.x86_64

Each server contributes an XFS filesystem to the global volume, and the 
transport mechanism is RDMA:

gluster volume create HPC_data transport rdma pleiades01:/data pleiades02:/data pleiades03:/data pleiades04:/data pleiades05:/data pleiades06:/data

Each server mounts, using the fuse driver, the volume on a dedicated mount 
point according to the following fstab:

pleiades01:/HPC_data        /HPCdata                glusterfs defaults,_netdev 0 0
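
To confirm on each node that the fstab entry above really resulted in a FUSE 
mount, a check along these lines can be used. It is only a sketch: the 
function name is_gluster_mounted is made up for illustration, and it assumes 
the mount shows up in /proc/mounts with the filesystem type fuse.glusterfs.

```shell
# Hypothetical helper: succeed if MOUNTPOINT is a GlusterFS FUSE mount.
# The optional second argument (mount table to read) exists only so the
# check can be tried against a sample file instead of /proc/mounts.
is_gluster_mounted() {
    mp=$1
    table=${2:-/proc/mounts}
    # Field 2 is the mount point, field 3 the filesystem type.
    awk -v mp="$mp" '
        $2 == mp && $3 == "fuse.glusterfs" { found = 1 }
        END { exit !found }
    ' "$table"
}

# Example (not run here):
#   is_gluster_mounted /HPCdata && echo "/HPCdata is mounted"
```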

We are running mongodb on top of the Gluster volume for performance testing, 
and speed is definitely high. Unfortunately, when we run a large mongoimport 
job, the GlusterFS volume hangs completely shortly after the job starts and 
becomes inaccessible from every node. After some time the following error is 
logged in /var/log/messages:

Mar  8 08:16:03 pleiades03 kernel: INFO: task mongod:5508 blocked for more than 120 seconds.
Mar  8 08:16:03 pleiades03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  8 08:16:03 pleiades03 kernel: mongod        D 0000000000000007     0  5508      1 0x00000000
Mar  8 08:16:03 pleiades03 kernel: ffff881709b95de8 0000000000000086 0000000000000000 0000000000000008
Mar  8 08:16:03 pleiades03 kernel: ffff881709b95d68 ffffffff81090a7f ffff8816b6974cc0 0000000000000000
Mar  8 08:16:03 pleiades03 kernel: ffff8817fdd81af8 ffff881709b95fd8 000000000000f4e8 ffff8817fdd81af8
Mar  8 08:16:03 pleiades03 kernel: Call Trace:
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81090a7f>] ? wake_up_bit+0x2f/0x40
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81090d7e>] ? prepare_to_wait+0x4e/0x80
Mar  8 08:16:03 pleiades03 kernel: [<ffffffffa112c6b5>] fuse_set_nowrite+0xa5/0xe0 [fuse]
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
Mar  8 08:16:03 pleiades03 kernel: [<ffffffffa112fd48>] fuse_fsync_common+0xa8/0x180 [fuse]
Mar  8 08:16:03 pleiades03 kernel: [<ffffffffa112fe30>] fuse_fsync+0x10/0x20 [fuse]
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff811a52d1>] vfs_fsync_range+0xa1/0xe0
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff811a537d>] vfs_fsync+0x1d/0x20
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81144421>] sys_msync+0x151/0x1e0
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
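
The "D" in the mongod line of the trace is the uninterruptible-sleep state. 
To list every task on a node that is stuck in that state, together with the 
kernel function it is waiting in, a filter like the one below may help. This 
is a sketch only; filter_dstate is a name made up here, and it expects input 
in the column order produced by the ps command shown in the comment.

```shell
# Hypothetical helper: keep only lines whose second column (process state)
# starts with "D", i.e. tasks in uninterruptible sleep.
filter_dstate() {
    awk '$2 ~ /^D/ { print }'
}

# Example (not run here): pid, state, wait channel and command name
# of every blocked task on this node.
#   ps -eo pid,stat,wchan:32,comm | filter_dstate
```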

No node can access the volume until the mongodb process is killed; until 
then, every session that touches the /HPCdata path freezes, on every node. 
In any case, a complete stop (force) and start of the volume is needed to 
make it operational again.
The situation can be reproduced at will.
Can anybody help us? Is there any further information we could collect to 
help diagnose the problem?
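
As a starting point for collecting that information, something like the 
following could be run as root on an affected node while the hang is in 
progress. It is a rough sketch under assumptions: collect_hang_info is a 
made-up name, /var/log/glusterfs is assumed to be the log directory, and 
/proc/PID/stack is only readable where the kernel exposes it.

```shell
# Hypothetical helper: gather state about a hung process into a directory.
# Usage: collect_hang_info PID OUTDIR
collect_hang_info() {
    pid=$1
    out=$2
    mkdir -p "$out" || return 1

    # Scheduler state and wait channel of the process.
    ps -o pid,stat,wchan:32,comm -p "$pid" > "$out/ps.txt" 2>&1 || true

    # Kernel stack of the process, where the kernel exposes it.
    cat "/proc/$pid/stack" > "$out/stack.txt" 2>&1 || true

    # Recent kernel messages, including the hung-task reports.
    dmesg 2>&1 | tail -n 200 > "$out/dmesg.txt" || true

    # Volume layout and peer state, only if the gluster CLI is installed.
    if command -v gluster >/dev/null 2>&1; then
        gluster volume info > "$out/volume-info.txt" 2>&1 || true
        gluster peer status > "$out/peer-status.txt" 2>&1 || true
    fi

    # Client and brick logs (log directory is an assumption, see above).
    if [ -d /var/log/glusterfs ]; then
        tar czf "$out/glusterfs-logs.tar.gz" -C /var/log glusterfs \
            2>/dev/null || true
    fi
    return 0
}

# Example (not run here), using the PID from the trace above:
#   collect_hang_info 5508 /root/hang-report
```

Every step tolerates failure, so a missing tool or an unreadable file does 
not stop the rest of the collection.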

Thanks a lot
Alessio 

_______________________________________________
Gluster-users mailing list
[email protected]
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
