Dear All,
We are facing a problem in our computer room. We have 6 servers that act as
bricks for GlusterFS, configured as follows:
OS: CentOS 6.2 x86_64
Kernel: 2.6.32-220.4.2.el6.x86_64
Gluster RPM packages:
glusterfs-core-3.2.5-2.el6.x86_64
glusterfs-rdma-3.2.5-2.el6.x86_64
glusterfs-geo-replication-3.2.5-2.el6.x86_64
glusterfs-fuse-3.2.5-2.el6.x86_64
Each server contributes an XFS filesystem to the global volume, and the
transport mechanism is RDMA:
gluster volume create HPC_data transport rdma \
    pleiades01:/data pleiades02:/data pleiades03:/data \
    pleiades04:/data pleiades05:/data pleiades06:/data
Each server mounts the volume through the FUSE driver on a dedicated mount
point, according to the following fstab entry:
pleiades01:/HPC_data /HPCdata glusterfs defaults,_netdev 0 0
We are running MongoDB on top of the Gluster volume for performance testing,
and the speed is definitely high. Unfortunately, when we run a large
mongoimport job, the GlusterFS volume hangs completely shortly after the job
starts and becomes inaccessible from every node. After some time the following
error is logged in /var/log/messages:
Mar 8 08:16:03 pleiades03 kernel: INFO: task mongod:5508 blocked for more than 120 seconds.
Mar 8 08:16:03 pleiades03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 8 08:16:03 pleiades03 kernel: mongod D 0000000000000007 0 5508 1 0x00000000
Mar 8 08:16:03 pleiades03 kernel: ffff881709b95de8 0000000000000086 0000000000000000 0000000000000008
Mar 8 08:16:03 pleiades03 kernel: ffff881709b95d68 ffffffff81090a7f ffff8816b6974cc0 0000000000000000
Mar 8 08:16:03 pleiades03 kernel: ffff8817fdd81af8 ffff881709b95fd8 000000000000f4e8 ffff8817fdd81af8
Mar 8 08:16:03 pleiades03 kernel: Call Trace:
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81090a7f>] ? wake_up_bit+0x2f/0x40
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81090d7e>] ? prepare_to_wait+0x4e/0x80
Mar 8 08:16:03 pleiades03 kernel: [<ffffffffa112c6b5>] fuse_set_nowrite+0xa5/0xe0 [fuse]
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
Mar 8 08:16:03 pleiades03 kernel: [<ffffffffa112fd48>] fuse_fsync_common+0xa8/0x180 [fuse]
Mar 8 08:16:03 pleiades03 kernel: [<ffffffffa112fe30>] fuse_fsync+0x10/0x20 [fuse]
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff811a52d1>] vfs_fsync_range+0xa1/0xe0
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff811a537d>] vfs_fsync+0x1d/0x20
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81144421>] sys_msync+0x151/0x1e0
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
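To make reports like the one above easier to compare between nodes, a small
Python sketch along these lines could extract the blocked task and its call
chain from /var/log/messages. It is only an illustration: the function name
parse_hung_tasks and both regular expressions are our own assumptions about
the log layout, and it expects one frame per line, as syslog writes them.
Frames the kernel marks with "?" are skipped because they are flagged as
unreliable.

```python
import re

# Header line printed by the kernel's hung-task detector.
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# One call-trace frame, e.g. "[<ffffffffa112fe30>] fuse_fsync+0x10/0x20 [fuse]".
# A "?" before the symbol marks a frame the kernel considers unreliable.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?(?P<sym>[A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?"
)

def parse_hung_tasks(lines):
    """Return one dict per hung-task report, holding its reliable frames."""
    reports = []
    current = None
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "seconds": int(header.group("secs")),
                "frames": [],  # innermost frame first, as printed by the kernel
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None and not frame.group("unreliable"):
            current["frames"].append((frame.group("sym"), frame.group("module")))
    return reports

# Lines copied from the report in this message, one frame per line.
SAMPLE = """\
Mar 8 08:16:03 pleiades03 kernel: INFO: task mongod:5508 blocked for more than 120 seconds.
Mar 8 08:16:03 pleiades03 kernel: Call Trace:
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81090a7f>] ? wake_up_bit+0x2f/0x40
Mar 8 08:16:03 pleiades03 kernel: [<ffffffffa112c6b5>] fuse_set_nowrite+0xa5/0xe0 [fuse]
Mar 8 08:16:03 pleiades03 kernel: [<ffffffffa112fd48>] fuse_fsync_common+0xa8/0x180 [fuse]
Mar 8 08:16:03 pleiades03 kernel: [<ffffffffa112fe30>] fuse_fsync+0x10/0x20 [fuse]
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff811a52d1>] vfs_fsync_range+0xa1/0xe0
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff811a537d>] vfs_fsync+0x1d/0x20
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff81144421>] sys_msync+0x151/0x1e0
Mar 8 08:16:03 pleiades03 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
""".splitlines()

if __name__ == "__main__":
    for report in parse_hung_tasks(SAMPLE):
        # Print the chain from the syscall entry down to the blocking function.
        chain = " -> ".join(sym for sym, _ in reversed(report["frames"]))
        print("%s[%d] blocked >%ds: %s"
              % (report["comm"], report["pid"], report["seconds"], chain))
```

Read this way, the chain runs from sys_msync through vfs_fsync into the fuse
module and ends in fuse_set_nowrite, which is where mongod is waiting.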
Any attempt to access the volume from any node is fruitless until the MongoDB
process is killed, and sessions accessing the /HPCdata path freeze on every
node.
In any case, a complete stop (with force) and start of the volume is needed to
make it operational again.
The situation can be reproduced at will.
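Since the trace shows mongod blocked inside msync, a reproducer that does not
need MongoDB might help narrow things down. The following Python sketch is an
untested assumption on our side, not a confirmed trigger: it dirties pages in
a shared memory mapping and flushes them with msync, which enters the kernel
through the same sys_msync path. The function name msync_loop and its
parameters are hypothetical.

```python
import mmap
import os
import tempfile
import time

def msync_loop(path, size=1 << 20, iterations=100, page=4096):
    """Dirty one page per iteration in a shared mapping and msync it.

    Returns the duration in seconds of the slowest flush. size must be at
    least one page.
    """
    pages = size // page
    slowest = 0.0
    with open(path, "w+b") as f:
        f.truncate(size)  # the file must have its final size before mapping
        # On Unix the default mapping is shared and read/write.
        with mmap.mmap(f.fileno(), size) as m:
            for i in range(iterations):
                off = (i % pages) * page
                m[off:off + 8] = i.to_bytes(8, "little")  # dirty the page
                start = time.monotonic()
                m.flush()  # msync over the whole mapping
                slowest = max(slowest, time.monotonic() - start)
    return slowest

if __name__ == "__main__":
    # Local dry run on a temporary file; it only checks that the loop works.
    with tempfile.TemporaryDirectory() as d:
        scratch = os.path.join(d, "msync_test.bin")
        print("slowest msync: %.3f s" % msync_loop(scratch))
```

To exercise the FUSE path, msync_loop would have to be pointed at a scratch
file under the /HPCdata mount point, with size and iterations raised until the
behaviour resembles the mongoimport load.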
Is anybody able to help us? What additional information could we collect to
help diagnose the problem?
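One thing we could collect while the volume is hung is the kernel stack of
every task in uninterruptible sleep (state D), on each node. The Python sketch
below is a suggestion only and assumes that the kernel exposes
/proc/<pid>/task/<tid>/stack and that the script runs as root. The helper names
parse_stat and blocked_tasks are our own.

```python
import os

def parse_stat(text):
    """Split a /proc stat line into (task id, command name, state).

    The command name is wrapped in parentheses and may itself contain spaces
    or parentheses, so use the first "(" and the last ")".
    """
    lpar = text.index("(")
    rpar = text.rindex(")")
    return int(text[:lpar]), text[lpar + 1:rpar], text[rpar + 1:].split()[0]

def blocked_tasks(proc="/proc"):
    """Return (task id, command, kernel stack) for every task in state D."""
    found = []
    if not os.path.isdir(proc):
        return found
    for pid_dir in os.listdir(proc):
        if not pid_dir.isdigit():
            continue
        # Threads are listed per process; a hung task may be a thread.
        task_root = os.path.join(proc, pid_dir, "task")
        try:
            tids = os.listdir(task_root)
        except OSError:
            continue  # the process exited during the scan
        for tid in tids:
            base = os.path.join(task_root, tid)
            try:
                with open(os.path.join(base, "stat")) as f:
                    task_id, comm, state = parse_stat(f.read())
            except (OSError, ValueError, IndexError):
                continue  # task gone or stat line malformed
            if state != "D":
                continue
            try:
                with open(os.path.join(base, "stack")) as f:
                    stack = f.read()
            except OSError:
                stack = "(kernel stack not readable)"
            found.append((task_id, comm, stack))
    return found

if __name__ == "__main__":
    for task_id, comm, stack in blocked_tasks():
        print("%s[%d] is in state D:" % (comm, task_id))
        print(stack)
```

Running this on every node during a hang should show whether the glusterfs
client and glusterfsd processes are blocked as well, and where.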
Thanks a lot
Alessio
_______________________________________________
Gluster-users mailing list
[email protected]
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users