It's a replicated volume, but only one client, running a single process,
was writing to the cluster, so I don't understand how a split brain could
occur. The
other issue is that while making a tar of the static files on the
replicated volume, I kept getting errors from tar that the file changed
as we read it. This was content I had copied *to* the cluster, and only
one client node was acting on it at a time, so there is no chance anyone
or anything was updating the files. And this error was coming up every 6
to 10 files.
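
One way to narrow this down is to check whether file metadata really differs between two looks at the same file, which is roughly the condition tar is complaining about. The sketch below is only an illustration: the helper names (`stat_signature`, `find_unstable_files`) are hypothetical, and comparing size, mtime and ctime is an assumption about what matters, not a copy of tar's exact logic. It stats each file, reads it in full, stats it again, and reports any file whose metadata changed in between.

```python
import os


def stat_signature(path):
    """Return the metadata fields compared before and after a read."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns, st.st_ctime_ns)


def find_unstable_files(root):
    """Walk root and list files whose metadata changes while being read."""
    unstable = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            before = stat_signature(path)
            with open(path, "rb") as handle:
                # Read the whole file in chunks, as an archiver would.
                while handle.read(1 << 20):
                    pass
            after = stat_signature(path)
            if before != after:
                # Assumption: any difference here is worth reporting.
                unstable.append((path, before, after))
    return unstable
```

Run against the mounted volume on the client, a non-empty result with nothing else writing would suggest the mount is returning inconsistent metadata for the same file, rather than anything actually modifying the content.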
All three nodes were part of a Linux-HA NFS cluster that worked
flawlessly for weeks, so I feel pretty confident it's not the environment.
I understand the hang could be unrelated, but the two things above
cause me concern. Previously when I worked with 3.2.6 and 3.2.6 I had a
lot of problems with split brains, "No end-point connected" errors,
etc., so I gave up on Gluster. The stuff above, in a test environment,
makes me wonder. What could cause this in a closed dev env?
sean
On 06/17/2012 03:42 AM, Brian Candler wrote:
On Sat, Jun 16, 2012 at 04:47:51PM -0400, Sean Fulton wrote:
1) The split-brain message is strange because there are only two
server nodes and 1 client node which has mounted the volume via NFS
on a floating IP. This was done to guarantee that only one node gets
written to at any point in time, so there is zero chance that two
nodes were updated simultaneously.
Are you using a distributed volume, or a replicated volume? Writes to a
replicated volume go to both nodes.
[586898.273283] INFO: task flush-0:45:633954 blocked for more than 120 seconds.
[586898.273290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[586898.273295] flush-0:45 D ffff8806037592d0 0 633954 20 0x00000000
[586898.273304] ffff88000d1ebbe0 0000000000000046 ffff88000d1ebd6c 0000000000000000
[586898.273312] ffff88000d1ebce0 ffffffff81054444 ffff88000d1ebc80 ffff88000d1ebbf0
[586898.273319] ffff8806050ac5f8 ffff880603759888 ffff88000d1ebfd8 ffff88000d1ebfd8
[586898.273326] Call Trace:
[586898.273335] [<ffffffff81054444>] ? find_busiest_group+0x244/0xb20
[586898.273343] [<ffffffff811ab050>] ? inode_wait+0x0/0x20
[586898.273349] [<ffffffff811ab05e>] inode_wait+0xe/0x20
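
To see how often this happens and which tasks are affected, the hung-task reports can be pulled out of saved kernel log text. This is a minimal sketch, assuming the report format matches the "INFO: task ... blocked for more than ... seconds" line quoted above; the function name `find_hung_tasks` is hypothetical, and the pattern assumes task names contain no spaces.

```python
import re

# Matches e.g. "INFO: task flush-0:45:633954 blocked for more than 120 seconds"
# The trailing ":<digits>" before "blocked" is taken as the PID.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than "
    r"(?P<seconds>\d+)\s+seconds"
)


def find_hung_tasks(log_text):
    """Return (task name, pid, timeout seconds) for each hung-task report."""
    # Collapse whitespace so reports wrapped across lines still match.
    flattened = " ".join(log_text.split())
    return [
        (m.group("name"), int(m.group("pid")), int(m.group("seconds")))
        for m in HUNG_TASK.finditer(flattened)
    ]
```

Feeding it the saved output of dmesg gives one tuple per report, which makes it easy to tell whether it is always the flush thread that blocks.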
Are you using XFS by any chance?
I started with XFS, because that is what the gluster docs recommend, but
eventually gave up on it. I can reproduce that sort of kernel lockup on a
24-disk MD array within a short space of time - without gluster, just by
throwing four bonnie++ processes at it.
The same tests run with either ext4 or btrfs do not hang, at least not
during two days of continuous testing.
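
For anyone without bonnie++ to hand, the idea of putting concurrent write load on the bare filesystem, with gluster out of the picture, can be roughly sketched as below. This is not bonnie++ and is far lighter than it: the names `write_worker` and `run_write_load`, the file naming, and the default sizes are all assumptions for illustration, and it uses four threads in one process rather than four separate processes.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def write_worker(directory, worker_id, file_count, file_size):
    """Write and fsync file_count files; return the number of bytes written."""
    payload = os.urandom(file_size)
    written = 0
    for index in range(file_count):
        path = os.path.join(directory, f"load-{worker_id}-{index}.dat")
        with open(path, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data down to the disk
        written += file_size
    return written


def run_write_load(directory, workers=4, file_count=100, file_size=1 << 20):
    """Run several writers concurrently and return total bytes written."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(write_worker, directory, w, file_count, file_size)
            for w in range(workers)
        ]
        return sum(f.result() for f in futures)
```

Pointing `run_write_load` at a scratch directory on the brick filesystem, while watching the kernel log, is one way to check whether the hang shows up with no gluster processes involved at all.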
Of course, any kernel problem cannot be the fault of glusterfs, since
glusterfs runs entirely in userland.
Regards,
Brian.
--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203