Re: [Gluster-users] Not real confident in 3.3

Sean Fulton Sun, 17 Jun 2012 05:17:51 -0700

It's a replicated volume, but only one client was writing one process to the cluster, so I don't understand how you could have a split brain. The other issue is that while making a tar of the static files on the replicated volume, I kept getting errors from tar that the file changed as we read it. This was content I had copied *to* the cluster, and only one client node was acting on it at a time, so there is no chance anyone or anything was updating the files. And this error was coming up every 6 to 10 files.
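One way to narrow down the tar warning is to check whether the metadata a client sees for a file differs before and after reading it, which is roughly the condition under which GNU tar reports that a file changed. The following is only a sketch for illustration and is not from the original thread: the names `snapshot`, `read_fully`, and `find_unstable` are hypothetical, and comparing size, mtime, and ctime is an assumption about which fields matter.

```python
import os
import sys
from collections import namedtuple

# Fields compared before and after reading (an assumption for this sketch).
Snapshot = namedtuple("Snapshot", "size mtime_ns ctime_ns")


def snapshot(path, stat_fn=os.stat):
    """Capture the metadata fields we want to compare for one file."""
    st = stat_fn(path)
    return Snapshot(st.st_size, st.st_mtime_ns, st.st_ctime_ns)


def read_fully(path):
    """Read the whole file in 1 MiB chunks, discarding the data."""
    with open(path, "rb") as fh:
        while fh.read(1 << 20):
            pass


def find_unstable(paths, stat_fn=os.stat, read_fn=read_fully):
    """Return (path, before, after) for files whose metadata differs
    between a stat taken before reading and one taken after."""
    unstable = []
    for path in paths:
        before = snapshot(path, stat_fn)
        read_fn(path)
        after = snapshot(path, stat_fn)
        if before != after:
            unstable.append((path, before, after))
    return unstable


if __name__ == "__main__":
    # Usage: python check_stat.py FILE [FILE ...]
    for path, before, after in find_unstable(sys.argv[1:]):
        print(f"{path}: {before} -> {after}")
```

If this reports differences for files that nothing is writing to, the inconsistency is in what the mount returns for stat rather than in the file contents.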

All three nodes were part of a Linux-HA NFS cluster that worked flawlessly for weeks, so I feel pretty confident it's not the environment.

I understand the hang could be unrelated, but the two things above cause me concern. Previously when I worked with 3.2.6 and 3.2.6 I had a lot of problems with split brains, "No end-point connected" errors, etc., so I gave up on Gluster. The stuff above, in a test environment, makes me wonder. What could cause this in a closed dev env?


sean


On 06/17/2012 03:42 AM, Brian Candler wrote:

On Sat, Jun 16, 2012 at 04:47:51PM -0400, Sean Fulton wrote:

1) The split-brain message is strange because there are only two
server nodes and 1 client node which has mounted the volume via NFS
on a floating IP. This was done to guarantee that only one node gets
written to at any point in time, so there is zero chance that two
nodes were updated simultaneously.

Are you using a distributed volume, or a replicated volume? Writes to a
replicated volume go to both nodes.

    [586898.273283] INFO: task flush-0:45:633954 blocked for more than 120 seconds.
    [586898.273290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [586898.273295] flush-0:45    D ffff8806037592d0     0 633954      20 0x00000000
    [586898.273304]  ffff88000d1ebbe0 0000000000000046 ffff88000d1ebd6c 0000000000000000
    [586898.273312]  ffff88000d1ebce0 ffffffff81054444 ffff88000d1ebc80 ffff88000d1ebbf0
    [586898.273319]  ffff8806050ac5f8 ffff880603759888 ffff88000d1ebfd8 ffff88000d1ebfd8
    [586898.273326] Call Trace:
    [586898.273335]  [<ffffffff81054444>] ? find_busiest_group+0x244/0xb20
    [586898.273343]  [<ffffffff811ab050>] ? inode_wait+0x0/0x20
    [586898.273349]  [<ffffffff811ab05e>] inode_wait+0xe/0x20

Are you using XFS by any chance?

I started with XFS, because that was what the gluster docs recommend, but
eventually gave up on it.  I can replicate that sort of kernel lockup on a
24-disk MD array within a short space of time - without gluster, just by
throwing four bonnie++ processes at it.

The same tests run with either ext4 or btrfs do not hang, at least not
during two days of continuous testing.

Of course, any kernel problem cannot be the fault of glusterfs, since
glusterfs runs entirely in userland.

Regards,

Brian.


--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203



_______________________________________________
Gluster-users mailing list
[email protected]
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
