Re: [Gluster-users] split-brain on glusterfs running with quorum on server and client

Pranith Kumar Karampuri Fri, 19 Sep 2014 20:19:55 -0700


On 09/19/2014 09:58 PM, Ramesh Natarajan wrote:

I was able to run another set of tests this week and I was able to reproduce the issue again. Going by the extended attributes, I think I ran into the same issue I saw earlier.
Do you think I need to open a bug report?

hi Ramesh,

I already fixed this bug. http://review.gluster.org/8757. We should have the fix in the next 3.5.x release, I believe.


Pranith


Brick 1:

trusted.afr.PL2-client-0=0x000000000000000000000000
trusted.afr.PL2-client-1=0x000000010000000000000000
trusted.afr.PL2-client-2=0x000000010000000000000000
trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c

Brick 2

trusted.afr.PL2-client-0=0x0000125c0000000000000000
trusted.afr.PL2-client-1=0x000000000000000000000000
trusted.afr.PL2-client-2=0x000000000000000000000000
trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c

Brick 3

trusted.afr.PL2-client-0=0x0000125c0000000000000000
trusted.afr.PL2-client-1=0x000000000000000000000000
trusted.afr.PL2-client-2=0x000000000000000000000000
trusted.gfid=0x1cea509b07cc49e9bd28560b5f33032c
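
The counters in the dumps above can be decoded mechanically. The sketch below assumes the usual AFR changelog layout, in which each trusted.afr value is 12 bytes holding three big-endian 32-bit counters (data, metadata, entry), and that client-0, client-1 and client-2 correspond to Brick 1, 2 and 3. decode_afr is a hypothetical helper name used only for illustration, not a Gluster API.

```python
import struct

def decode_afr(hex_value):
    """Split a trusted.afr value into (data, metadata, entry) pending counts.

    Assumption: the value is 12 bytes, three big-endian unsigned 32-bit ints.
    """
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    return struct.unpack(">III", raw)

# Values copied from the Brick 2 dump above.
brick2 = {
    "trusted.afr.PL2-client-0": "0x0000125c0000000000000000",
    "trusted.afr.PL2-client-1": "0x000000000000000000000000",
    "trusted.afr.PL2-client-2": "0x000000000000000000000000",
}

for name, value in brick2.items():
    print(name, decode_afr(value))
```

Read this way, Brick 2 and Brick 3 each record 0x125c (4700) pending data operations against client-0, while Brick 1 records one pending data operation against each of client-1 and client-2.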


[root@ip-172-31-12-218 ~]# gluster volume info
Volume Name: PL1
Type: Replicate
Volume ID: bd351bae-d467-4e8c-bbd2-6a0fe99c346a
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.31.38.189:/data/vol1/gluster-data
Brick2: 172.31.16.220:/data/vol1/gluster-data
Brick3: 172.31.12.218:/data/vol1/gluster-data
Options Reconfigured:
cluster.server-quorum-type: server
network.ping-timeout: 12
nfs.addr-namelookup: off
performance.cache-size: 2147483648
cluster.quorum-type: auto
performance.read-ahead: off
performance.client-io-threads: on
performance.io-thread-count: 64
cluster.eager-lock: on
cluster.server-quorum-ratio: 51%
Volume Name: PL2
Type: Replicate
Volume ID: e6ad8787-05d8-474b-bc78-748f8c13700f
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 172.31.38.189:/data/vol2/gluster-data
Brick2: 172.31.16.220:/data/vol2/gluster-data
Brick3: 172.31.12.218:/data/vol2/gluster-data
Options Reconfigured:
nfs.addr-namelookup: off
cluster.server-quorum-type: server
network.ping-timeout: 12
performance.cache-size: 2147483648
cluster.quorum-type: auto
performance.read-ahead: off
performance.client-io-threads: on
performance.io-thread-count: 64
cluster.eager-lock: on
cluster.server-quorum-ratio: 51%
[root@ip-172-31-12-218 ~]#

*Mount command*

Client

mount -t glusterfs -odefaults,enable-ino32,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log,backupvolfile-server=172.31.38.189,backupvolfile-server=172.31.12.218,background-qlen=256 172.31.16.220:/PL2 /mnt/vm


Server

/dev/xvdf    /data/vol1 xfs defaults,inode64,noatime 1 2
/dev/xvdg   /data/vol2 xfs defaults,inode64,noatime 1 2

*Packages*

Client

rpm -qa | grep gluster
glusterfs-fuse-3.5.2-1.el6.x86_64
glusterfs-3.5.2-1.el6.x86_64
glusterfs-libs-3.5.2-1.el6.x86_64

Server

[root@ip-172-31-12-218 ~]# rpm -qa | grep gluster
glusterfs-3.5.2-1.el6.x86_64
glusterfs-fuse-3.5.2-1.el6.x86_64
glusterfs-api-3.5.2-1.el6.x86_64
glusterfs-server-3.5.2-1.el6.x86_64
glusterfs-libs-3.5.2-1.el6.x86_64
glusterfs-cli-3.5.2-1.el6.x86_64
[root@ip-172-31-12-218 ~]#

On Sat, Sep 6, 2014 at 9:01 AM, Pranith Kumar Karampuri wrote:



    On 09/06/2014 04:53 AM, Jeff Darcy wrote:

            I have a replicate glusterfs setup on 3 bricks (replicate = 3). I
            have client and server quorum turned on. I rebooted one of the 3
            bricks. When it came back up, the client started throwing error
            messages that one of the files went into split brain.

        This is a good example of how split brain can happen even with all
        kinds of quorum enabled.  Let's look at those xattrs.  BTW, thank you
        for a very nicely detailed bug report which includes those.

            BRICK1
            ========
            [root@ip-172-31-38-189 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
            getfattr: Removing leading '/' from absolute path names
            # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
            trusted.afr.PL2-client-0=0x000000000000000000000000
            trusted.afr.PL2-client-1=0x000000010000000000000000
            trusted.afr.PL2-client-2=0x000000010000000000000000
            trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

            BRICK 2
            =======
            [root@ip-172-31-16-220 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
            getfattr: Removing leading '/' from absolute path names
            # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
            trusted.afr.PL2-client-0=0x00000d460000000000000000
            trusted.afr.PL2-client-1=0x000000000000000000000000
            trusted.afr.PL2-client-2=0x000000000000000000000000
            trusted.gfid=0xea950263977e46bf89a0ef631ca139c2
            BRICK 3
            =========
            [root@ip-172-31-12-218 ~]# getfattr -d -m . -e hex /data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
            getfattr: Removing leading '/' from absolute path names
            # file: data/vol2/gluster-data/apache_cp_mm1/logs/access_log.2014-09-05-17_00_00
            trusted.afr.PL2-client-0=0x00000d460000000000000000
            trusted.afr.PL2-client-1=0x000000000000000000000000
            trusted.afr.PL2-client-2=0x000000000000000000000000
            trusted.gfid=0xea950263977e46bf89a0ef631ca139c2

        Here, we see that brick 1 shows a single pending operation for the
        other two, while they show 0xd46 (3398) pending operations for brick 1.
        Here's how this can happen.

        (1) There is exactly one pending operation.

        (2) Brick1 completes the write first, and says so.

        (3) Client sends messages to all three, saying to decrement brick1's
        count.

        (4) All three bricks receive and process that message.

        (5) Brick1 fails.

        (6) Brick2 and brick3 complete the write, and say so.

        (7) Client tells all bricks to decrement remaining counts.

        (8) Brick2 and brick3 receive and process that message.

        (9) Brick1 is dead, so its counts for brick2/3 stay at one.

        (10) Brick2 and brick3 have quorum, with all-zero pending counters.

        (11) Client sends 0xd46 more writes to brick2 and brick3.
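
The sequence above can be replayed with a toy model of the pending counters. This is only a sketch under simplifying assumptions: one data counter per (brick, target) pair, a pre-op that increments every target on each live brick, and a post-op that decrements the targets whose write completed. The names pre_op, post_op, replay and has_clean_source are hypothetical, and the "clean source" test is a simplification of what AFR actually does when it picks a heal source.

```python
BRICKS = (0, 1, 2)  # indices for brick1, brick2, brick3

def pre_op(pending, alive):
    """Mark one pending operation against every brick, on each live brick."""
    for b in alive:
        for target in BRICKS:
            pending[b][target] += 1

def post_op(pending, alive, done):
    """On each live brick, clear one pending mark for every brick in `done`."""
    for b in alive:
        for target in done:
            pending[b][target] -= 1

def replay(extra_writes):
    pending = [[0, 0, 0] for _ in BRICKS]
    alive = {0, 1, 2}
    pre_op(pending, alive)            # (1) exactly one pending operation
    post_op(pending, alive, [0])      # (2)-(4) brick1 done, all three decrement
    alive = {1, 2}                    # (5) brick1 fails
    post_op(pending, alive, [1, 2])   # (6)-(9) only brick2/brick3 decrement
    for _ in range(extra_writes):     # (11) writes land on brick2/brick3 only
        pre_op(pending, alive)
        post_op(pending, alive, [1, 2])
    return pending

def has_clean_source(pending):
    """True if at least one brick is not blamed by any other brick."""
    return any(
        not any(pending[other][candidate] for other in BRICKS if other != candidate)
        for candidate in BRICKS
    )

state = replay(0xd46)
print(state)
print(has_clean_source(state))
```

In this model brick1 ends up with counters [0, 1, 1] and brick2/brick3 with [3398, 0, 0], which matches the xattrs in the report: every brick is blamed by at least one other brick, so no copy can be chosen as the source.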

        Note that at no point did we lose quorum. Note also the tight timing
        required.  If brick1 had failed an instant earlier, it would not have
        decremented its own counter.  If it had failed an instant later, it
        would have decremented brick2's and brick3's as well. If brick1 had
        not finished first, we'd be in yet another scenario.  If delayed
        changelog had been operative, the messages at (3) and (7) would have
        been combined to leave us in yet another scenario.  As far as I can
        tell, we would have been able to resolve the conflict in all those
        cases.

        *** Key point: quorum enforcement does not totally eliminate split
        brain.  It only makes the frequency a few orders of magnitude lower. ***


    Not quite right. After we fixed the bug
    https://bugzilla.redhat.com/show_bug.cgi?id=1066996, the only two
    possible ways to introduce split-brain are:
    1) An implementation bug in changelog xattr marking. I believe that
    to be the case here.
    2) Keep writing to the file from the mount, then:
    a) take brick 1 down and wait until at least one write is successful
    b) bring brick 1 back up and take brick 2 down (self-heal should
    not happen), then wait until at least one write is successful
    c) bring brick 2 back up and take brick 3 down (self-heal should
    not happen), then wait until at least one write is successful
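
Case 2 can be illustrated with the same kind of toy model. The sketch assumes that a successful write with one brick offline leaves one pending mark against the offline brick on each of the two live bricks, and that no self-heal runs between steps. write_while_down and clean_sources are hypothetical names for illustration only.

```python
def write_while_down(pending, down):
    """One successful write while brick `down` is offline.

    The two live bricks still form a quorum, so the write succeeds and each
    of them records a pending operation against the offline brick.
    """
    for brick in range(len(pending)):
        if brick != down:
            pending[brick][down] += 1

def clean_sources(pending):
    """Bricks that no other brick holds a pending count against."""
    n = len(pending)
    return [c for c in range(n)
            if not any(pending[o][c] for o in range(n) if o != c)]

pending = [[0, 0, 0] for _ in range(3)]

write_while_down(pending, 0)        # a) brick 1 down, one write succeeds
write_while_down(pending, 1)        # b) brick 1 back, brick 2 down, no heal
after_b = clean_sources(pending)    # brick 3 (index 2) is still unblamed

write_while_down(pending, 2)        # c) brick 2 back, brick 3 down, no heal
after_c = clean_sources(pending)    # nothing left to heal from

print(after_b, after_c, pending)
```

Under these assumptions a usable source still exists after step b), and it is only step c) that leaves every brick blamed by the other two.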

    With the outcast implementation, case 2 will also be immune to
    split-brain errors.

    Then the only way to get split-brains in afr is implementation
    errors in changelog marking. If we test it thoroughly and fix such
    problems, we can make it immune to split-brain :-).

    Pranith

        So, is there any way to prevent this completely?  Some AFR
        enhancements, such as the oft-promised "outcast" feature[1], might
        have helped.  NSR[2] is immune to this particular problem.  "Policy
        based split brain resolution"[3] might have resolved it automatically
        instead of merely flagging it.  Unfortunately, those are all in the
        future.  For now, I'd say the best approach is to resolve the conflict
        manually and try to move on.  Unless there's more going on than meets
        the eye, recurrence should be very unlikely.

        [1] http://www.gluster.org/community/documentation/index.php/Features/outcast
        [2] http://www.gluster.org/community/documentation/index.php/Features/new-style-replication
        [3] http://www.gluster.org/community/documentation/index.php/Features/pbspbr
        _______________________________________________
        Gluster-users mailing list
        [email protected]
        http://supercolony.gluster.org/mailman/listinfo/gluster-users
