[Gluster-users] easily provoked unrecoverable split brain

Alexis Huxley Sun, 19 Jun 2016 08:31:39 -0700

As per the quickstart guide, I'm setting up a replicated volume on two
test (KVM) VMs fiori2 and torchio2 as follows:


        mkfs -t xfs -i size=512 -f /dev/vdb1          #  on both
        mount /dev/vdb1 /vol/brick0                   #  on both
        gluster peer probe torchio2                   #  on fiori2
        gluster peer probe fiori2                     #  on torchio2
        mkdir /vol/brick0/vmimages                    #  on both
        gluster volume create vmimages replica 2 \
            torchio2:/vol/brick0/vmimages \
            fiori2:/vol/brick0/vmimages               #  on fiori2
        mount -t glusterfs fiori2:/vmimages /mnt      #  on both

Then I pull the virtual network cable out of one host (with 'virsh
domif-setlink fiori2 vnet10 down') and then run:

        ls /mnt                                       #  on both (wait for timeouts to elapse)
        uname -n > /mnt/hostname                      #  on both (create conflict)

Then I put the cable back, wait a bit and then run:

        torchio2# cat /mnt/hostname
        cat: /mnt/hostname: Input/output error
        torchio2# 

I'm deliberately trying to provoke split-brain, so this I/O error
is no surprise.

The real problem comes when I try to recover from it:

        fiori2# gluster volume heal vmimages info
        Brick torchio2:/vol/brick0/vmimages
        / - Is in split-brain
        
        /hostname 
        Number of entries: 2
        
        Brick fiori2:/vol/brick0/vmimages
        / - Is in split-brain
        
        /hostname 
        Number of entries: 2
        
        fiori2# gluster volume heal vmimages split-brain source-brick torchio2:/vol/brick0/vmimages
        'source-brick' option used on a directory (gfid:00000000-0000-0000-0000-000000000001). Performing conservative merge.
        Healing gfid:00000000-0000-0000-0000-000000000001 failed:Operation not permitted.
        Healing gfid:73dce70e-bb3e-40a2-bec9-4741399b6b72 failed:Transport endpoint is not connected.
        Number of healed entries: 0
        fiori2# 

and the I/O error remains.
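
For reference, each gfid in the heal output corresponds to an entry under
the brick's .glusterfs directory, which is useful when looking at the brick
contents directly. A minimal sketch of that mapping, assuming the standard
layout (.glusterfs/<first two hex digits>/<next two>/<gfid>) and the brick
path used above; the helper name gfid_path is hypothetical:

```shell
# Hypothetical helper: print where a brick keeps the .glusterfs entry
# for a given gfid (layout: .glusterfs/<hex 1-2>/<hex 3-4>/<gfid>).
gfid_path() {
    brick=$1
    gfid=$2
    a=$(printf '%s' "$gfid" | cut -c1-2)   # first two hex digits
    b=$(printf '%s' "$gfid" | cut -c3-4)   # next two hex digits
    printf '%s/.glusterfs/%s/%s/%s\n' "$brick" "$a" "$b" "$gfid"
}

# Example with the second gfid reported by the failed heal:
gfid_path /vol/brick0/vmimages 73dce70e-bb3e-40a2-bec9-4741399b6b72
```

Comparing that entry on both servers would show whether the two copies of
/hostname were created with different gfids.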

I've also tried the manual (extended-attribute) method, but that too
produces I/O errors:

        fiori2# getfattr -d -m . -e hex /mnt/hostname
        getfattr: /mnt/hostname: Input/output error
        fiori2# 
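
For reference, the manual method normally reads the changelog attributes
from the file's copy on each brick rather than through the mount. A minimal
sketch, assuming the brick path /vol/brick0/vmimages used above and that the
values are the usual 12-byte trusted.afr.* changelogs (three big-endian
32-bit counters for pending data, metadata and entry operations); the helper
name decode_afr and the sample value are hypothetical:

```shell
# Hypothetical helper: split a trusted.afr.* hex value (as printed by
# getfattr -e hex) into its three pending-operation counters.
decode_afr() {
    hex=${1#0x}                                  # drop the leading 0x
    [ ${#hex} -eq 24 ] || { echo "expected 24 hex digits" >&2; return 1; }
    data=$(printf '%s' "$hex" | cut -c1-8)       # pending data operations
    meta=$(printf '%s' "$hex" | cut -c9-16)      # pending metadata operations
    entry=$(printf '%s' "$hex" | cut -c17-24)    # pending entry operations
    printf 'data=%d metadata=%d entry=%d\n' "0x$data" "0x$meta" "0x$entry"
}

# On each server, read the attributes from the brick copy, not from /mnt:
#   getfattr -d -m . -e hex /vol/brick0/vmimages/hostname
# then decode each trusted.afr.* value it prints, for example (made-up value):
#   decode_afr 0x000000020000000000000000
```

Non-zero counters on both bricks, each blaming the other, would be what a
split-brain looks like at this level.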

I've done some googling, but not turned up any references to
split-brain with "Operation not permitted" or "Transport endpoint is
not connected". Am I doing something wrong? Is this a known bug?
Is there a workaround?

For info, I'm using:

        fiori2# cat /etc/issue
        Ubuntu 16.04 LTS \n \l
        
        fiori2# uname -a
        Linux fiori2 4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        fiori2# dpkg -l | grep gluster
        ii  glusterfs-client  3.7.6-1ubuntu1  amd64  clustered file-system (client package)
        ii  glusterfs-common  3.7.6-1ubuntu1  amd64  GlusterFS common libraries and translator modules
        ii  glusterfs-server  3.7.6-1ubuntu1  amd64  clustered file-system (server package)
        fiori2# 

I understand that two nodes are not optimal; occasional split-brain
is acceptable so long as I can recover from it. Up to now, for
a clustered filesystem on my VM servers, I've been using DRBD+OCFS2,
but the NFS3 interaction has been glitchy, so now I'm doing some
tests with GlusterFS.

Any advice gratefully received! Thanks!

Alexis