Re: [Gluster-users] libgfapi failover problem on replica bricks

Pranith Kumar Karampuri Wed, 06 Aug 2014 05:12:07 -0700


On 08/05/2014 02:33 PM, Roman wrote:

Waited long enough for now, still different sizes and no logs about healing :(


stor1
# file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921

root@stor1:~# du -sh /exports/fast-test/150G/images/127/
1.2G    /exports/fast-test/150G/images/127/


stor2
# file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921


root@stor2:~# du -sh /exports/fast-test/150G/images/127/
1.4G    /exports/fast-test/150G/images/127/
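
A reading aid for the two getfattr dumps above: in this generation of AFR (replicate), each trusted.afr.<volume>-client-N value is generally understood to be 12 bytes holding three big-endian 32-bit counters for pending data, metadata, and entry operations against that brick. The sketch below decodes such a value under that assumption. The function names and the returned dictionary layout are illustrative only and are not part of any GlusterFS tooling.

```python
import struct


def decode_afr_changelog(hex_value):
    """Split a trusted.afr.* value (as printed by getfattr -e hex) into counters.

    Assumes the 12-byte layout: data, metadata, entry (big-endian uint32 each).
    """
    text = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


def needs_heal(hex_value):
    """True when any pending counter in the value is non-zero."""
    return any(decode_afr_changelog(hex_value).values())


if __name__ == "__main__":
    # Value copied from the dumps above; all counters are zero.
    value = "0x000000000000000000000000"
    print(decode_afr_changelog(value))
    print(needs_heal(value))
```

With every counter at zero on both bricks, neither brick records pending operations against the other, which is consistent with the reply that follows.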

According to the changelogs, the file doesn't need any healing. Could you stop the operations on the VMs and take an md5sum of the file on both these machines?
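
One way to run the checksum comparison suggested above, as a minimal sketch: du reports allocated disk space, which can differ between bricks for reasons other than content (sparse regions, for example), so hashing the file contents is the more direct check. The helper name md5_of_file and the chunk size are assumptions made for illustration; the md5sum command-line tool does the same job.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Pass the image path on the brick as an argument, once on each server.
    for path in sys.argv[1:]:
        print(md5_of_file(path), path)
```

Run it against the same image path on each brick while the VMs are stopped and compare the two digests by hand.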


Pranith

2014-08-05 11:49 GMT+03:00 Pranith Kumar Karampuri:



    On 08/05/2014 02:06 PM, Roman wrote:

    Well, it seems like it doesn't see that changes were made to the
    volume? I created two files, 200 MB and 100 MB (from /dev/zero),
    after I disconnected the first brick. Then I connected it back and
    got these logs:

    [2014-08-05 08:30:37.830150] I
    [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk] 0-glusterfs: No change
    in volfile, continuing
    [2014-08-05 08:30:37.830207] I
    [rpc-clnt.c:1676:rpc_clnt_reconfig] 0-HA-fast-150G-PVE1-client-0:
    changing port to 49153 (from 0)
    [2014-08-05 08:30:37.830239] W [socket.c:514:__socket_rwv]
    0-HA-fast-150G-PVE1-client-0: readv failed (No data available)
    [2014-08-05 08:30:37.831024] I
    [client-handshake.c:1659:select_server_supported_programs]
    0-HA-fast-150G-PVE1-client-0: Using Program GlusterFS 3.3, Num
    (1298437), Version (330)
    [2014-08-05 08:30:37.831375] I
    [client-handshake.c:1456:client_setvolume_cbk]
    0-HA-fast-150G-PVE1-client-0: Connected to 10.250.0.1:49153,
    attached to remote volume '/exports/fast-test/150G'.
    [2014-08-05 08:30:37.831394] I
    [client-handshake.c:1468:client_setvolume_cbk]
    0-HA-fast-150G-PVE1-client-0: Server and Client lk-version
    numbers are not same, reopening the fds
    [2014-08-05 08:30:37.831566] I
    [client-handshake.c:450:client_set_lk_version_cbk]
    0-HA-fast-150G-PVE1-client-0: Server lk version = 1


    [2014-08-05 08:30:37.830150] I
    [glusterfsd-mgmt.c:1584:mgmt_getspec_cbk] 0-glusterfs: No change
    in volfile, continuing
    This line seems weird to me, to be honest.
    I do not see any traffic on the switch interfaces between the gluster
    servers, which means there is no syncing between them.
    I tried to ls -l the files on the client and servers to trigger
    the healing, but with no success. Should I wait more?

    Yes, it should take around 10-15 minutes. Could you provide the
    output of 'getfattr -d -m. -e hex <file-on-brick>' from both bricks?

    Pranith



    2014-08-05 11:25 GMT+03:00 Pranith Kumar Karampuri:


        On 08/05/2014 01:10 PM, Roman wrote:

        Aha! For some reason I was not able to start the VM
        anymore. Proxmox VE told me that it was not able to read the
        qcow2 header because permission was denied. So
        I just deleted that file and created a new VM. The next
        message I got was this:

        These seem to be the messages from when you took down the
        bricks before self-heal completed. Could you rerun the test,
        waiting for self-heals to complete before taking down the next
        brick?

        Pranith



        [2014-08-05 07:31:25.663412] E
        [afr-self-heal-common.c:197:afr_sh_print_split_brain_log]
        0-HA-fast-150G-PVE1-replicate-0: Unable to self-heal
        contents of '/images/124/vm-124-disk-1.qcow2' (possible
        split-brain). Please delete the file from all but the
        preferred subvolume.- Pending matrix:  [ [ 0 60 ] [ 11 0 ] ]
        [2014-08-05 07:31:25.663955] E
        [afr-self-heal-common.c:2262:afr_self_heal_completion_cbk]
        0-HA-fast-150G-PVE1-replicate-0: background  data self-heal
        failed on /images/124/vm-124-disk-1.qcow2
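
A reading aid for the "Pending matrix" in the log above: row i is generally read as the pending-operation counts that brick i records against each brick, so [ [ 0 60 ] [ 11 0 ] ] means brick 0 holds 60 pending operations against brick 1 while brick 1 holds 11 against brick 0. Each brick accusing the other is the pattern reported here as a possible split-brain. The sketch below is a simplified pairwise check written for illustration, with a hypothetical function name; it is not AFR's actual source-selection logic.

```python
def classify_pending(matrix):
    """Return 'clean', 'heal', or 'split-brain' for a square pending matrix.

    matrix[i][j] is the number of operations brick i records as pending
    against brick j. Diagonal entries are ignored.
    """
    size = len(matrix)
    accusations = {
        (i, j)
        for i in range(size)
        for j in range(size)
        if i != j and matrix[i][j] > 0
    }
    if not accusations:
        return "clean"
    # Mutual accusation: i blames j and j blames i.
    if any((j, i) in accusations for (i, j) in accusations):
        return "split-brain"
    return "heal"


if __name__ == "__main__":
    # Matrix copied from the log line above.
    print(classify_pending([[0, 60], [11, 0]]))
```

A one-sided matrix such as [ [ 0 60 ] [ 0 0 ] ] would instead point at brick 0 as the source for healing brick 1.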



        2014-08-05 10:13 GMT+03:00 Pranith Kumar Karampuri:

            I just responded to your earlier mail about how the log
            looks. The message appears in the mount's log file.

            Pranith

            On 08/05/2014 12:41 PM, Roman wrote:

            OK, so I've waited long enough, I think. There was no traffic
            on the switch ports between the servers. I could not find any
            suitable log message about a completed self-heal (waited
            about 30 minutes). I unplugged the other server's UTP
            cable this time and ended up in the same situation:
            root@gluster-test1:~# cat /var/log/dmesg
            -bash: /bin/cat: Input/output error

            brick logs:
            [2014-08-05 07:09:03.005474] I
            [server.c:762:server_rpc_notify]
            0-HA-fast-150G-PVE1-server: disconnecting
            connectionfrom
            pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
            [2014-08-05 07:09:03.005530] I
            [server-helpers.c:729:server_connection_put]
            0-HA-fast-150G-PVE1-server: Shutting down connection
            pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
            [2014-08-05 07:09:03.005560] I
            [server-helpers.c:463:do_fd_cleanup]
            0-HA-fast-150G-PVE1-server: fd cleanup on
            /images/124/vm-124-disk-1.qcow2
            [2014-08-05 07:09:03.005797] I
            [server-helpers.c:617:server_connection_destroy]
            0-HA-fast-150G-PVE1-server: destroyed connection of
            pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0





            2014-08-05 9:53 GMT+03:00 Pranith Kumar Karampuri:

                Do you think it is possible for you to do these
                tests on the latest version, 3.5.2? 'gluster volume
                heal <volname> info' would give you that
                information in versions > 3.5.1.
                Otherwise you will have to check it either from the
                logs (there will be a self-heal completed message in
                the mount logs) or by observing 'getfattr -d -m.
                -e hex <image-file-on-bricks>'.

                Pranith


                On 08/05/2014 12:09 PM, Roman wrote:

                Ok, I understand. I will try this shortly.
                How can I be sure that the healing process is done
                if I am not able to see its status?


                2014-08-05 9:30 GMT+03:00 Pranith Kumar Karampuri:

                    Mounts will do the healing, not the
                    self-heal-daemon. The point, I feel, is that
                    whichever process does the healing needs to have
                    the latest information about the good bricks in
                    this use case. Since for the VM use case the mounts
                    should have the latest information, we should
                    let the mounts do the healing. If the mount
                    accesses the VM image, either through someone doing
                    operations inside the VM or through an explicit
                    stat on the file, it should do the healing.

                    Pranith.


                    On 08/05/2014 10:39 AM, Roman wrote:

                    Hmmm, you told me to turn it off. Did I
                    understand something wrong? After I issued
                    the command you sent me, I was not able to
                    watch the healing process. It said it won't
                    be healed because it's turned off.


                    2014-08-05 5:39 GMT+03:00 Pranith Kumar Karampuri:

                        You didn't mention anything about
                        self-healing. Did you wait until the
                        self-heal is complete?

                        Pranith

                        On 08/04/2014 05:49 PM, Roman wrote:

                        Hi!
                        The result is pretty much the same. I set
                        the switch port down for the 1st server,
                        and it was OK. Then I set it back up and
                        set the other server's port down, and it
                        triggered IO errors on two virtual
                        machines: one with a local root FS but
                        network-mounted storage, and the other
                        with a network root FS. The 1st gave an
                        error on copying to or from the mounted
                        network disk; the other gave me an error
                        even when reading log files:

                        cat: /var/log/alternatives.log:
                        Input/output error

                        Then I reset the KVM VM and it told me
                        there is no boot device. Next I
                        virtually powered it off and then back
                        on, and it booted.

                        By the way, did I have to start/stop volume?

                        >> Could you do the following and test
                        it again?
                        >> gluster volume set <volname>
                        cluster.self-heal-daemon off

                        >>Pranith




                        2014-08-04 14:10 GMT+03:00 Pranith Kumar Karampuri:


                            On 08/04/2014 03:33 PM, Roman wrote:

                            Hello!

                            Facing the same problem as
                            mentioned here:

                            http://supercolony.gluster.org/pipermail/gluster-users/2014-April/039959.html

                            My setup is up and running, so I'm
                            ready to help with feedback.

                            Setup:
                            Proxmox server as client
                            2 physical gluster servers

                            Server side and client side are both
                            currently running glusterfs 3.4.4 from
                            the gluster repo.

                            The problem is:

                            1. Created replica bricks.
                            2. Mounted in Proxmox (tried both
                            Proxmox ways: via GUI and fstab
                            (with backup volume line); btw,
                            while mounting via fstab I'm unable
                            to launch a VM without cache, even
                            though direct-io-mode is enabled
                            in the fstab line).
                            3. Installed a VM.
                            4. Bring one volume down - OK.
                            5. Bring it back up and wait until
                            the sync is done.
                            6. Bring the other volume down -
                            getting IO errors on the VM guest and
                            not able to restore the VM after I
                            reset the VM via the host. It says (no
                            bootable media). After I shut it
                            down (forced) and bring it back up, it
                            boots.

                            Could you do the following and test
                            it again?
                            gluster volume set <volname>
                            cluster.self-heal-daemon off

                            Pranith


                            Need help. Tried 3.4.3 and 3.4.4.
                            Debian packages for 3.4.5 and 3.5.2
                            are still missing (3.5.1 always
                            gives a healing error for some reason).

                            --
                            Best regards,
                            Roman.


                            _______________________________________________
                            Gluster-users mailing list
                            [email protected]
                            http://supercolony.gluster.org/mailman/listinfo/gluster-users






--
Best regards,
Roman.
