On 08/05/2014 03:07 PM, Roman wrote:
Really, it seems like the same file:
stor1:
a951641c5230472929836f9fcede6b04  /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
stor2:
a951641c5230472929836f9fcede6b04  /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
One thing I've seen from the logs: it looks like Proxmox VE is somehow
connecting to the servers with the wrong version?
[2014-08-05 09:23:45.218550] I
[client-handshake.c:1659:select_server_supported_programs]
0-HA-fast-150G-PVE1-client-0: Using Program GlusterFS 3.3, Num
(1298437), Version (330)
It is the rpc (over-the-network data structures) version, which has not
changed at all since 3.3, so that's not a problem. So what is the
conclusion? Is your test case working now or not?
Pranith
but if I issue:
root@pve1:~# glusterfs -V
glusterfs 3.4.4 built on Jun 28 2014 03:44:57
Seems OK. The servers use 3.4.4 meanwhile:
[2014-08-05 09:23:45.117875] I
[server-handshake.c:567:server_setvolume] 0-HA-fast-150G-PVE1-server:
accepted client from
stor1-9004-2014/08/05-09:23:45:93538-HA-fast-150G-PVE1-client-1-0
(version: 3.4.4)
[2014-08-05 09:23:49.103035] I
[server-handshake.c:567:server_setvolume] 0-HA-fast-150G-PVE1-server:
accepted client from
stor1-8998-2014/08/05-09:23:45:89883-HA-fast-150G-PVE1-client-0-0
(version: 3.4.4)
if this could be the reason, of course.
I did restart Proxmox VE yesterday (just for information).
2014-08-05 12:30 GMT+03:00 Pranith Kumar Karampuri
<[email protected] <mailto:[email protected]>>:
On 08/05/2014 02:33 PM, Roman wrote:
I've waited long enough for now; still different sizes and no logs
about healing :(
stor1
# file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
root@stor1:~# du -sh /exports/fast-test/150G/images/127/
1.2G /exports/fast-test/150G/images/127/
stor2
# file: exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
trusted.afr.HA-fast-150G-PVE1-client-0=0x000000000000000000000000
trusted.afr.HA-fast-150G-PVE1-client-1=0x000000000000000000000000
trusted.gfid=0xf10ad81b58484bcd9b385a36a207f921
root@stor2:~# du -sh /exports/fast-test/150G/images/127/
1.4G /exports/fast-test/150G/images/127/
According to the changelogs, the file doesn't need any healing.
Could you stop the operations on the VMs and take md5sum on both
these machines?
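For example, a minimal check on each server, using the same brick path
as in the getfattr output above:

root@stor1:~# md5sum /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2
root@stor2:~# md5sum /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2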
Pranith
2014-08-05 11:49 GMT+03:00 Pranith Kumar Karampuri
<[email protected] <mailto:[email protected]>>:
On 08/05/2014 02:06 PM, Roman wrote:
Well, it seems like it doesn't see that changes were made to
the volume? I created two files, 200 MB and 100 MB (from
/dev/zero), after I disconnected the first brick. Then I
connected it back and got these logs:
[2014-08-05 08:30:37.830150] I
[glusterfsd-mgmt.c:1584:mgmt_getspec_cbk] 0-glusterfs: No
change in volfile, continuing
[2014-08-05 08:30:37.830207] I
[rpc-clnt.c:1676:rpc_clnt_reconfig]
0-HA-fast-150G-PVE1-client-0: changing port to 49153 (from 0)
[2014-08-05 08:30:37.830239] W [socket.c:514:__socket_rwv]
0-HA-fast-150G-PVE1-client-0: readv failed (No data available)
[2014-08-05 08:30:37.831024] I
[client-handshake.c:1659:select_server_supported_programs]
0-HA-fast-150G-PVE1-client-0: Using Program GlusterFS 3.3,
Num (1298437), Version (330)
[2014-08-05 08:30:37.831375] I
[client-handshake.c:1456:client_setvolume_cbk]
0-HA-fast-150G-PVE1-client-0: Connected to 10.250.0.1:49153,
attached to remote volume '/exports/fast-test/150G'.
[2014-08-05 08:30:37.831394] I
[client-handshake.c:1468:client_setvolume_cbk]
0-HA-fast-150G-PVE1-client-0: Server and Client lk-version
numbers are not same, reopening the fds
[2014-08-05 08:30:37.831566] I
[client-handshake.c:450:client_set_lk_version_cbk]
0-HA-fast-150G-PVE1-client-0: Server lk version = 1
[2014-08-05 08:30:37.830150] I
[glusterfsd-mgmt.c:1584:mgmt_getspec_cbk] 0-glusterfs: No
change in volfile, continuing
This line seems weird to me, to be honest.
I do not see any traffic on the switch interfaces between the
gluster servers, which means there is no syncing between them.
I tried to 'ls -l' the files on the client and the servers to
trigger the healing, but it seems there was no success. Should I
wait longer?
Yes, it should take around 10-15 minutes. Could you provide the
output of 'getfattr -d -m. -e hex <file-on-brick>' from both bricks?
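For example, on each brick server (using the image path from this
thread):

root@stor1:~# getfattr -d -m. -e hex /exports/fast-test/150G/images/127/vm-127-disk-1.qcow2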
Pranith
2014-08-05 11:25 GMT+03:00 Pranith Kumar Karampuri
<[email protected] <mailto:[email protected]>>:
On 08/05/2014 01:10 PM, Roman wrote:
Ahha! For some reason I was not able to start the VM
anymore; Proxmox VE told me that it cannot read the qcow2
header because permission is denied for some reason. So I
just deleted that file and created a new VM. And the next
message I got was this:
It seems these are the messages from when you took down
the bricks before the self-heal. Could you restart the run,
waiting for self-heals to complete before taking down
the next brick?
Pranith
[2014-08-05 07:31:25.663412] E [afr-self-heal-common.c:197:afr_sh_print_split_brain_log] 0-HA-fast-150G-PVE1-replicate-0: Unable to self-heal contents of '/images/124/vm-124-disk-1.qcow2' (possible split-brain). Please delete the file from all but the preferred subvolume. - Pending matrix: [ [ 0 60 ] [ 11 0 ] ]
[2014-08-05 07:31:25.663955] E [afr-self-heal-common.c:2262:afr_self_heal_completion_cbk] 0-HA-fast-150G-PVE1-replicate-0: background data self-heal failed on /images/124/vm-124-disk-1.qcow2
2014-08-05 10:13 GMT+03:00 Pranith Kumar Karampuri
<[email protected] <mailto:[email protected]>>:
I just responded to your earlier mail about what the
log looks like. The log appears in the mount's logfile.
Pranith
On 08/05/2014 12:41 PM, Roman wrote:
OK, so I've waited long enough, I think. There was no
traffic on the switch ports between the servers. I could not
find any suitable log message about a completed
self-heal (I waited about 30 minutes). This time I unplugged
the other server's UTP cable and got into
the same situation:
root@gluster-test1:~# cat /var/log/dmesg
-bash: /bin/cat: Input/output error
brick logs:
[2014-08-05 07:09:03.005474] I
[server.c:762:server_rpc_notify]
0-HA-fast-150G-PVE1-server: disconnecting
connection from
pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
[2014-08-05 07:09:03.005530] I
[server-helpers.c:729:server_connection_put]
0-HA-fast-150G-PVE1-server: Shutting down
connection
pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
[2014-08-05 07:09:03.005560] I
[server-helpers.c:463:do_fd_cleanup]
0-HA-fast-150G-PVE1-server: fd cleanup on
/images/124/vm-124-disk-1.qcow2
[2014-08-05 07:09:03.005797] I
[server-helpers.c:617:server_connection_destroy]
0-HA-fast-150G-PVE1-server: destroyed connection
of
pve1-27649-2014/08/04-13:27:54:720789-HA-fast-150G-PVE1-client-0-0
2014-08-05 9:53 GMT+03:00 Pranith Kumar Karampuri
<[email protected] <mailto:[email protected]>>:
Do you think it is possible for you to do
these tests on the latest version 3.5.2?
'gluster volume heal <volname> info' would
give you that information in versions > 3.5.1.
Otherwise you will have to check it either from
the logs (there will be a 'self-heal completed'
message in the mount logs) or by
observing 'getfattr -d -m. -e hex
<image-file-on-bricks>'.
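For example, with the volume name taken from your logs (on a
3.5.2 install):

gluster volume heal HA-fast-150G-PVE1 info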
Pranith
On 08/05/2014 12:09 PM, Roman wrote:
OK, I understand. I will try this shortly.
How can I be sure that the healing process is
done if I am not able to see its status?
2014-08-05 9:30 GMT+03:00 Pranith Kumar
Karampuri <[email protected]
<mailto:[email protected]>>:
Mounts will do the healing, not the
self-heal-daemon. The point, I feel, is
that whichever process does the healing
must have the latest information about
which bricks are good. For the VM use case
the mounts have that latest information,
so we should let the mounts do the
healing. If the mount accesses the VM
image, either through someone doing
operations inside the VM or through an
explicit stat on the file, it should do
the healing.
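For example, from the client (the Proxmox
mount path below is an assumption;
substitute your actual mount point):

stat /mnt/pve/<storage-id>/images/124/vm-124-disk-1.qcow2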
Pranith.
On 08/05/2014 10:39 AM, Roman wrote:
Hmmm, you told me to turn it off. Did I
misunderstand something? After I issued
the command you sent me, I was not able
to watch the healing process; it said it
won't be healed, because it's turned off.
2014-08-05 5:39 GMT+03:00 Pranith Kumar
Karampuri <[email protected]
<mailto:[email protected]>>:
You didn't mention anything about
self-healing. Did you wait until the
self-heal was complete?
Pranith
On 08/04/2014 05:49 PM, Roman wrote:
Hi!
The result is pretty much the same. I set
the switch port down for the 1st server;
that was OK. Then I set it back up and set
the other server's port off, and it
triggered IO errors on two virtual
machines: one with a local root FS but
network-mounted storage, and the other
with a network root FS. The 1st gave an
error on copying to or from the mounted
network disk; the other just gave me an
error even for reading log files.
cat: /var/log/alternatives.log:
Input/output error
Then I reset the KVM VM and it told me
there is no boot device. Next I virtually
powered it off and back on, and it booted.
By the way, did I have to
start/stop the volume?
>> Could you do the following and
test it again?
>> gluster volume set <volname>
cluster.self-heal-daemon off
>>Pranith
2014-08-04 14:10 GMT+03:00 Pranith
Kumar Karampuri
<[email protected]
<mailto:[email protected]>>:
On 08/04/2014 03:33 PM, Roman
wrote:
Hello!
I'm facing the same problem as
mentioned here:
http://supercolony.gluster.org/pipermail/gluster-users/2014-April/039959.html
My setup is up and running,
so I'm ready to help you back
with feedback.
Setup:
a Proxmox server as the client
and 2 physical gluster servers;
both the server side and the client side
are currently running glusterfs 3.4.4
from the gluster repo.
The problem is:
1. Created replica bricks.
2. Mounted them in Proxmox (tried
both Proxmox ways: via the GUI and via
fstab (with a backup volume line);
by the way, while mounting via fstab
I'm unable to launch a VM without
cache, even though direct-io-mode is
enabled in the fstab line; see the
example fstab line after this list).
3. Installed a VM.
4. Brought one volume down - OK.
5. Brought it back up and waited
for the sync to finish.
6. Brought the other volume down -
got IO errors on the VM guest and
was not able to restore the VM
after I reset it via the host; it
says "no bootable media". After I
shut it down (forced) and bring it
back up, it boots.
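For reference, a sketch of the kind
of fstab line I mean (the mount
point here is just an example;
server names are from this setup):

stor1:/HA-fast-150G-PVE1 /mnt/pve/gluster glusterfs defaults,_netdev,backupvolfile-server=stor2,direct-io-mode=enable 0 0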
Could you do the following and
test it again?
gluster volume set <volname>
cluster.self-heal-daemon off
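That is, with the volume name from
the logs filled in, something like:

gluster volume set HA-fast-150G-PVE1 cluster.self-heal-daemon off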
Pranith
Need help. I've tried 3.4.3 and
3.4.4. Packages for Debian are
still missing for 3.4.5 and for
3.5.2 (3.5.1 always gives a healing
error for some reason).
--
Best regards,
Roman.
_______________________________________________
Gluster-users mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-users