On 01/06/19 9:37 PM, Alan Orth wrote:
Dear Ravi,

The .glusterfs hardlinks/symlinks should be fine. I'm not sure how I could verify them for six bricks and millions of files, though... :\

Hi Alan,

The reason I asked this is because you had mentioned in one of your earlier emails that when you moved content from the old brick to the new one, you had skipped the .glusterfs directory. So I was assuming that when you added this new brick back to the cluster, it might have been missing the .glusterfs entries. If that is the case, one way to verify would be to use a script to check that every file on the brick has a link count of at least 2 and that every directory has a valid symlink inside .glusterfs pointing back to it.
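
Something along these lines could do that check on each brick (an untested sketch; BRICK is just a placeholder for the brick root):

#!/bin/bash
# Untested sketch: sanity-check the .glusterfs backlinks on one brick.
# BRICK is a placeholder; point it at the brick root directory.
BRICK=/path/to/brick

# 1. Regular files should have a hardlink under .glusterfs, so their
#    link count should be at least 2; print any file with only one link.
find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -type f -links 1 -print

# 2. Directories should have a symlink at .glusterfs/<aa>/<bb>/<gfid>
#    pointing back to them; print any directory whose symlink is missing.
find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -type d -print | while read -r dir; do
    [ "$dir" = "$BRICK" ] && continue
    hex=$(getfattr -n trusted.gfid -e hex "$dir" 2>/dev/null | awk -F'0x' '/trusted.gfid/{print $2}')
    [ -z "$hex" ] && { echo "no gfid: $dir"; continue; }
    uuid=$(echo "$hex" | sed 's/\(.\{8\}\)\(.\{4\}\)\(.\{4\}\)\(.\{4\}\)\(.\{12\}\)/\1-\2-\3-\4-\5/')
    [ -L "$BRICK/.glusterfs/${uuid:0:2}/${uuid:2:2}/$uuid" ] || echo "missing symlink: $dir"
done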


I had a small success in fixing some issues with duplicated files on the FUSE mount point yesterday. I read quite a bit about the elastic hashing algorithm that determines which files get placed on which bricks, based on the hash of their filename and the trusted.glusterfs.dht xattr on brick directories (thanks to Joe Julian's blog post and Python script for showing how it works¹). With that knowledge I looked closer at one of the files that was appearing as duplicated on the FUSE mount and found that it also existed on more bricks than its `replica 2` pair. For this particular file I found two "real" copies and several zero-size files with trusted.glusterfs.dht.linkto xattrs. Neither of the "real" copies was on the correct brick as far as the DHT layout is concerned, so I copied one of them to the correct brick, deleted the others and their hard links, and did a `stat` on the file from the FUSE mount point, and it fixed itself. Yay!

Could this have been caused by a replace-brick that got interrupted and didn't finish re-labeling the xattrs?
No, replace-brick only initiates AFR self-heal, which just copies the contents from the other brick(s) of the *same* replica pair into the replaced brick. The link-to files are created by DHT when you rename a file from the client: if the new name hashes to a different brick, DHT does not move the entire file there. Instead it creates the link-to file (the one with the dht.linkto xattr) on the hashed subvolume. The value of this xattr points to the brick that holds the actual data (use `getfattr -e text` to see it for yourself). Perhaps you had attempted a rebalance or remove-brick earlier and interrupted it?
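
For example, on one of the link-to files from your getfattr output below (the path is taken from your mail):

# Show the linkto value as text; it names the replica subvolume
# (e.g. apps-replicate-2) that holds the real data.
getfattr -n trusted.glusterfs.dht.linkto -e text /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg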
Should I be thinking of some heuristics to identify and fix these issues (incorrect brick placement) with a script, or is this something that a fix-layout or repeated volume heals can resolve? I've already completed a full heal on this particular volume this week and it did heal about 1,000,000 files (mostly data and metadata, but about 20,000 entry heals as well).

Maybe you should let the AFR self-heals complete first and then attempt a full rebalance to take care of the dht link-to files. But if the files number in the millions, it could take quite some time to complete.
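
Once the heal counts drop to zero, something like this should do it (substitute your volume name):

# Kick off a full rebalance so files move to their hashed bricks and
# stale link-to files get cleaned up, then keep an eye on progress.
gluster volume rebalance apps start
gluster volume rebalance apps status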

Regards,
Ravi
Thanks for your support,

¹ https://joejulian.name/post/dht-misses-are-expensive/

On Fri, May 31, 2019 at 7:57 AM Ravishankar N <[email protected]> wrote:


    On 31/05/19 3:20 AM, Alan Orth wrote:
    Dear Ravi,

    I spent a bit of time inspecting the xattrs on some files and
    directories on a few bricks for this volume and it looks a bit
    messy. Even if I could make sense of it for a few and potentially
    heal them manually, there are millions of files and directories
    in total so that's definitely not a scalable solution. After a
    few missteps with `replace-brick ... commit force` in the last
    week—one of which on a brick that was dead/offline—as well as
    some premature `remove-brick` commands, I'm unsure how to
    proceed and I'm getting demotivated. It's scary how quickly
    things get out of hand in distributed systems...
    Hi Alan,
    The one good thing about gluster is that the data is always
    available directly on the backend bricks even if your volume has
    inconsistencies at the gluster level. So theoretically, if your
    cluster is FUBAR, you could just create a new volume and copy all
    data onto it via its mount from the old volume's bricks.
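
    Roughly (a rough sketch only; the paths are placeholders, and you
    would copy from just one brick of each replica pair, skipping
    gluster's internal .glusterfs directory):

    # Rough sketch: copy one brick's data into a new volume via its
    # FUSE mount. --min-size=1 skips the zero-byte dht link-to files
    # (it will also skip genuinely empty files, so treat this as
    # illustrative only).
    rsync -aH --exclude='/.glusterfs' --min-size=1 /mnt/gluster/apps/ /mnt/newvolume/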

    I had hoped that bringing the old brick back up would help, but
    by the time I added it again a few days had passed and all the
    brick-id's had changed due to the replace/remove brick commands,
    not to mention that the trusted.afr.$volume-client-xx values were
    now probably pointing to the wrong bricks (?).

    Anyways, a few hours ago I started a full heal on the volume and
    I see that there is a sustained 100MiB/sec of network traffic
    going from the old brick's host to the new one. The completed
    heals reported in the logs look promising too:

    Old brick host:

    # grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E 'Completed (data|metadata|entry) selfheal' | sort | uniq -c
     281614 Completed data selfheal
         84 Completed entry selfheal
     299648 Completed metadata selfheal

    New brick host:

    # grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E 'Completed (data|metadata|entry) selfheal' | sort | uniq -c
     198256 Completed data selfheal
      16829 Completed entry selfheal
     229664 Completed metadata selfheal

    So that's good I guess, though I have no idea how long it will
    take or if it will fix the "missing files" issue on the FUSE
    mount. I've increased cluster.shd-max-threads to 8 to hopefully
    speed up the heal process.
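
    (For reference, that tweak is just a volume-set, and the remaining
    per-brick heal backlog can be watched as well; the volume name here
    is assumed:)

    # raise self-heal daemon parallelism
    gluster volume set apps cluster.shd-max-threads 8
    # show how many entries each brick still has pending heal
    gluster volume heal apps statistics heal-count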
    The afr xattrs should not cause files to disappear from the mount.
    If the xattr names do not match what each AFR subvol expects for
    its children (e.g. in a replica 2 volume, trusted.afr.*-client-{0,1}
    for the 1st subvol, client-{2,3} for the 2nd subvol, and so on),
    then it won't heal the data, that is all. But in your case I see
    some inconsistencies, like one brick having the actual file
    (licenseserver.cfg) and the other having a linkto file (the one
    with the dht.linkto xattr) /in the same replica pair/.

    I'd be happy for any advice or pointers,

    Did you check if the .glusterfs hardlinks/symlinks exist and are
    in order for all bricks?

    -Ravi


    On Wed, May 29, 2019 at 5:20 PM Alan Orth <[email protected]> wrote:

        Dear Ravi,

        Thank you for the link to the blog post series—it is very
        informative and current! If I understand your blog post
        correctly then I think the answer to your previous question
        about pending AFRs is: no, there are no pending AFRs. I have
        identified one file that is a good test case to try to
        understand what happened after I issued the `gluster volume
        replace-brick ... commit force` a few days ago and then added
        the same original brick back to the volume later. This is the
        current state of the replica 2 distribute/replicate volume:

        [root@wingu0 ~]# gluster volume info apps

        Volume Name: apps
        Type: Distributed-Replicate
        Volume ID: f118d2da-79df-4ee1-919d-53884cd34eda
        Status: Started
        Snapshot Count: 0
        Number of Bricks: 3 x 2 = 6
        Transport-type: tcp
        Bricks:
        Brick1: wingu3:/mnt/gluster/apps
        Brick2: wingu4:/mnt/gluster/apps
        Brick3: wingu05:/data/glusterfs/sdb/apps
        Brick4: wingu06:/data/glusterfs/sdb/apps
        Brick5: wingu0:/mnt/gluster/apps
        Brick6: wingu05:/data/glusterfs/sdc/apps
        Options Reconfigured:
        diagnostics.client-log-level: DEBUG
        storage.health-check-interval: 10
        nfs.disable: on

        I checked the xattrs of one file that is missing from the
        volume's FUSE mount (though I can read it if I access its
        full path explicitly), but is present in several of the
        volume's bricks (some with full size, others empty):

        [root@wingu0 ~]# getfattr -d -m. -e hex /mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
        getfattr: Removing leading '/' from absolute path names
        # file: mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
        security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
        trusted.afr.apps-client-3=0x000000000000000000000000
        trusted.afr.apps-client-5=0x000000000000000000000000
        trusted.afr.dirty=0x000000000000000000000000
        trusted.bit-rot.version=0x0200000000000000585a396f00046e15
        trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd

        [root@wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
        getfattr: Removing leading '/' from absolute path names
        # file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
        security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
        trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
        trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
        trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200

        [root@wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
        getfattr: Removing leading '/' from absolute path names
        # file: data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
        security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
        trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
        trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667

        [root@wingu06 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
        getfattr: Removing leading '/' from absolute path names
        # file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
        security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
        trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
        trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
        trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200

        According to the trusted.afr.apps-client-xx xattrs this
        particular file should be on bricks with id "apps-client-3"
        and "apps-client-5". It took me a few hours to realize that
        the brick-id values are recorded in the volume's volfiles in
        /var/lib/glusterd/vols/apps/bricks. After comparing those
        brick-id values with a volfile backup from before the
        replace-brick, I realized that the files are simply on the
        wrong brick now as far as Gluster is concerned. This
        particular file is now on the brick for "apps-client-4". As
        an experiment I copied this one file to the two bricks listed
        in the xattrs and I was then able to see the file from the
        FUSE mount (yay!).
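
        (A quick way to dump that mapping, assuming the key in those
        brick files is literally spelled "brick-id":)

        # print the recorded brick-id for every brick of the volume
        grep -H 'brick-id' /var/lib/glusterd/vols/apps/bricks/*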

        Other than replacing the brick, removing it, and then adding
        the old brick back on the original server, there has been no
        change in the data this entire time. Can I change the brick
        IDs in the volfiles so they reflect where the data actually
        is? Or perhaps script something to reset all the xattrs on
        the files/directories to point to the correct bricks?

        Thank you for any help or pointers,

        On Wed, May 29, 2019 at 7:24 AM Ravishankar N <[email protected]> wrote:


            On 29/05/19 9:50 AM, Ravishankar N wrote:


            On 29/05/19 3:59 AM, Alan Orth wrote:
            Dear Ravishankar,

            I'm not sure if Brick4 had pending AFRs because I don't
            know what that means and it's been a few days so I am
            not sure I would be able to find that information.
            When you find some time, have a look at a blog series
            <http://wp.me/peiBB-6b> I wrote about AFR - in it I've
            tried to explain what one needs to know to debug
            replication-related issues.

            Made a typo error. The URL for the blog is
            https://wp.me/peiBB-6b

            -Ravi


            Anyways, after wasting a few days rsyncing the old
            brick to a new host I decided to just try to add the
            old brick back into the volume instead of bringing it
            up on the new host. I created a new brick directory on
            the old host, moved the old brick's contents into that
            new directory (minus the .glusterfs directory), added
            the new brick to the volume, and then did Vlad's
            find/stat trick¹ from the brick to the FUSE mount point.
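
            (Roughly, that trick amounts to walking the brick and
            stat'ing the corresponding path on the FUSE mount to
            trigger lookups; the paths here are only placeholders:)

            # walk the brick (skipping .glusterfs) and stat each entry
            # via the FUSE mount so gluster looks it up and heals it
            cd /path/to/brick && find . -path ./.glusterfs -prune -o \
                -exec stat /mnt/fusemount/{} \; > /dev/null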

            The interesting problem I have now is that some files
            don't appear in the FUSE mount's directory listings,
            but I can actually list them directly and even read
            them. What could cause that?
            Not sure; there are too many variables in the hacks you
            did to take a guess. You can check whether the contents
            of the .glusterfs folder are in order on the new brick
            (for example, that hardlinks for files and symlinks for
            directories are present, etc.).
            Regards,
            Ravi

            Thanks,

            ¹ https://lists.gluster.org/pipermail/gluster-users/2018-February/033584.html

            On Fri, May 24, 2019 at 4:59 PM Ravishankar N <[email protected]> wrote:


                On 23/05/19 2:40 AM, Alan Orth wrote:
                Dear list,

                I seem to have gotten into a tricky situation.
                Today I brought up a shiny new server with new
                disk arrays and attempted to replace one brick of
                a replica 2 distribute/replicate volume on an
                older server using the `replace-brick` command:

                # gluster volume replace-brick homes
                wingu0:/mnt/gluster/homes
                wingu06:/data/glusterfs/sdb/homes commit force

                The command was successful and I see the new brick
                in the output of `gluster volume info`. The
                problem is that Gluster doesn't seem to be
                migrating the data,

                `replace-brick` definitely must heal (not migrate)
                the data. In your case, data must have been healed
                from Brick-4 to the replaced Brick-3. Are there any
                errors in the self-heal daemon logs of Brick-4's
                node? Does Brick-4 have pending AFR xattrs blaming
                Brick-3? The doc is a bit out of date; the
                replace-brick command internally does all the
                setfattr steps that are mentioned there.
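
                (For example, something like this would show them on
                Brick-4's copy of a suspect file; the path is only a
                placeholder:)

                # dump the AFR xattrs on the brick's copy of a file;
                # a non-zero trusted.afr.*-client-* value means this
                # brick is blaming that client/brick for pending heals
                getfattr -d -m trusted.afr -e hex /data/glusterfs/sdb/homes/path/to/file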

                -Ravi


                and now the original brick that I replaced is no
                longer part of the volume (and a few terabytes of
                data are just sitting on the old brick):

                # gluster volume info homes | grep -E "Brick[0-9]:"
                Brick1: wingu4:/mnt/gluster/homes
                Brick2: wingu3:/mnt/gluster/homes
                Brick3: wingu06:/data/glusterfs/sdb/homes
                Brick4: wingu05:/data/glusterfs/sdb/homes
                Brick5: wingu05:/data/glusterfs/sdc/homes
                Brick6: wingu06:/data/glusterfs/sdc/homes

                I see the Gluster docs have a more complicated
                procedure for replacing bricks that involves
                getfattr/setfattr¹. How can I tell Gluster about
                the old brick? I see that I have a backup of the
                old volfile thanks to yum's rpmsave function if
                that helps.

                We are using Gluster 5.6 on CentOS 7. Thank you
                for any advice you can give.

                ¹ https://docs.gluster.org/en/latest/Administrator%20Guide/Managing%20Volumes/#replace-faulty-brick













--
Alan Orth
[email protected]
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch
"In heaven all the interesting people are missing." ―Friedrich Nietzsche
_______________________________________________
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
