Re: [Gluster-users] Failed rebalance resulting in major problems

Joe Julian Wed, 06 Nov 2013 12:16:17 -0800

On 11/06/2013 11:52 AM, Justin Dossey wrote:

Shawn,
I had a very similar experience with a rebalance on 3.3.1, and it took weeks to get everything straightened out. I would be happy to share the scripts I wrote to correct the permissions issues if you wish, though I'm not sure it would be appropriate to share them directly on this list. Perhaps I should just create a project on Github that is devoted to collecting scripts people use to fix their GlusterFS environments!
After that (awful) experience, I am loath to run further rebalances. I've even spent days evaluating alternatives to GlusterFS, as my experience with this list over the last six months indicates that support for community users is minimal, even in the face of major bugs such as the one with rebalancing and the continuing "gfid different on subvolume" bugs with 3.3.2.

I'm one of the oldest GlusterFS users around here and one of the biggest proponents, and even I have been loath to rebalance until 3.4.1.

There are no open bugs for gfid mismatches that I could find. The last time someone mentioned that error in IRC it was 2am, I was at a convention, and I told the user how to solve that problem (http://irclog.perlgeek.de/gluster/2013-06-14#i_7196149). It was caused by split-brain. If you have a bug, it would be more productive to file it rather than make negative comments about a community of people that have no requirement to help anybody, but do it anyway just because they're nice people.

This is going to sound snarky because it's in text, but I mean this sincerely. If community support is not sufficient, you might consider purchasing support from a company that provides it professionally.

Let me know what you think of the Github thing and I'll proceed appropriately.

Even better, put them up on http://forge.gluster.org

On Tue, Nov 5, 2013 at 9:05 PM, Shawn Heisey <[email protected]> wrote:


    We recently added storage servers to our gluster install, running
    3.3.1 on CentOS 6.  It went from 40TB usable (8x2
    distribute-replicate) to 80TB usable (16x2).  There was a little bit
    over 20TB used space on the volume.

    The add-brick went through without incident, but the rebalance failed
    after moving 1.5TB of the approximately 10TB that needed to be moved.
    A side issue is that it took four days for that 1.5TB to move.  I'm
    aware that gluster has overhead, and that there's only so much speed
    you can get out of gigabit, but a 100Mb/s half-duplex link could have
    copied the data faster if it had been a straight copy.
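The comparison in the paragraph above can be checked with some quick arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10**12 bytes) and a 100 Mb/s link running at its full theoretical line rate, neither of which is stated in the original post, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the transfer-rate comparison.
# Assumptions (not from the original post): decimal units
# (1 TB = 10**12 bytes) and a 100 Mb/s link at full line rate.

TB = 10**12
SECONDS_PER_DAY = 86_400


def observed_rate_mbit(bytes_moved: float, days: float) -> float:
    """Average throughput in megabits per second."""
    return bytes_moved * 8 / (days * SECONDS_PER_DAY) / 1e6


def days_at_rate(bytes_moved: float, mbit_per_s: float) -> float:
    """Days needed to move bytes_moved at a steady mbit_per_s."""
    return bytes_moved * 8 / (mbit_per_s * 1e6) / SECONDS_PER_DAY


# Figures quoted in the post: 1.5TB moved in four days.
rebalance_rate = observed_rate_mbit(1.5 * TB, 4)
straight_copy_days = days_at_rate(1.5 * TB, 100)

print(f"rebalance averaged about {rebalance_rate:.1f} Mb/s")
print(f"100 Mb/s straight copy: about {straight_copy_days:.2f} days")
```

Under those assumptions the rebalance averaged roughly 35 Mb/s, and a saturated 100 Mb/s link would need about a day and a half for the same 1.5TB, which is consistent with the claim.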

    After I discovered that the rebalance had failed, I noticed that there
    were other problems.  There are a small number of completely lost
    files (91 that I know about so far), a huge number of permission
    issues (over 800,000 files changed to 000), and about 32000 files that
    are throwing read errors via the fuse/nfs mount but seem to be
    available directly on bricks.  That last category of problem file has
    the sticky bit set, with almost all of them having ---------T
    permissions.  The good files on bricks typically have the same
    permissions, but are readable by root.  I haven't worked out the
    scripting necessary to automate all the fixing that needs to happen
    yet.
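As a starting point for that kind of scripting, here is a rough, read-only sketch that walks a directory tree and counts regular files in the two permission states described above: mode 000, and sticky-bit-only (which ls shows as ---------T). It is not one of the scripts mentioned in this thread. The function names and the example path are hypothetical, and it only reports counts; it does not fix anything.

```python
import os
import stat
from collections import Counter


def classify_mode(mode: int) -> str:
    """Bucket a file mode into the categories described in the post."""
    perms = stat.S_IMODE(mode)
    if perms == 0o000:
        return "mode-000"
    if perms == stat.S_ISVTX:  # 0o1000, shown by ls as ---------T
        return "sticky-only"
    return "other"


def scan(root: str) -> Counter:
    """Count regular files under root by permission category."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                counts["unreadable"] += 1
                continue
            if stat.S_ISREG(st.st_mode):
                counts[classify_mode(st.st_mode)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical path; substitute a real brick or mount point.
    print(scan("/path/to/brick"))
```

A sticky-only hit on a brick is not necessarily damage: as far as I know, the distribute translator uses files with exactly those permissions as link pointers, so treat the counts as a list of things to inspect rather than a list of things to repair.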

    We really need to know what happened.  We do plan to upgrade to 3.4.1,
    but there were some reasons that we didn't want to upgrade before
    adding storage.

    * Upgrading will result in service interruption to our clients, which
      mount via NFS.  It would likely be just a hiccup, with quick
      failover, but it's still a service interruption.
    * We have a pacemaker cluster providing the shared IP address for NFS
      mounting.  It's running CentOS 6.3.  A "yum upgrade" to upgrade
      gluster will also upgrade to CentOS 6.4.  The pacemaker in 6.4 is
      incompatible with the pacemaker in 6.3, which will likely result in
      longer-than-expected downtime for the shared IP address.
    * We didn't want to risk potential problems with running gluster 3.3.1
      on the existing servers and 3.4.1 on the new servers.
    * We needed the new storage added right away, before we could schedule
      maintenance to deal with the upgrade issues.

    Something that would be extremely helpful would be obtaining the
    services of an expert-level gluster consultant who can look over
    everything we've done to see if there is anything we've done wrong and
    how we might avoid problems in the future.  I don't know how much the
    company can authorize for this, but we obviously want it to be as
    cheap as possible.  We are in Salt Lake City, UT, USA.  It would be
    preferable to have the consultant be physically present at our
    location.

    I'm working on redacting one bit of identifying info from our
    rebalance log, then I can put it up on dropbox for everyone to
    examine.

    Thanks,
    Shawn

    _______________________________________________
    Gluster-users mailing list
    [email protected]
    http://supercolony.gluster.org/mailman/listinfo/gluster-users




--
Justin Dossey
CTO, PodOmatic


