Re: [Gluster-users] Targeted fix-layout?

Dan Bretherton Mon, 28 Jan 2013 06:02:13 -0800

On 01/16/2013 02:56 PM, Dan Bretherton wrote:
On 01/15/2013 08:17 PM, [email protected] wrote:
Date: Tue, 15 Jan 2013 15:17:00 -0500
From: Jeff Darcy <[email protected]>
To: [email protected]
Subject: Re: [Gluster-users] Targeted fix-layout?
Message-ID: <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
On 01/15/2013 01:10 PM, Dan Bretherton wrote:
I am running a fix-layout operation on a volume after seeing errors
mentioning "anomalies" and "holes" in the logs. There is a particular
directory that is giving trouble and I would like to be able to run the
layout fix on that first. Users are experiencing various I/O errors
including "invalid argument" and "Unknown error 526", but after running for
a week the volume wide fix-layout doesn't seem to have reached this
particular directory yet. Fix-layout takes a long time because there are
millions of files in the volume and the CPU load is consistently very high
on all the servers while it is running, sometimes over 20. Therefore I
really need to find a way to target particular directories or speed up the
volume wide fix-layout.
You should be able to do the following command on a client to fix the
layout for just one directory (it's the same xattr used by the rebalance
command).
    setfattr -n distribute.fix.layout -v "anything" /bad/directory
I have no idea what caused these errors but it could be related to the
previous fix-layout operation, which I started following the addition of a
new pair of bricks, not having completed successfully. The problem is that
the rebalance operation on one or more servers often fails before
completing and there is no way (that I know of) to restart or resume the
process on one server. Every time this happens I stop the fix-layout and
start it again, but it has never completed successfully on every server
despite sometimes running for several weeks.
One other possible cause I can think of is my recent policy of using XFS
for new bricks instead of ext4. The reason I think this might be causing
the problem is that none of the other volumes have any XFS bricks yet and
they aren't experiencing any I/O errors. Are there any special mount
options required for XFS, and is there any reason why a volume shouldn't
contain a mixture of ext4 and XFS bricks?
It doesn't seem like that should be a problem, but maybe someoneelse knows
something about ext4/XFS differences that could shed some light.
Thanks Jeff, I'll give that a try.
Should the xattr name be trusted.distribute.fix.layout by the way? When I
try with distribute.fix.layout I get the error "Operation not supported".
-Dan.
On 01/16/2013 09:56 AM, Dan Bretherton wrote:
> Should the xattr name be trusted.distribute.fix.layout by the way? When
> I try with distribute.fix.layout I get the error "Operation not supported".

I just re-examined and re-ran the code, and distribute.fix.layout does
seem to be correct.  You're doing this on the client side, right?  The
other thing to check would be versions, since I hardly ever run a
version that's more than a day old and that often means I'm using
features that haven't made it into a release yet.  I think that one has
been there for a while, though.

Thanks for checking the code for me. I wasn't doing it on the client -
thanks for pointing out my mistake. I have tested the targeted fix layout
on a test volume and verified that there weren't any detrimental effects,
but I don't know how to reproduce the layout errors I am seeing in the
production volume in order to find out if the targeted layout fix actually
works. Unfortunately I haven't been able to try it on the production volume
because the owner doesn't want to risk it. He is also concerned about
performance deteriorating any more, given that the volume wide layout fix
is still going and still slowing things down a lot.
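
For reference, the targeted fix discussed above could be wrapped in a small
shell helper along these lines. This is only a sketch: the xattr name and
value are the ones Jeff gave, and the path must be on a client mount of the
volume, not on a brick. The function names and the DRY_RUN switch are made up
for illustration, and the thread does not say whether subdirectories need
their own call, so the tree variant is an assumption.

```shell
#!/bin/sh
# Hypothetical helpers, not part of GlusterFS.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to run them.

fix_layout() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "setfattr -n distribute.fix.layout -v anything $dir"
    else
        # Same xattr as in the command quoted earlier in the thread.
        setfattr -n distribute.fix.layout -v "anything" "$dir"
    fi
}

fix_layout_tree() {
    # Assumption: apply the fix to the top directory and each subdirectory.
    find "$1" -type d | while IFS= read -r d; do
        fix_layout "$d" || return 1
    done
}
```

Running `fix_layout /bad/directory` with the default DRY_RUN=1 just prints the
setfattr command, which makes it easy to review what would be touched before
trying it on a production volume.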

I had to extend another volume a couple of weeks ago and the layout fix for
that one is now running at the same time. One server now has a load of over
70 most of the time (mostly glusterfsd), but none of the others seem to be
particularly busy. I restarted the server in question but the CPU load
quickly went up to 70 again. I can't see any particular reason why this one
server should be so badly affected by the layout fixing processes. It isn't
a particularly big server, with only five 3TB bricks involved in the two
volumes that were extended. I can't help thinking that this is the reason
why the volume layout fixes are taking such a long time, even though the
rebalance processes run on all the servers simultaneously. Can anyone
suggest a way to troubleshoot this problem? The rebalance logs don't show
anything unusual but glustershd.log has a lot of metadata split-brain
warnings. The brick logs are full of scary looking warnings but none
flagged 'E' or 'C'. The trouble is that I see messages like these on all
the servers, and I can find nothing unusual about the server with a CPU
load of 70.
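
One way to compare the servers' logs more systematically might be to count
messages per severity flag in each log file. The sketch below assumes log
lines of the form "[date time] X [source] name: message", where X is a
single-letter flag such as W, E or C; the function name is made up, and the
format should be checked against the logs from the installed version.

```shell
#!/bin/sh
# Hypothetical helper: count log lines per single-letter severity flag.
# Assumed line format: [2013-01-28 06:02:13.123456] W [file.c:12:func] 0-vol: msg
count_severities() {
    awk '$1 ~ /^\[/ && $3 ~ /^[A-Z]$/ { n[$3]++ }
         END { for (s in n) print s, n[s] }' "$@" | sort
}
```

Running the same function over the brick logs on each server gives one line
per flag with its count, which may make it easier to see whether the server
with the high load really does log a different mix of messages.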


-Dan.

