On 11/29/2012 07:26 AM, Gerald Brandt wrote:
> How about an option to throttle/limit the self heal speed? DRBD has a speed
> limit, which very effectively cuts down on the resources needed.
>
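For comparison, DRBD's limiter is a per-resource sync rate set in its
config; a minimal 8.3-style sketch (resource name and rate purely
illustrative):

  resource r0 {
    syncer {
      rate 33M;   # cap background resync bandwidth at roughly 33 MB/s
    }
  }
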
The following commit was added recently to provide something along those
lines, when io-threads is in the server stack:

d8fbd9ec2a674c5bfa80d975dfb328674053f82f perf/io-threads:
least-rate-limit least priority throttling

... a primary difference being that this is cruder (I consider it a
debug tool) and is measured in operations rather than against a measure
of a particular resource. You should be able to read the current value
("cached least rate") from a state dump and then run 'gluster volume set
<volname> performance.least-rate-limit X', where X is some value
relative to that output.
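For example, something like this should work (volume name, grep pattern,
and limit value are purely illustrative, and the statedump output
location varies by install):

  gluster volume statedump myvol
  grep -i 'least rate' /var/run/gluster/*.dump.*
  gluster volume set myvol performance.least-rate-limit 500
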
Brian
> That being said, I have not had a problem with self heal on my VM
> images. Just two days ago, I deleted all images from one brick and let
> the self heal put everything back, rebuilding the entire brick while
> VMs were running, during business hours (a disk failure forced me to
> do it).
>
> Gerald
>
> ----- Original Message -----
>> From: "Joe Julian" <[email protected]>
>> To: [email protected]
>> Sent: Thursday, November 29, 2012 12:37:37 AM
>> Subject: Re: [Gluster-users] Self healing of 3.3.0 cause our 2 bricks
>> replicated cluster freeze (client read/write
>> timeout)
>>
>> OK, listen up everybody. What you're experiencing is not that self
>> heal is a blocking operation. You're either running out of bandwidth,
>> processor, bus... whatever it is, it's not that.
>>
>> That was fixed in commit 1af420c700fbc49b65cf7faceb3270e81cd991ce.
>>
>> So please, get it out of your head that the feature was simply never
>> added. It was. It's been tested successfully by many admins on many
>> different systems.
>>
>> Once it's out of your head that it's a missing feature, PLEASE try to
>> figure out why YOUR system is showing the behavior that you're
>> experiencing. I can't do it. It's not failing for me. Then file a bug
>> report explaining it so these very intelligent guys can figure out a
>> solution. I've seen how that works. When Avati sees a problem, he'll
>> be
>> sitting on the floor in a hallway because it has WiFi and an outlet
>> and
>> he won't even notice that everyone else has gone to lunch, come back,
>> gone to several panels, come back again, and that the expo hall is
>> starting to clear out because the place is closing. He's focused and
>> dedicated. All these guys are very talented and understand this stuff
>> better than I ever can. They will fix the bug if it can be
>> identified.
>>
>> The first step is finding the actual problem instead of pointing to
>> something that you're just guessing isn't there.
>>
>> On 11/28/2012 09:24 PM, ZHANG Cheng wrote:
>>> I dug out a gluster-users mailing-list thread from June 2011 at
>>> http://gluster.org/pipermail/gluster-users/2011-June/008111.html.
>>>
>>> In this post, Marco Agostini said:
>>> ==================================================
>>> Craig Carl told me, three days ago:
>>> ------------------------------------------------------
>>> that happens because Gluster's self heal is a blocking operation.
>>> We
>>> are working on a non-blocking self heal, we are hoping to ship it
>>> in
>>> early September.
>>> ------------------------------------------------------
>>> ==================================================
>>>
>>> Looks like even with the release of 3.3.1, self heal is still a
>>> blocking operation. I am wondering why the official Administration
>>> Guide doesn't warn the reader about such an important thing for
>>> production operation.
>>>
>>>
>>> On Mon, Nov 26, 2012 at 5:46 PM, ZHANG Cheng <[email protected]>
>>> wrote:
>>>> Early this morning our 2-brick replicated cluster had an outage. The
>>>> disk space on one of the brick servers (brick02) was used up. By the
>>>> time we responded to the disk-full alert, the issue had already
>>>> lasted a few hours. We reclaimed some disk space and rebooted the
>>>> brick02 server, expecting that once it came back it would self-heal.
>>>>
>>>> It did start self-healing, but after just a couple of minutes access
>>>> to the gluster filesystem froze. Tons of "nfs: server brick not
>>>> responding, still trying" messages popped up in dmesg. The load
>>>> average on the app servers went up to 200-something from the usual
>>>> 0.10. We had to shut down the brick02 server, or stop the gluster
>>>> server process on it, to get the cluster working again.
>>>>
>>>> How could we deal with this issue? Thanks in advance.
>>>>
>>>> Our gluster setup follows the official doc.
>>>>
>>>> gluster> volume info
>>>>
>>>> Volume Name: staticvol
>>>> Type: Replicate
>>>> Volume ID: fdcbf635-5faf-45d6-ab4e-be97c74d7715
>>>> Status: Started
>>>> Number of Bricks: 1 x 2 = 2
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: brick01:/exports/static
>>>> Brick2: brick02:/exports/static
>>>>
>>>> Underlying filesystem is xfs (on a lvm volume), as:
>>>> /dev/mapper/vg_node-brick on /exports/static type xfs
>>>> (rw,noatime,nodiratime,nobarrier,logbufs=8)
>>>>
>>>> The brick servers don't act as gluster clients.
>>>>
>>>> Our app servers are the gluster clients, mounted via NFS:
>>>> brick:/staticvol on /mnt/gfs-static type nfs
>>>> (rw,noatime,nodiratime,vers=3,rsize=8192,wsize=8192,addr=10.10.10.51)
>>>>
>>>> brick is a DNS round-robin record for brick01 and brick02.
_______________________________________________
Gluster-users mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-users