On Sat, Feb 13, 2016 at 12:38 AM, Richard Wareing <[email protected]> wrote:
> Hey,
>
> Sorry for the late reply but I missed this e-mail. With respect to
> identifying locking domains, we use the identical logic that GlusterFS
> itself uses to identify the domains, which is just a simple string
> comparison if I'm not mistaken. System process (SHD/rebalance) locking
> domains are treated identically to any other; this is especially critical
> for things like DHT healing, as this locking domain is used both in
> userland and by SHDs (you cannot disable DHT healing).

We cannot disable DHT healing altogether. But we _can_ identify whether
healing is done by a mount process (on behalf of an application) or by a
rebalance process. All internal processes (rebalance, shd, quotad etc.) have
a negative value in frame->root->pid (as opposed to a positive value for a
fop request from a mount process). I agree with you that just by looking at
the domain we cannot figure out whether a lock request is from an internal
process or a mount process. But with the help of frame->root->pid, we can.
By choosing to flush locks from the rebalance process (instead of locks from
mount processes), I think we can reduce the scenarios where applications see
errors. Of course we'll see more rebalance failures, but that is a trade-off
we perhaps have to live with. Just a thought :).
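To make the frame->root->pid check concrete, here is a minimal,
self-contained C sketch. This is not the actual locks xlator code: the
structs below are simplified stand-ins for GlusterFS's call_frame_t and
call_stack_t, and is_internal_request() is a hypothetical helper used only
for illustration.

    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/types.h>

    struct call_stack { pid_t pid; };               /* stand-in for call_stack_t */
    struct call_frame { struct call_stack *root; }; /* stand-in for call_frame_t */

    /* Internal daemons (rebalance, shd, quotad, ...) tag their frames with a
     * negative pid; fops coming from a mount carry the application's positive
     * pid. */
    static bool
    is_internal_request(const struct call_frame *frame)
    {
        return frame->root->pid < 0;
    }

    int
    main(void)
    {
        struct call_stack rebalance = { .pid = -3 };    /* some internal daemon */
        struct call_stack app       = { .pid = 12345 }; /* fop from a mount     */
        struct call_frame f1 = { .root = &rebalance };
        struct call_frame f2 = { .root = &app };

        printf("rebalance internal? %d\n", is_internal_request(&f1)); /* 1 */
        printf("mount fop internal? %d\n", is_internal_request(&f2)); /* 0 */
        return 0;
    }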
> To illustrate this, consider the case where a SHD holds a lock to do a
> DHT heal but can't because of GFID split-brain... a user comes along and
> hammers that directory attempting to get a lock... you can pretty much
> kiss your cluster good-bye after that :).
>
> With this in mind, we explicitly choose not to respect system process
> (SHD/rebalance) locks any more than a user lock request, as they can be
> just as likely (if not more so) to cause a system to fall over vs. a user
> (see the example above). Although this might seem unwise at first, I'd
> put forth that having clusters fall over catastrophically pushes far
> worse decisions onto operators, such as re-kicking random bricks or
> entire clusters in desperate attempts at freeing locks (the CLI is often
> unable to free the locks in our experience) or at stopping runaway memory
> consumption due to frames piling up on the bricks. To date, we haven't
> observed even a single instance of data corruption (and we've been
> looking for it!) due to this feature.
>
> We've even used it on clusters that were on the verge of falling over; we
> enabled revocation and the entire system stabilized almost instantly
> (it's really like magic when you see it :) ).
>
> Hope this helps!
>
> Richard
>
> ------------------------------
> *From:* [email protected] [[email protected]] on behalf of
> Raghavendra G [[email protected]]
> *Sent:* Tuesday, January 26, 2016 9:49 PM
> *To:* Raghavendra Gowdappa
> *Cc:* Richard Wareing; Gluster Devel
> *Subject:* Re: [Gluster-devel] Feature: Automagic lock-revocation for
> features/locks xlator (v3.7.x)
>
> On Mon, Jan 25, 2016 at 10:39 AM, Raghavendra Gowdappa
> <[email protected]> wrote:
>
>> ----- Original Message -----
>> > From: "Richard Wareing" <[email protected]>
>> > To: "Pranith Kumar Karampuri" <[email protected]>
>> > Cc: [email protected]
>> > Sent: Monday, January 25, 2016 8:17:11 AM
>> > Subject: Re: [Gluster-devel] Feature: Automagic lock-revocation for
>> > features/locks xlator (v3.7.x)
>> >
>> > Yup, per-domain would be useful; the patch itself currently honors
>> > domains as well, so locks in different domains will not be touched
>> > during revocation.
>> >
>> > In our cases we actually prefer to pull the plug on SHD/DHT domains to
>> > ensure clients do not hang. This is important for DHT self-heals,
>> > which cannot be disabled via any option; we've found that in most
>> > cases, once we reap the lock, another properly behaving client comes
>> > along and completes the DHT heal properly.
>>
>> Flushing waiting locks of DHT can affect application continuity too.
>> Though locks requested by the rebalance process can be flushed to a
>> certain extent without applications noticing any failures, there is no
>> guarantee that locks requested in DHT_LAYOUT_HEAL_DOMAIN and
>> DHT_FILE_MIGRATE_DOMAIN are issued only by the rebalance process.

> I missed this point in my previous mail. Now I remember that we can use
> frame->root->pid (being negative) to identify internal processes. Was
> this the approach you followed to identify locks from the rebalance
> process?

>> These two domains are used for locks that synchronize among and between
>> rebalance process(es) and client(s). So there is an equal probability
>> that these locks are requests from clients, and hence applications can
>> see some file operations failing.
>>
>> In case of pulling the plug on DHT_LAYOUT_HEAL_DOMAIN, dentry operations
>> that depend on the layout can fail. These operations include create,
>> link, unlink, symlink, mknod, mkdir and rename for files/directories
>> within the directory on which the lock request failed.
>>
>> In case of pulling the plug on DHT_FILE_MIGRATE_DOMAIN, rename of
>> immediate subdirectories/files can fail.
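If we go the pid route, the flushing policy itself can stay simple. Below
is a rough, illustrative C sketch (hypothetical types and callback; not
code from the patch or from GlusterFS): given the requests queued behind a
contended lock in one of these DHT domains, flush only the ones carrying a
negative pid, so client requests are left alone.

    #include <stddef.h>
    #include <sys/types.h>

    struct waiter {
        pid_t pid;          /* taken from frame->root->pid                  */
        const char *domain; /* lock domain name (string-compared, as above) */
    };

    /* Flush the waiters that came from internal processes (negative pid) and
     * return how many were flushed; only if none were internal would a policy
     * have to consider failing application requests. */
    size_t
    flush_internal_waiters(struct waiter *waiters, size_t n,
                           void (*flush_one)(struct waiter *))
    {
        size_t flushed = 0;
        for (size_t i = 0; i < n; i++) {
            if (waiters[i].pid < 0) {   /* rebalance, shd, quotad, ... */
                flush_one(&waiters[i]);
                flushed++;
            }
        }
        return flushed;
    }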
>> >
>> > Richard
>> >
>> > Sent from my iPhone
>> >
>> > On Jan 24, 2016, at 6:42 PM, Pranith Kumar Karampuri
>> > <[email protected]> wrote:
>> >
>> > On 01/25/2016 02:17 AM, Richard Wareing wrote:
>> >
>> > Hello all,
>> >
>> > Just gave a talk at SCaLE 14x today and I mentioned our new locks
>> > revocation feature, which has had a significant impact on our GFS
>> > cluster reliability. As such I wanted to share the patch with the
>> > community, so here's the bugzilla report:
>> >
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1301401
>> >
>> > =====
>> > Summary:
>> > Mis-behaving brick clients (gNFSd, FUSE, gfAPI) can cause cluster
>> > instability and eventual complete unavailability due to failures in
>> > releasing entry/inode locks in a timely manner.
>> >
>> > Classic symptoms of this are increased brick (and/or gNFSd) memory
>> > usage due to the high number of (lock request) frames piling up in the
>> > processes. The failure mode results in bricks eventually slowing down
>> > to a crawl due to swapping, or OOMing due to complete memory
>> > exhaustion; during this period the entire cluster can begin to fail.
>> > End-users will experience this as hangs on the filesystem, first in a
>> > specific region of the file-system and ultimately the entire
>> > filesystem, as the offending brick begins to turn into a zombie (i.e.
>> > not quite dead, but not quite alive either).
>> >
>> > Currently, these situations must be handled by an administrator
>> > detecting & intervening via the "clear-locks" CLI command.
>> > Unfortunately this doesn't scale to large numbers of clusters, and it
>> > depends on correct (external) detection of the locks piling up (for
>> > which there is little signal other than state dumps).
>> >
>> > This patch introduces two features to remedy this situation:
>> >
>> > 1. Monkey-unlocking - This is a feature targeted at developers (only!)
>> > to help track down crashes due to stale locks, and to prove the utility
>> > of the lock revocation feature. It does this by silently dropping 1% of
>> > unlock requests, simulating bugs or mis-behaving clients.
>> >
>> > The feature is activated via:
>> > features.locks-monkey-unlocking <on/off>
>> >
>> > You'll see the message
>> > "[<timestamp>] W [inodelk.c:653:pl_inode_setlk] 0-groot-locks: MONKEY
>> > LOCKING (forcing stuck lock)!"
>> > ... in the logs, indicating a request has been dropped.
>> >
>> > 2. Lock revocation - Once enabled, this feature will revoke a
>> > *contended* lock (i.e. if nobody else asks for the lock, we will not
>> > revoke it) either by the amount of time the lock has been held, by how
>> > many other lock requests are waiting on the lock to be freed, or by
>> > some combination of both. Clients which are losing their locks will be
>> > notified by receiving EAGAIN (sent back to their callback function).
>> >
>> > The feature is activated via these options:
>> > features.locks-revocation-secs <integer; 0 to disable>
>> > features.locks-revocation-clear-all [on/off]
>> > features.locks-revocation-max-blocked <integer>
>> >
>> > Recommended settings are: 1800 seconds for a time-based timeout (give
>> > clients the benefit of the doubt). Choosing a max-blocked value
>> > requires some experimentation depending on your workload, but
>> > generally values of hundreds to low thousands work (it's normal for
>> > many tens of locks to be taken out when files are being written @ high
>> > throughput).
>> >
>> > I really like this feature. One question though: self-heal and
>> > rebalance domain locks are active until the self-heal/rebalance is
>> > complete, which can take more than 30 minutes if the files are in TBs.
>> > I will try to see what we can do to handle these without increasing
>> > revocation-secs too much. Maybe we can come up with per-domain
>> > revocation timeouts. Comments are welcome.
>> >
>> > Pranith
>> >
>> > =====
>> >
>> > The patch supplied will apply cleanly to the v3.7.6 release tag, and
>> > probably to any 3.7.x release & master (the posix locks xlator is
>> > rarely touched).
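The revocation rule described above boils down to a small per-lock
predicate: only contended locks are candidates, and they are revoked either
on age or on the number of blocked requests behind them. Here is a rough C
sketch of that idea (illustrative only; the struct and field names are made
up and this is not code from the patch):

    #include <stdbool.h>
    #include <time.h>

    struct held_lock {
        time_t granted_at;     /* when the lock was granted            */
        int    blocked_count;  /* requests currently waiting behind it */
    };

    struct revoke_cfg {
        int revocation_secs;   /* cf. features.locks-revocation-secs (0 disables) */
        int max_blocked;       /* cf. features.locks-revocation-max-blocked       */
    };

    /* Revoke only *contended* locks: ones held too long or with too many
     * waiters queued behind them. */
    bool
    should_revoke(const struct held_lock *lk, const struct revoke_cfg *cfg,
                  time_t now)
    {
        if (lk->blocked_count == 0)
            return false;      /* nobody else wants it, leave it alone */
        if (cfg->revocation_secs > 0 &&
            now - lk->granted_at >= cfg->revocation_secs)
            return true;
        if (cfg->max_blocked > 0 &&                 /* 0 treated as disabled */
            lk->blocked_count > cfg->max_blocked)   /* in this sketch        */
            return true;
        return false;
    }

A holder that loses its lock this way would then see EAGAIN in its
callback, as described above.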
>> >
>> > Richard

> --
> Raghavendra G

--
Raghavendra G
_______________________________________________
Gluster-devel mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-devel
