Hi,

----- Original Message -----
> > Hi Steve,
> >
> > Another possibility is to add a new inode work_queue which does the work
> > of gfs2_clear_inode(). Its only function would be to clear the inode, and
> > therefore, there should not be any competing work, as there is in the case
> > of a glock. The gfs2_clear_inode() function would then queue the work and
> > return to its caller, and therefore it wouldn't block. The work that's been
> > queued might block on DLM, but it shouldn't matter, in theory, since the
> > shrinker will complete, which will free up the fence, which will free up
> > dlm, which will free up everything. What do you think?
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat File Systems
> 
> Well that is a possibility. There is another dependency cycle there
> however, which is the one relating to the inode lifetime, and we need to
> be careful that we don't break that in attempting to fix the memory
> allocation at fence time issue. I know it isn't easy, but we need to be
> very careful to avoid introducing any new races, as they are likely to
> be very tricky to debug in this area.
> 
> One thing we could do, is to be more careful about when we take this
> delayed path. We could, for example, add a check to see if we are
> running in the context of the shrinker, and only delay the eviction in
> that specific case. I was wondering whether we could move the decision
> further up the chain too, and avoid evicting inodes with zero link count
> under memory pressure in the first place. The issue of needing to
> allocate memory to evict inodes is likely to be something that might
> affect other filesystems too, so it might be useful as a generic feature.
> 
> I'd prefer to avoid adding a new inode workqueue; however, there is no
> reason why we could not add a new feature to an existing workqueue by
> adding a new inode or glock flag as appropriate (which will not expand
> the size of the inode) to deal with this case.
> 
> Steve.
> 
> 

I tried adding a work queue for the inodes, but it was a disaster and led
to other hangs.

Then I decided it made more sense to just add a new delayed work function
for the final put of the inode glock. That also led to problems, because
glocks often outlive their inodes and may be reused for a different inode
with the same block address. Under enough pressure, the delayed work
became backlogged, and the second final put couldn't be queued because a
previous queuing (with delay==0) was still pending.
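
To make the failure mode concrete (this helper is made up and not part of
the patch below), the backlog boils down to queue_delayed_work() refusing
to queue a work item that is already pending:

/*
 * Sketch only, not part of the patch below: if a previous queuing of
 * gl_work (with delay==0) is still backlogged, queue_delayed_work()
 * returns false, and the second final put for the reused glock has
 * nowhere to go.
 */
static void try_queue_final_put(struct gfs2_glock *gl)
{
	if (!queue_delayed_work(glock_workqueue, &gl->gl_work, 0))
		printk(KERN_WARNING "GFS2: glock work already queued; "
		       "final put could not be deferred\n");
}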

In the end, I decided that you're right: the safest approach is to
isolate this to the specific problem I'm trying to solve. We do our
business the same way as today, but at the final close we check whether
we were called from the fenced process, and if so, we queue the final
put rather than doing it inline. That should mitigate the problem.

Unfortunately, it's not good enough to take this deferred path only when
the link count is zero: closing any file at any time could recreate the
problem, provided there was a pending fence action.

I'd also prefer it if there was a way we could tell whether we were
called from the slab shrinker, but unfortunately, there's no good way to
know without wasting a lot of time walking the call stack and such.
I did a fair amount of investigation here: there are a couple of bits in
the task flags (current->flags) that I was hoping to leverage. PF_FSTRANS
is used by NFS to detect situations where it mustn't do memory
allocations, but I don't think we can use it.
There's also PF_MEMALLOC, but that's set for the global shrinker (direct
reclaim), not the slab shrinker, so I don't think we can use that either.
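
For reference, the check I was hoping to make would have looked something
like this sketch (the helper name is made up); the comments note why
neither flag works for us:

#include <linux/sched.h>

/*
 * Hypothetical helper, not part of the patch below. PF_FSTRANS is what
 * NFS tests to avoid allocations that could recurse; PF_MEMALLOC is set
 * for direct reclaim. Neither flag is set when we're called from the
 * slab shrinker, so this doesn't actually solve our problem.
 */
static bool gfs2_in_reclaim_context(void)
{
	return (current->flags & (PF_FSTRANS | PF_MEMALLOC)) != 0;
}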

I investigated how ocfs2 handles this situation: turns out, it doesn't,
so they probably have the same vulnerability.

In doing all this investigation, I learned of another similar possible
problem discovered in ocfs2: if the shrinker tries to ditch an inode from
memory, the file system calls into DLM, but DLM needs to do some
communication work, and that comm code needs to allocate memory for its
buffers. In that case, using task flags (as above) was discouraged and
NACKed in favor of using a different set of memory allocation flags, such
as GFP_NOFS. I never did determine exactly how this ended up.
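
For what it's worth, the alternative suggested there amounts to something
like the following sketch (the function and size names are made up):
allocate the dlm comm buffers with GFP_NOFS so the allocation can't
recurse back into filesystem reclaim:

#include <linux/slab.h>

/*
 * Sketch only, not part of the patch below. "comm_buf_size" is a
 * made-up name; the point is that GFP_NOFS prevents the allocation
 * from re-entering the filesystem and deadlocking against us.
 */
static void *alloc_comm_buffer(size_t comm_buf_size)
{
	return kmalloc(comm_buf_size, GFP_NOFS);
}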

In the meantime, I prototyped the following patch, and it works, but
it's a hack, and I'm not too happy with it. So I'm going to see if I can
find a solution that would work for both GFS2 and ocfs2.

Regards,

Bob Peterson
Red Hat File Systems
---
commit 3142a311fa848d44b15e9f86fc3599b7d9b8b3f5
Author: Bob Peterson <rpete...@redhat.com>
Date:   Mon Nov 30 12:04:09 2015 -0600

GFS2: Make gfs2_clear_inode() not block on final glock put

This patch changes function gfs2_clear_inode() so that instead
of calling gfs2_glock_put, it calls a new gfs2_glock_put_noblock
function that avoids a possible deadlock that would occur should
it call dlm during a fence operation: dlm waits for a fence
operation, the fence operation waits for memory, the shrinker
waits for gfs2 to free an inode from memory, but gfs2 waits for
dlm.

This solution is a bit of a hack: if clear_inode is called from
the fenced process, instead of calling gfs2_glock_put, we queue
the glock state machine, which calls gfs2_glock_put from a
different context.

I've tried many other solutions to no avail. Consider:

1. If we always queue the final glock_put, there's no guarantee
   that the queue_delayed_work is successful, because there could
   already be work queued for that glock.
2. If the queue_delayed_work fails, reacting with a call to
   gfs2_glock_put (however unlikely) might still be the final
   put, which means we could still hang.
3. If we loop around and try again, the queued delayed work that
   prevented us from queueing might run, causing the glock to
   be freed, which would result in a use-after-free problem.
4. If we just forget it, our glock reference count would get off
   and we wouldn't be able to free the glocks from slab when the
   module is unloaded.
5. We can't really use current->flags & PF_FSTRANS like NFS,
   as far as I can tell.
6. We can't really use current->flags & PF_MEMALLOC either,
   because it's set for the global shrinker (direct reclaim) but
   not for the slab shrinker.
7. We can't easily tell if the inode shrinker is in our call
   stack, but it's easy to tell if we were called from fenced,
   which is the process most likely to block dlm.

Signed-off-by: Bob Peterson <rpete...@redhat.com>

rhbz#1255872

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index b5e9d2e..36bea46 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -225,6 +225,29 @@ void gfs2_glock_put(struct gfs2_glock *gl)
 }
 
 /**
+ * gfs2_glock_put_noblock() - Decrement reference count on glock
+ * @gl: The glock to put
+ *
+ * This is the same as gfs2_glock_put() but it's not allowed to block.
+ * We're simply trying to avoid a live-lock case in which the fenced process
+ * calls the inode shrinker, which calls the gfs2_clear_inode, which calls
+ * gfs2_glock_put, which calls dlm, which blocks on a fence operation.
+ */
+
+void gfs2_glock_put_noblock(struct gfs2_glock *gl)
+{
+       if (strcmp(current->comm, "fenced") == 0) {
+               if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) != 0)
+                       return;
+
+               printk(KERN_WARNING "GFS2: Unable to queue final glock "
+                      "put for inode.\n");
+               __dump_glock(NULL, gl);
+       }
+       gfs2_glock_put(gl);
+}
+
+/**
  * search_bucket() - Find struct gfs2_glock by lock number
  * @bucket: the bucket to search
  * @name: The lock name
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 7e22f20..d9f8073 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -188,6 +188,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp,
 void gfs2_glock_hold(struct gfs2_glock *gl);
 void gfs2_glock_put_nolock(struct gfs2_glock *gl);
 void gfs2_glock_put(struct gfs2_glock *gl);
+extern void gfs2_glock_put_noblock(struct gfs2_glock *gl);
 void gfs2_holder_init(struct gfs2_glock *gl, unsigned int state, u16 flags,
                      struct gfs2_holder *gh);
 void gfs2_holder_reinit(unsigned int state, u16 flags, struct gfs2_holder *gh);
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 6c8637a..0d5e721 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1278,7 +1278,7 @@ static void gfs2_clear_inode(struct inode *inode)
        ip->i_gl->gl_object = NULL;
        flush_delayed_work(&ip->i_gl->gl_work);
        gfs2_glock_add_to_lru(ip->i_gl);
-       gfs2_glock_put(ip->i_gl);
+       gfs2_glock_put_noblock(ip->i_gl);
        ip->i_gl = NULL;
        if (ip->i_iopen_gh.gh_gl) {
                ip->i_iopen_gh.gh_gl->gl_object = NULL;
