Re: [Cluster-devel] [PATCH] gfs2: Revert "Fix loop in gfs2_rbm_find"

2019-01-31 Thread Linus Torvalds
On Thu, Jan 31, 2019 at 11:12 AM Andreas Gruenbacher
 wrote:
>
> That patch just hasn't seen enough testing to make me comfortable
> sending it yet.

Ok, I'll just apply the revert.

Thanks,

  Linus



Re: [Cluster-devel] [PATCH] gfs2: Revert "Fix loop in gfs2_rbm_find"

2019-01-31 Thread Andreas Gruenbacher
On Thu, 31 Jan 2019 at 19:41, Linus Torvalds
 wrote:
> On Wed, Jan 30, 2019 at 12:30 PM Andreas Gruenbacher  
> wrote:
> >
> > This reverts commit 2d29f6b96d8f80322ed2dd895bca590491c38d34.
> >
> > It turns out that the fix can lead to a ~20 percent performance regression
> > in initial writes to the page cache according to iozone.  Let's revert this
> > for now to have more time for a proper fix.
>
> Ugh. I think part of the problem is that the
>
> n += (rbm->bii - initial_bii);
>
> doesn't make any sense when we just set rbm->bii to 0 a couple of
> lines earlier. So it's basically a really odd way to write
>
> n -= initial_bii;
>
> which in turn really doesn't make any sense _either_.

Right, that code clearly doesn't make sense. The only positive about
it is that it hasn't caused any obvious issues so far.

> So I'm all for reverting the commit because it caused a performance
> regression, but the end result really is all kinds of odd.
>
> Is the bug as simple as "we incremented the iterator counter twice"?
> Because in the -E2BIG case, we first increment it by the difference in
> bii, but then we *also* increment it in res_covered_end_of_rgrp()
> (which we do *not* do for the "ret > 0" case, which goes to
> next_iter).

In the "ret > 0" case, rbm->bii should have already been incremented
in gfs2_reservation_check_and_update.

> So if somebody can run the performance test again, how does it look if
> *instead* of the revert, we do what looks at least _slightly_ more
> sensible, and move the "increment iteration count" up to the
> next-bitmap case instead?
>
> At that point, it will actually match the bii updates (because
> next_bitmap also updates rbm->bii). Hmm?
>
> Something like the whitespace-damaged thing below. Completely
> untested. But *if* this works, it would make more sense than the
> revert..
>
> Hmm?

My idea was to fix that whole loop properly as in this patch instead:

  https://www.redhat.com/archives/cluster-devel/2019-January/msg00111.html

That patch just hasn't seen enough testing to make me comfortable
sending it yet. We're testing it right now, and my plan was to get it
into the next merge window. I don't think an intermediate workaround
would make much sense. Does that sound acceptable? Would you rather
have that fix sent to you sometime next week instead?

Thanks a lot,
Andreas



Re: [Cluster-devel] [PATCH] gfs2: Revert "Fix loop in gfs2_rbm_find"

2019-01-31 Thread Linus Torvalds
On Wed, Jan 30, 2019 at 12:30 PM Andreas Gruenbacher
 wrote:
>
> This reverts commit 2d29f6b96d8f80322ed2dd895bca590491c38d34.
>
> It turns out that the fix can lead to a ~20 percent performance regression
> in initial writes to the page cache according to iozone.  Let's revert this
> for now to have more time for a proper fix.

Ugh. I think part of the problem is that the

n += (rbm->bii - initial_bii);

doesn't make any sense when we just set rbm->bii to 0 a couple of
lines earlier. So it's basically a really odd way to write

n -= initial_bii;

which in turn really doesn't make any sense _either_.

So I'm all for reverting the commit because it caused a performance
regression, but the end result really is all kinds of odd.

Is the bug as simple as "we incremented the iterator counter twice"?
Because in the -E2BIG case, we first increment it by the difference in
bii, but then we *also* increment it in res_covered_end_of_rgrp()
(which we do *not* do for the "ret > 0" case, which goes to
next_iter).

So if somebody can run the performance test again, how does it look if
*instead* of the revert, we do what looks at least _slightly_ more
sensible, and move the "increment iteration count" up to the
next-bitmap case instead?

At that point, it will actually match the bii updates (because
next_bitmap also updates rbm->bii). Hmm?

Something like the whitespace-damaged thing below. Completely
untested. But *if* this works, it would make more sense than the
revert..

Hmm?

   Linus

--- snip snip ---
  diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
  index 831d7cb5a49c..5b1006d5344f 100644
  --- a/fs/gfs2/rgrp.c
  +++ b/fs/gfs2/rgrp.c
  @@ -1796,10 +1796,10 @@ static int gfs2_rbm_find(struct gfs2_rbm
*rbm, u8 state, u32 *minext,
rbm->bii++;
if (rbm->bii == rbm->rgd->rd_length)
rbm->bii = 0;
  + n++;
   res_covered_end_of_rgrp:
if ((rbm->bii == 0) && nowrap)
break;
  - n++;
   next_iter:
if (n >= iters)
break;



Re: [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative

2019-01-31 Thread Andreas Gruenbacher
Ross,

On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall  wrote:
>
> Under certain conditions, lru_count may drop below zero resulting in
> a large amount of log spam like this:
>
> vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
> negative objects to delete nr=-1
>
> This happens as follows:
> 1) A glock is moved from lru_list to the dispose list and lru_count is
>decremented.
> 2) The dispose function calls cond_resched() and drops the lru lock.
> 3) Another thread takes the lru lock and tries to add the same glock to
>lru_list, checking if the glock is on an lru list.
> 4) It is on a list (actually the dispose list) and so it avoids
>incrementing lru_count.
> 5) The glock is moved to lru_list.
> 6) The original thread doesn't dispose it because it has been re-added
>to the lru list but the lru_count has still decreased by one.
>
> Fix by checking if the LRU flag is set on the glock rather than checking
> if the glock is on some list and rearrange the code so that the LRU flag
> is added/removed precisely when the glock is added/removed from lru_list.

this analysis and the patch both seem to be spot on, thanks a lot.

I've got one minor nit, but that's not related to your changes (see below).

> Signed-off-by: Ross Lagerwall 
> ---
>  fs/gfs2/glock.c | 16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index b92740edc416..53e6c7e0c1b3 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -185,13 +185,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
>  {

For symmetry and consistency, I think it would make sense to move the
(glops->go_flags & GLOF_LRU) check from gfs2_glock_dq here as well.
That function is also called for inode glocks from gfs2_evict_inode,
but those always have GLOF_LRU set.
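
For illustration, roughly like this (an untested sketch on top of your patch, with the GLOF_LRU check lifted from gfs2_glock_dq; gl_ops/go_flags as they exist today):

	void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
	{
		const struct gfs2_glock_operations *glops = gl->gl_ops;

		/* Skip glocks that don't participate in the LRU; this check
		 * currently lives in the caller, gfs2_glock_dq. */
		if (!(glops->go_flags & GLOF_LRU))
			return;

		spin_lock(&lru_lock);

		list_del(&gl->gl_lru);
		list_add_tail(&gl->gl_lru, &lru_list);

		if (!test_bit(GLF_LRU, &gl->gl_flags)) {
			set_bit(GLF_LRU, &gl->gl_flags);
			atomic_inc(&lru_count);
		}

		spin_unlock(&lru_lock);
	}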

> spin_lock(&lru_lock);
>
> -   if (!list_empty(&gl->gl_lru))
> -   list_del_init(&gl->gl_lru);
> -   else
> +   list_del(&gl->gl_lru);
> +   list_add_tail(&gl->gl_lru, &lru_list);
> +
> +   if (!test_bit(GLF_LRU, &gl->gl_flags)) {
> +   set_bit(GLF_LRU, &gl->gl_flags);
> atomic_inc(&lru_count);
> +   }
>
> -   list_add_tail(&gl->gl_lru, &lru_list);
> -   set_bit(GLF_LRU, &gl->gl_flags);
> spin_unlock(&lru_lock);
>  }
>
> @@ -201,7 +202,7 @@ static void gfs2_glock_remove_from_lru(struct gfs2_glock 
> *gl)
> return;
>
> spin_lock(&lru_lock);
> -   if (!list_empty(&gl->gl_lru)) {
> +   if (test_bit(GLF_LRU, &gl->gl_flags)) {
> list_del_init(&gl->gl_lru);
> atomic_dec(&lru_count);
> clear_bit(GLF_LRU, &gl->gl_flags);
> @@ -1456,6 +1457,7 @@ __acquires(&lru_lock)
> if (!spin_trylock(&gl->gl_lockref.lock)) {
>  add_back_to_lru:
> list_add(&gl->gl_lru, &lru_list);
> +   set_bit(GLF_LRU, &gl->gl_flags);
> atomic_inc(&lru_count);
> continue;
> }
> @@ -1463,7 +1465,6 @@ __acquires(&lru_lock)
> spin_unlock(&gl->gl_lockref.lock);
> goto add_back_to_lru;
> }
> -   clear_bit(GLF_LRU, &gl->gl_flags);
> gl->gl_lockref.count++;
> if (demote_ok(gl))
> handle_callback(gl, LM_ST_UNLOCKED, 0, false);
> @@ -1498,6 +1499,7 @@ static long gfs2_scan_glock_lru(int nr)
> if (!test_bit(GLF_LOCK, &gl->gl_flags)) {
> list_move(&gl->gl_lru, &dispose);
> atomic_dec(&lru_count);
> +   clear_bit(GLF_LRU, &gl->gl_flags);
> freed++;
> continue;
> }
> --
> 2.17.2
>

Thanks,
Andreas



Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free

2019-01-31 Thread Andreas Gruenbacher
Hi Ross,

On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall  wrote:
> Each gfs2_bufdata stores a reference to a glock but the reference count
> isn't incremented. This causes an occasional use-after-free of the
> glock. Fix by taking a reference on the glock during allocation and
> dropping it when freeing.
>
> Found by KASAN:
>
> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
> Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371
>
> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
> Workqueue: glock_workqueue glock_work_func [gfs2]
> Call Trace:
>  dump_stack+0x71/0xab
>  print_address_description+0x6a/0x270
>  kasan_report+0x258/0x380
>  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  gfs2_log_flush+0x511/0xa70 [gfs2]
>  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>  ? __brelse+0x48/0x50
>  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>  ? gfs2_trans_end+0x18d/0x340 [gfs2]
>  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>  ? inode_go_dump+0xe0/0xe0 [gfs2]
>  ? inode_go_sync+0xe4/0x220 [gfs2]
>  inode_go_sync+0xe4/0x220 [gfs2]
>  do_xmote+0x12b/0x290 [gfs2]
>  glock_work_func+0x6f/0x160 [gfs2]
>  process_one_work+0x461/0x790
>  worker_thread+0x69/0x6b0
>  ? process_one_work+0x790/0x790
>  kthread+0x1ae/0x1d0
>  ? kthread_create_worker_on_cpu+0xc0/0xc0
>  ret_from_fork+0x22/0x40

thanks for tracking this down, very interesting.

The consistency model here is that every buffer head that a struct
gfs2_bufdata object is attached to is protected by a glock. Before a
glock can be released, all the buffers under that glock have to be
flushed out and released; this is what allows another node to access
the same on-disk location without causing inconsistencies. When there
is a bufdata object that points to a glock that has already been
freed, this consistency model is broken. Taking an additional refcount
as this patch does may make the use-after-free go away, but it doesn't
fix the underlying problem. So I think we'll need a different fix
here.

Did you observe this problem in a real-world scenario, or with KASAN
only? It might be that we're looking at a small race that is unlikely
to trigger in the field. In any case, I think we need to understand
better what's actually going on.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative

2019-01-31 Thread Ross Lagerwall

On 1/31/19 2:36 PM, Bob Peterson wrote:

Hi Ross,

Comments below. Sorry if this is a bit incoherent; it's early and I'm
not properly caffeinated yet.

- Original Message -

Under certain conditions, lru_count may drop below zero resulting in
a large amount of log spam like this:

vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
 negative objects to delete nr=-1

This happens as follows:
1) A glock is moved from lru_list to the dispose list and lru_count is
decremented.
2) The dispose function calls cond_resched() and drops the lru lock.
3) Another thread takes the lru lock and tries to add the same glock to
lru_list, checking if the glock is on an lru list.
4) It is on a list (actually the dispose list) and so it avoids
incrementing lru_count.
5) The glock is moved to lru_list.
6) The original thread doesn't dispose it because it has been re-added
to the lru list but the lru_count has still decreased by one.

Fix by checking if the LRU flag is set on the glock rather than checking
if the glock is on some list and rearrange the code so that the LRU flag
is added/removed precisely when the glock is added/removed from lru_list.

Signed-off-by: Ross Lagerwall 
---
  fs/gfs2/glock.c | 16 +---
  1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index b92740edc416..53e6c7e0c1b3 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -185,13 +185,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
  {
	spin_lock(&lru_lock);

-	if (!list_empty(&gl->gl_lru))
-		list_del_init(&gl->gl_lru);
-	else
+	list_del(&gl->gl_lru);
+	list_add_tail(&gl->gl_lru, &lru_list);


This looks like a bug, and I like your idea of using the GLF_LRU bit
to determine whether or not to do the manipulation, but I have some
concerns. First, does it work with kernel list debugging turned on?

To me it looks like the list_del (as opposed to list_del_init) above
will set entry->next and prev to LIST_POISON values, then the
list_add_tail() calls __list_add() which checks:
if (!__list_add_valid(new, prev, next))
return;
Without list debugging, __list_add_valid() always returns true, but with
list debugging it checks for circular values of list->prev and list->next
which, since they're LIST_POISON, ought to fail.
So it seems like the original list_del_init is correct.


No, __list_add_valid() checks if prev & next correctly link to each 
other and checks that new is not the same as prev or next. It doesn't 
check anything to do with new's pointers. I've tested with DEBUG_LIST 
and it appears to work.
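
For reference, the list.h / lib/list_debug.c behavior we're relying on boils down to this (paraphrased sketch, not verbatim kernel code):

	/* list_del() poisons the removed entry: */
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;

	/* list_add_tail(new, head) expands to __list_add(new, head->prev, head).
	 * With CONFIG_DEBUG_LIST, __list_add_valid(new, prev, next) only checks
	 * prev->next == next, next->prev == prev, and new != prev && new != next.
	 * It never dereferences new->next or new->prev, so re-adding a
	 * just-poisoned entry passes the check. */
	if (!__list_add_valid(new, head->prev, head))
		return;

So the list_del() + list_add_tail() pair in the patch is fine even with list debugging enabled.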




The intent was: if the glock is already on the lru, take it off
before re-adding it, and the count ought to be okay, because if it's
on the LRU list, it's already been incremented. So taking it off and
adding it back on is a net 0 on the count. But that's only
true if the GLF_LRU bit is set. If it's on a different list (the
dispose list), as you noted, it still needs to be incremented.

If the glock is on the dispose_list, rather than the lru list, we
want to take it off the dispose list and move it to the lru_list,
but in that case, we need to increment the lru count, and not
poison the list_head.

So to me it seems like we should keep the list_del_init, and only
do it if the list isn't empty, but trigger off the GLF_LRU flag
for managing the count. The lru_lock ought to prevent races.


I think I understand the intent, but IMO this patch makes the logic 
clearer than relying on both the LRU bit and the state of the list to 
determine what to do. Thoughts?


Thanks,
--
Ross Lagerwall



Re: [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative

2019-01-31 Thread Bob Peterson
- Original Message -
> > +
> > +   if (!test_bit(GLF_LRU, &gl->gl_flags)) {
> > +   set_bit(GLF_LRU, &gl->gl_flags);
> > atomic_inc(&lru_count);
> > +   }
> 
> The above may be simplified to something like:
> + if (!test_and_set_bit(GLF_LRU, &gl->gl_flags))
>   atomic_inc(&lru_count);

Scratch that. Andreas says test_and_set_bit() and similar are much more
expensive cpu-wise, and we're already protected from races by the
lru_lock, so I guess the original is better after all.

Bob Peterson



[Cluster-devel] [PATCH] gfs2: Fix loop in gfs2_rbm_find (2)

2019-01-31 Thread Andreas Gruenbacher
This is an updated version of the patch previously posted on January 24.

Changes since then:

 - The previous version didn't take filesystems with a single bitmap
   block per filesystem into account, and could still loop endlessly
   in that case.

 - When starting a scan at the beginning of a bitmap block and we wrap
   around, there is no need to rescan that bitmap block again.  This
   revised version takes that into account as well.

 - At the end of gfs2_rbm_find, we update rd_extfail_pt when the scan
   has started at the beginning of the resource group.  However, we
   should update rd_extfail_pt whenever we have scanned the entire
   resource group, including when we have wrapped around.

--

The fix from commit 2d29f6b96d8f introduced an unexpected performance
regression when allocating blocks.  Fix by rewriting the overly complicated
wrap-around logic in gfs2_rbm_find.  Discovered and verified with iozone.

Fixes: 2d29f6b96d8f ("gfs2: Fix loop in gfs2_rbm_find")
Cc: sta...@vger.kernel.org # v3.13+
Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/rgrp.c | 54 +++---
 1 file changed, 25 insertions(+), 29 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 831d7cb5a49c4..52a4f340a8672 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1729,25 +1729,22 @@ static int gfs2_reservation_check_and_update(struct 
gfs2_rbm *rbm,
 static int gfs2_rbm_find(struct gfs2_rbm *rbm, u8 state, u32 *minext,
 const struct gfs2_inode *ip, bool nowrap)
 {
+   bool scan_from_start = rbm->bii == 0 && rbm->offset == 0;
struct buffer_head *bh;
-   int initial_bii;
-   u32 initial_offset;
-   int first_bii = rbm->bii;
-   u32 first_offset = rbm->offset;
+   int last_bii;
u32 offset;
u8 *buffer;
-   int n = 0;
-   int iters = rbm->rgd->rd_length;
+   bool wrapped = false;
int ret;
struct gfs2_bitmap *bi;
struct gfs2_extent maxext = { .rbm.rgd = rbm->rgd, };
 
-   /* If we are not starting at the beginning of a bitmap, then we
-* need to add one to the bitmap count to ensure that we search
-* the starting bitmap twice.
+   /*
+* Determine the last bitmap to search.  If we're not starting at the
+* beginning of a bitmap, we need to search that bitmap twice to scan
+* the entire resource group.
 */
-   if (rbm->offset != 0)
-   iters++;
+   last_bii = rbm->bii - (rbm->offset == 0);
 
while(1) {
bi = rbm_bi(rbm);
@@ -1761,47 +1758,46 @@ static int gfs2_rbm_find(struct gfs2_rbm *rbm, u8 
state, u32 *minext,
WARN_ON(!buffer_uptodate(bh));
if (state != GFS2_BLKST_UNLINKED && bi->bi_clone)
buffer = bi->bi_clone + bi->bi_offset;
-   initial_offset = rbm->offset;
offset = gfs2_bitfit(buffer, bi->bi_bytes, rbm->offset, state);
-   if (offset == BFITNOENT)
-   goto bitmap_full;
+   if (offset == BFITNOENT) {
+   if (state == GFS2_BLKST_FREE && rbm->offset == 0)
+   set_bit(GBF_FULL, &bi->bi_flags);
+   goto next_bitmap;
+   }
rbm->offset = offset;
if (ip == NULL)
return 0;
 
-   initial_bii = rbm->bii;
ret = gfs2_reservation_check_and_update(rbm, ip,
minext ? *minext : 0,
&maxext);
if (ret == 0)
return 0;
-   if (ret > 0) {
-   n += (rbm->bii - initial_bii);
+   if (ret > 0)
goto next_iter;
-   }
if (ret == -E2BIG) {
-   n += rbm->bii - initial_bii;
rbm->bii = 0;
rbm->offset = 0;
goto res_covered_end_of_rgrp;
}
return ret;
 
-bitmap_full:   /* Mark bitmap as full and fall through */
-   if ((state == GFS2_BLKST_FREE) && initial_offset == 0)
-   set_bit(GBF_FULL, &bi->bi_flags);
-
 next_bitmap:   /* Find next bitmap in the rgrp */
rbm->offset = 0;
rbm->bii++;
if (rbm->bii == rbm->rgd->rd_length)
rbm->bii = 0;
 res_covered_end_of_rgrp:
-   if ((rbm->bii == 0) && nowrap)
-   break;
-   n++;
+   if (rbm->bii == 0) {
+   if (wrapped)
+   break;
+   wrapped = true;
+   if (nowrap)
+   break;
+   }
 next_iter:
-   if (n >= iters)
+   /* 

Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free

2019-01-31 Thread Bob Peterson
- Original Message -
> Each gfs2_bufdata stores a reference to a glock but the reference count
> isn't incremented. This causes an occasional use-after-free of the
> glock. Fix by taking a reference on the glock during allocation and
> dropping it when freeing.
> 
> Found by KASAN:
> 
> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
> Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371
> 
> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
> Workqueue: glock_workqueue glock_work_func [gfs2]
> Call Trace:
>  dump_stack+0x71/0xab
>  print_address_description+0x6a/0x270
>  kasan_report+0x258/0x380
>  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  gfs2_log_flush+0x511/0xa70 [gfs2]
>  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>  ? __brelse+0x48/0x50
>  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>  ? gfs2_trans_end+0x18d/0x340 [gfs2]
>  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>  ? inode_go_dump+0xe0/0xe0 [gfs2]
>  ? inode_go_sync+0xe4/0x220 [gfs2]
>  inode_go_sync+0xe4/0x220 [gfs2]
>  do_xmote+0x12b/0x290 [gfs2]
>  glock_work_func+0x6f/0x160 [gfs2]
>  process_one_work+0x461/0x790
>  worker_thread+0x69/0x6b0
>  ? process_one_work+0x790/0x790
>  kthread+0x1ae/0x1d0
>  ? kthread_create_worker_on_cpu+0xc0/0xc0
>  ret_from_fork+0x22/0x40
> 
> Allocated by task 20805:
>  kasan_kmalloc+0xa0/0xd0
>  kmem_cache_alloc+0xb5/0x1b0
>  gfs2_glock_get+0x14b/0x620 [gfs2]
>  gfs2_inode_lookup+0x20c/0x640 [gfs2]
>  gfs2_dir_search+0x150/0x180 [gfs2]
>  gfs2_lookupi+0x272/0x360 [gfs2]
>  __gfs2_lookup+0x8b/0x1d0 [gfs2]
>  gfs2_atomic_open+0x77/0x100 [gfs2]
>  path_openat+0x1454/0x1c10
>  do_filp_open+0x124/0x1d0
>  do_sys_open+0x213/0x2c0
>  do_syscall_64+0x69/0x160
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Freed by task 0:
>  __kasan_slab_free+0x130/0x180
>  kmem_cache_free+0x78/0x1e0
>  rcu_process_callbacks+0x2ad/0x6c0
>  __do_softirq+0x111/0x38c
> 
> The buggy address belongs to the object at 88801aff6040
>  which belongs to the cache gfs2_glock(aspace) of size 560
> The buggy address is located 244 bytes inside of
>  560-byte region [88801aff6040, 88801aff6270)
> ...
> 
> Signed-off-by: Ross Lagerwall 
> ---

This makes sense to me.

Regards,

Bob Peterson



Re: [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative

2019-01-31 Thread Bob Peterson
Hi Ross,

Comments below. Sorry if this is a bit incoherent; it's early and I'm
not properly caffeinated yet.

- Original Message -
> Under certain conditions, lru_count may drop below zero resulting in
> a large amount of log spam like this:
> 
> vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
> negative objects to delete nr=-1
> 
> This happens as follows:
> 1) A glock is moved from lru_list to the dispose list and lru_count is
>decremented.
> 2) The dispose function calls cond_resched() and drops the lru lock.
> 3) Another thread takes the lru lock and tries to add the same glock to
>lru_list, checking if the glock is on an lru list.
> 4) It is on a list (actually the dispose list) and so it avoids
>incrementing lru_count.
> 5) The glock is moved to lru_list.
> 6) The original thread doesn't dispose it because it has been re-added
>to the lru list but the lru_count has still decreased by one.
> 
> Fix by checking if the LRU flag is set on the glock rather than checking
> if the glock is on some list and rearrange the code so that the LRU flag
> is added/removed precisely when the glock is added/removed from lru_list.
> 
> Signed-off-by: Ross Lagerwall 
> ---
>  fs/gfs2/glock.c | 16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index b92740edc416..53e6c7e0c1b3 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -185,13 +185,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
>  {
>   spin_lock(&lru_lock);
>  
> - if (!list_empty(&gl->gl_lru))
> - list_del_init(&gl->gl_lru);
> - else
> + list_del(&gl->gl_lru);
> + list_add_tail(&gl->gl_lru, &lru_list);

This looks like a bug, and I like your idea of using the GLF_LRU bit
to determine whether or not to do the manipulation, but I have some
concerns. First, does it work with kernel list debugging turned on?

To me it looks like the list_del (as opposed to list_del_init) above
will set entry->next and prev to LIST_POISON values, then the
list_add_tail() calls __list_add() which checks:
if (!__list_add_valid(new, prev, next))
return;
Without list debugging, __list_add_valid() always returns true, but with
list debugging it checks for circular values of list->prev and list->next
which, since they're LIST_POISON, ought to fail.
So it seems like the original list_del_init is correct.

The intent was: if the glock is already on the lru, take it off
before re-adding it, and the count ought to be okay, because if it's
on the LRU list, it's already been incremented. So taking it off and
adding it back on is a net 0 on the count. But that's only
true if the GLF_LRU bit is set. If it's on a different list (the
dispose list), as you noted, it still needs to be incremented.

If the glock is on the dispose_list, rather than the lru list, we
want to take it off the dispose list and move it to the lru_list,
but in that case, we need to increment the lru count, and not
poison the list_head.

So to me it seems like we should keep the list_del_init, and only
do it if the list isn't empty, but trigger off the GLF_LRU flag
for managing the count. The lru_lock ought to prevent races.
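
Something along these lines is what I have in mind (untested sketch of that alternative, not the patch as posted):

	void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
	{
		spin_lock(&lru_lock);

		/* Keep list_del_init() so the entry is never left poisoned... */
		if (!list_empty(&gl->gl_lru))
			list_del_init(&gl->gl_lru);

		/* ...but key the counter off GLF_LRU, not list membership. */
		if (!test_bit(GLF_LRU, &gl->gl_flags)) {
			set_bit(GLF_LRU, &gl->gl_flags);
			atomic_inc(&lru_count);
		}

		list_add_tail(&gl->gl_lru, &lru_list);
		spin_unlock(&lru_lock);
	}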

> +
> + if (!test_bit(GLF_LRU, &gl->gl_flags)) {
> + set_bit(GLF_LRU, &gl->gl_flags);
>   atomic_inc(&lru_count);
> + }

The above may be simplified to something like:
+   if (!test_and_set_bit(GLF_LRU, &gl->gl_flags))
        atomic_inc(&lru_count);

>  
> - list_add_tail(&gl->gl_lru, &lru_list);
> - set_bit(GLF_LRU, &gl->gl_flags);
>   spin_unlock(&lru_lock);
>  }
>  
> @@ -201,7 +202,7 @@ static void gfs2_glock_remove_from_lru(struct gfs2_glock
> *gl)
>   return;
>  
>   spin_lock(&lru_lock);
> - if (!list_empty(&gl->gl_lru)) {
> + if (test_bit(GLF_LRU, &gl->gl_flags)) {
>   list_del_init(&gl->gl_lru);
>   atomic_dec(&lru_count);
>   clear_bit(GLF_LRU, &gl->gl_flags);

Here again, we could simplify with test_and_clear_bit above.

> @@ -1456,6 +1457,7 @@ __acquires(&lru_lock)
>   if (!spin_trylock(&gl->gl_lockref.lock)) {
>  add_back_to_lru:
>   list_add(&gl->gl_lru, &lru_list);
> + set_bit(GLF_LRU, &gl->gl_flags);
>   atomic_inc(&lru_count);
>   continue;
>   }
> @@ -1463,7 +1465,6 @@ __acquires(&lru_lock)
>   spin_unlock(&gl->gl_lockref.lock);
>   goto add_back_to_lru;
>   }
> - clear_bit(GLF_LRU, &gl->gl_flags);
>   gl->gl_lockref.count++;
>   if (demote_ok(gl))
>   handle_callback(gl, LM_ST_UNLOCKED, 0, false);
> @@ -1498,6 +1499,7 @@ static long gfs2_scan_glock_lru(int nr)
>   if (!test_bit(GLF_LOCK, &gl->gl_flags)) {
>   list_move(&gl->gl_lru, &dispose);
>   atomic_dec(&lru_count);
> + clear_bit(GLF_LRU, &gl->gl_flags);
>   freed++;
> 

Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free

2019-01-31 Thread Steven Whitehouse

Hi,

On 31/01/2019 10:55, Ross Lagerwall wrote:

Each gfs2_bufdata stores a reference to a glock but the reference count
isn't incremented. This causes an occasional use-after-free of the
glock. Fix by taking a reference on the glock during allocation and
dropping it when freeing.


Another good bit of debugging. It would be nice if we could (longer term)
avoid using the ref count, though, since that will have some overhead,
but for the time being correctness is the important thing,


Steve.



Found by KASAN:

BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371

CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
Workqueue: glock_workqueue glock_work_func [gfs2]
Call Trace:
  dump_stack+0x71/0xab
  print_address_description+0x6a/0x270
  kasan_report+0x258/0x380
  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
  gfs2_log_flush+0x511/0xa70 [gfs2]
  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
  ? __brelse+0x48/0x50
  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
  ? gfs2_trans_end+0x18d/0x340 [gfs2]
  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
  ? inode_go_dump+0xe0/0xe0 [gfs2]
  ? inode_go_sync+0xe4/0x220 [gfs2]
  inode_go_sync+0xe4/0x220 [gfs2]
  do_xmote+0x12b/0x290 [gfs2]
  glock_work_func+0x6f/0x160 [gfs2]
  process_one_work+0x461/0x790
  worker_thread+0x69/0x6b0
  ? process_one_work+0x790/0x790
  kthread+0x1ae/0x1d0
  ? kthread_create_worker_on_cpu+0xc0/0xc0
  ret_from_fork+0x22/0x40

Allocated by task 20805:
  kasan_kmalloc+0xa0/0xd0
  kmem_cache_alloc+0xb5/0x1b0
  gfs2_glock_get+0x14b/0x620 [gfs2]
  gfs2_inode_lookup+0x20c/0x640 [gfs2]
  gfs2_dir_search+0x150/0x180 [gfs2]
  gfs2_lookupi+0x272/0x360 [gfs2]
  __gfs2_lookup+0x8b/0x1d0 [gfs2]
  gfs2_atomic_open+0x77/0x100 [gfs2]
  path_openat+0x1454/0x1c10
  do_filp_open+0x124/0x1d0
  do_sys_open+0x213/0x2c0
  do_syscall_64+0x69/0x160
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 0:
  __kasan_slab_free+0x130/0x180
  kmem_cache_free+0x78/0x1e0
  rcu_process_callbacks+0x2ad/0x6c0
  __do_softirq+0x111/0x38c

The buggy address belongs to the object at 88801aff6040
  which belongs to the cache gfs2_glock(aspace) of size 560
The buggy address is located 244 bytes inside of
  560-byte region [88801aff6040, 88801aff6270)
...

Signed-off-by: Ross Lagerwall 
---
  fs/gfs2/aops.c| 3 +--
  fs/gfs2/lops.c| 2 +-
  fs/gfs2/meta_io.c | 2 +-
  fs/gfs2/trans.c   | 9 -
  fs/gfs2/trans.h   | 2 ++
  5 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 05dd78f4b2b3..8c2b572a7fb1 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -868,9 +868,8 @@ int gfs2_releasepage(struct page *page, gfp_t gfp_mask)
gfs2_assert_warn(sdp, bd->bd_bh == bh);
if (!list_empty(&bd->bd_list))
	list_del_init(&bd->bd_list);
-   bd->bd_bh = NULL;
bh->b_private = NULL;
-   kmem_cache_free(gfs2_bufdata_cachep, bd);
+   gfs2_free_bufdata(bd);
}
  
  		bh = bh->b_this_page;

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 94dcab655bc0..f40be71677d1 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -847,7 +847,7 @@ static void revoke_lo_after_commit(struct gfs2_sbd *sdp, 
struct gfs2_trans *tr)
gl = bd->bd_gl;
atomic_dec(&gl->gl_revokes);
clear_bit(GLF_LFLUSH, &gl->gl_flags);
-   kmem_cache_free(gfs2_bufdata_cachep, bd);
+   gfs2_free_bufdata(bd);
}
  }
  
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c

index be9c0bf697fe..868caa0eb104 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -355,7 +355,7 @@ void gfs2_remove_from_journal(struct buffer_head *bh, int 
meta)
gfs2_trans_add_revoke(sdp, bd);
} else if (was_pinned) {
bh->b_private = NULL;
-   kmem_cache_free(gfs2_bufdata_cachep, bd);
+   gfs2_free_bufdata(bd);
}
spin_unlock(&sdp->sd_ail_lock);
}
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index cd9a94a6b5bb..423cbee8fa08 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -133,9 +133,16 @@ static struct gfs2_bufdata *gfs2_alloc_bufdata(struct 
gfs2_glock *gl,
bd->bd_gl = gl;
INIT_LIST_HEAD(&bd->bd_list);
bh->b_private = bd;
+   gfs2_glock_hold(gl);
return bd;
  }
  
+void gfs2_free_bufdata(struct gfs2_bufdata *bd)

+{
+   gfs2_glock_put(bd->bd_gl);
+   kmem_cache_free(gfs2_bufdata_cachep, bd);
+}
+
  /**
   * gfs2_trans_add_data - Add a databuf to the transaction.
   * @gl: The inode glock associated with the buffer
@@ -265,7 +272,7 @@ void gfs2_trans_add_unrevoke(struct gfs2_sbd *sdp, u64 

Re: [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative

2019-01-31 Thread Steven Whitehouse

Hi,

On 31/01/2019 10:55, Ross Lagerwall wrote:

Under certain conditions, lru_count may drop below zero resulting in
a large amount of log spam like this:

vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
 negative objects to delete nr=-1

This happens as follows:
1) A glock is moved from lru_list to the dispose list and lru_count is
decremented.
2) The dispose function calls cond_resched() and drops the lru lock.
3) Another thread takes the lru lock and tries to add the same glock to
lru_list, checking if the glock is on an lru list.
4) It is on a list (actually the dispose list) and so it avoids
incrementing lru_count.
5) The glock is moved to lru_list.
6) The original thread doesn't dispose it because it has been re-added
to the lru list but the lru_count has still decreased by one.

Fix by checking if the LRU flag is set on the glock rather than checking
if the glock is on some list and rearrange the code so that the LRU flag
is added/removed precisely when the glock is added/removed from lru_list.

Signed-off-by: Ross Lagerwall 


I'm glad we've got to the bottom of that one. Excellent work debugging 
that! Many thanks for the fix,


Steve.



---
  fs/gfs2/glock.c | 16 +---
  1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index b92740edc416..53e6c7e0c1b3 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -185,13 +185,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
  {
	spin_lock(&lru_lock);

-	if (!list_empty(&gl->gl_lru))
-   list_del_init(&gl->gl_lru);
-   else
+   list_del(&gl->gl_lru);
+   list_add_tail(&gl->gl_lru, &lru_list);
+
+   if (!test_bit(GLF_LRU, &gl->gl_flags)) {
+   set_bit(GLF_LRU, &gl->gl_flags);
atomic_inc(&lru_count);
+   }

-	list_add_tail(&gl->gl_lru, &lru_list);
-   set_bit(GLF_LRU, &gl->gl_flags);
spin_unlock(&lru_lock);
  }
  
@@ -201,7 +202,7 @@ static void gfs2_glock_remove_from_lru(struct gfs2_glock *gl)

return;
  
 	spin_lock(&lru_lock);
-   if (!list_empty(&gl->gl_lru)) {
+   if (test_bit(GLF_LRU, &gl->gl_flags)) {
list_del_init(&gl->gl_lru);
atomic_dec(&lru_count);
clear_bit(GLF_LRU, &gl->gl_flags);
@@ -1456,6 +1457,7 @@ __acquires(&lru_lock)
if (!spin_trylock(&gl->gl_lockref.lock)) {
  add_back_to_lru:
list_add(&gl->gl_lru, &lru_list);
+   set_bit(GLF_LRU, &gl->gl_flags);
atomic_inc(&lru_count);
continue;
}
@@ -1463,7 +1465,6 @@ __acquires(&lru_lock)
spin_unlock(&gl->gl_lockref.lock);
goto add_back_to_lru;
}
-   clear_bit(GLF_LRU, &gl->gl_flags);
gl->gl_lockref.count++;
if (demote_ok(gl))
handle_callback(gl, LM_ST_UNLOCKED, 0, false);
@@ -1498,6 +1499,7 @@ static long gfs2_scan_glock_lru(int nr)
if (!test_bit(GLF_LOCK, &gl->gl_flags)) {
list_move(&gl->gl_lru, &dispose);
atomic_dec(&lru_count);
+   clear_bit(GLF_LRU, &gl->gl_flags);
freed++;
continue;
}




[Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free

2019-01-31 Thread Ross Lagerwall
Each gfs2_bufdata stores a reference to a glock but the reference count
isn't incremented. This causes an occasional use-after-free of the
glock. Fix by taking a reference on the glock during allocation and
dropping it when freeing.

Found by KASAN:

BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371

CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
Workqueue: glock_workqueue glock_work_func [gfs2]
Call Trace:
 dump_stack+0x71/0xab
 print_address_description+0x6a/0x270
 kasan_report+0x258/0x380
 ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
 revoke_lo_after_commit+0x8e/0xe0 [gfs2]
 gfs2_log_flush+0x511/0xa70 [gfs2]
 ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
 ? __brelse+0x48/0x50
 ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
 ? gfs2_trans_end+0x18d/0x340 [gfs2]
 gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
 ? inode_go_dump+0xe0/0xe0 [gfs2]
 ? inode_go_sync+0xe4/0x220 [gfs2]
 inode_go_sync+0xe4/0x220 [gfs2]
 do_xmote+0x12b/0x290 [gfs2]
 glock_work_func+0x6f/0x160 [gfs2]
 process_one_work+0x461/0x790
 worker_thread+0x69/0x6b0
 ? process_one_work+0x790/0x790
 kthread+0x1ae/0x1d0
 ? kthread_create_worker_on_cpu+0xc0/0xc0
 ret_from_fork+0x22/0x40

Allocated by task 20805:
 kasan_kmalloc+0xa0/0xd0
 kmem_cache_alloc+0xb5/0x1b0
 gfs2_glock_get+0x14b/0x620 [gfs2]
 gfs2_inode_lookup+0x20c/0x640 [gfs2]
 gfs2_dir_search+0x150/0x180 [gfs2]
 gfs2_lookupi+0x272/0x360 [gfs2]
 __gfs2_lookup+0x8b/0x1d0 [gfs2]
 gfs2_atomic_open+0x77/0x100 [gfs2]
 path_openat+0x1454/0x1c10
 do_filp_open+0x124/0x1d0
 do_sys_open+0x213/0x2c0
 do_syscall_64+0x69/0x160
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 0:
 __kasan_slab_free+0x130/0x180
 kmem_cache_free+0x78/0x1e0
 rcu_process_callbacks+0x2ad/0x6c0
 __do_softirq+0x111/0x38c

The buggy address belongs to the object at 88801aff6040
 which belongs to the cache gfs2_glock(aspace) of size 560
The buggy address is located 244 bytes inside of
 560-byte region [88801aff6040, 88801aff6270)
...

Signed-off-by: Ross Lagerwall 
---
 fs/gfs2/aops.c| 3 +--
 fs/gfs2/lops.c| 2 +-
 fs/gfs2/meta_io.c | 2 +-
 fs/gfs2/trans.c   | 9 -
 fs/gfs2/trans.h   | 2 ++
 5 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 05dd78f4b2b3..8c2b572a7fb1 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -868,9 +868,8 @@ int gfs2_releasepage(struct page *page, gfp_t gfp_mask)
gfs2_assert_warn(sdp, bd->bd_bh == bh);
if (!list_empty(&bd->bd_list))
	list_del_init(&bd->bd_list);
-   bd->bd_bh = NULL;
bh->b_private = NULL;
-   kmem_cache_free(gfs2_bufdata_cachep, bd);
+   gfs2_free_bufdata(bd);
}
 
bh = bh->b_this_page;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 94dcab655bc0..f40be71677d1 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -847,7 +847,7 @@ static void revoke_lo_after_commit(struct gfs2_sbd *sdp, 
struct gfs2_trans *tr)
gl = bd->bd_gl;
atomic_dec(&gl->gl_revokes);
clear_bit(GLF_LFLUSH, &gl->gl_flags);
-   kmem_cache_free(gfs2_bufdata_cachep, bd);
+   gfs2_free_bufdata(bd);
}
 }
 
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index be9c0bf697fe..868caa0eb104 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -355,7 +355,7 @@ void gfs2_remove_from_journal(struct buffer_head *bh, int 
meta)
gfs2_trans_add_revoke(sdp, bd);
} else if (was_pinned) {
bh->b_private = NULL;
-   kmem_cache_free(gfs2_bufdata_cachep, bd);
+   gfs2_free_bufdata(bd);
}
spin_unlock(&sdp->sd_ail_lock);
}
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index cd9a94a6b5bb..423cbee8fa08 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -133,9 +133,16 @@ static struct gfs2_bufdata *gfs2_alloc_bufdata(struct 
gfs2_glock *gl,
bd->bd_gl = gl;
INIT_LIST_HEAD(&bd->bd_list);
bh->b_private = bd;
+   gfs2_glock_hold(gl);
return bd;
 }
 
+void gfs2_free_bufdata(struct gfs2_bufdata *bd)
+{
+   gfs2_glock_put(bd->bd_gl);
+   kmem_cache_free(gfs2_bufdata_cachep, bd);
+}
+
 /**
  * gfs2_trans_add_data - Add a databuf to the transaction.
  * @gl: The inode glock associated with the buffer
@@ -265,7 +272,7 @@ void gfs2_trans_add_unrevoke(struct gfs2_sbd *sdp, u64 
blkno, unsigned int len)
list_del_init(&bd->bd_list);
gfs2_assert_withdraw(sdp, sdp->sd_log_num_revoke);
sdp->sd_log_num_revoke--;
-   kmem_cache_free(gfs2_bufdata_cachep, bd);
+   gfs2_free_bufdata(bd);

[Cluster-devel] [PATCH 0/2] GFS2 counting fixes

2019-01-31 Thread Ross Lagerwall
Here are a couple of fixes for GFS2 (ref-)counting going wrong.

Ross Lagerwall (2):
  gfs2: Fix occasional glock use-after-free
  gfs2: Fix lru_count going negative

 fs/gfs2/aops.c|  3 +--
 fs/gfs2/glock.c   | 16 +---
 fs/gfs2/lops.c|  2 +-
 fs/gfs2/meta_io.c |  2 +-
 fs/gfs2/trans.c   |  9 -
 fs/gfs2/trans.h   |  2 ++
 6 files changed, 22 insertions(+), 12 deletions(-)

-- 
2.17.2



[Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative

2019-01-31 Thread Ross Lagerwall
Under certain conditions, lru_count may drop below zero resulting in
a large amount of log spam like this:

vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
negative objects to delete nr=-1

This happens as follows:
1) A glock is moved from lru_list to the dispose list and lru_count is
   decremented.
2) The dispose function calls cond_resched() and drops the lru lock.
3) Another thread takes the lru lock and tries to add the same glock to
   lru_list, checking if the glock is on an lru list.
4) It is on a list (actually the dispose list) and so it avoids
   incrementing lru_count.
5) The glock is moved to lru_list.
6) The original thread doesn't dispose it because it has been re-added
   to the lru list but the lru_count has still decreased by one.

Fix by checking if the LRU flag is set on the glock rather than checking
if the glock is on some list and rearrange the code so that the LRU flag
is added/removed precisely when the glock is added/removed from lru_list.

Signed-off-by: Ross Lagerwall 
---
 fs/gfs2/glock.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index b92740edc416..53e6c7e0c1b3 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -185,13 +185,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
 {
	spin_lock(&lru_lock);
 
-   if (!list_empty(&gl->gl_lru))
-   list_del_init(&gl->gl_lru);
-   else
+   list_del(&gl->gl_lru);
+   list_add_tail(&gl->gl_lru, &lru_list);
+
+   if (!test_bit(GLF_LRU, &gl->gl_flags)) {
+   set_bit(GLF_LRU, &gl->gl_flags);
atomic_inc(&lru_count);
+   }
 
-   list_add_tail(&gl->gl_lru, &lru_list);
-   set_bit(GLF_LRU, &gl->gl_flags);
spin_unlock(&lru_lock);
 }
 
@@ -201,7 +202,7 @@ static void gfs2_glock_remove_from_lru(struct gfs2_glock 
*gl)
return;
 
	spin_lock(&lru_lock);
-   if (!list_empty(&gl->gl_lru)) {
+   if (test_bit(GLF_LRU, &gl->gl_flags)) {
list_del_init(&gl->gl_lru);
atomic_dec(&lru_count);
clear_bit(GLF_LRU, &gl->gl_flags);
@@ -1456,6 +1457,7 @@ __acquires(&lru_lock)
if (!spin_trylock(&gl->gl_lockref.lock)) {
 add_back_to_lru:
list_add(&gl->gl_lru, &lru_list);
+   set_bit(GLF_LRU, &gl->gl_flags);
atomic_inc(&lru_count);
continue;
}
@@ -1463,7 +1465,6 @@ __acquires(&lru_lock)
spin_unlock(&gl->gl_lockref.lock);
goto add_back_to_lru;
}
-   clear_bit(GLF_LRU, &gl->gl_flags);
gl->gl_lockref.count++;
if (demote_ok(gl))
handle_callback(gl, LM_ST_UNLOCKED, 0, false);
@@ -1498,6 +1499,7 @@ static long gfs2_scan_glock_lru(int nr)
if (!test_bit(GLF_LOCK, &gl->gl_flags)) {
list_move(&gl->gl_lru, &dispose);
atomic_dec(&lru_count);
+   clear_bit(GLF_LRU, &gl->gl_flags);
freed++;
continue;
}
-- 
2.17.2