Re: [Cluster-devel] [GFS2 PATCH] GFS2: Don't brelse rgrp buffer_heads every allocation

2015-06-15 Thread Steven Whitehouse

Hi,


On 12/06/15 20:50, Bob Peterson wrote:

- Original Message -

Hi,


On 09/06/15 15:45, Bob Peterson wrote:

- Original Message -

Hi,


On 05/06/15 15:49, Bob Peterson wrote:

Hi,

This patch allows the block allocation code to retain the buffers
for the resource groups so they don't need to be re-read from buffer
cache with every request. This is a performance improvement that's
especially noticeable when resource groups are very large. For
example, with 2GB resource groups and 4K blocks, there can be 33
blocks for every resource group. This patch allows those 33 buffers
to be kept around and not read in and thrown away with every
operation. The buffers are released when the resource group is
either synced or invalidated.
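
To illustrate the idea (a rough sketch only, not the actual patch; the structure and field names below are taken from the in-core rgrp definitions as best I recall them), the bitmap buffer_heads stay referenced on the rgrp across allocations and are only dropped from the glock sync/invalidate paths, e.g.:

/*
 * Sketch: drop the retained bitmap buffers for one rgrp.  With 2GB
 * rgrps and 4K blocks: 2GB / 4KB = 524288 blocks, at 2 bits each
 * = 128KB of bitmap = 32 bitmap blocks, plus the rgrp header = the
 * 33 buffers mentioned above.  Called from the rgrp glock's
 * go_sync/go_inval paths instead of after every allocation.
 */
static void rgrp_release_bitmap_buffers(struct gfs2_rgrpd *rgd)
{
	int x;

	for (x = 0; x < rgd->rd_length; x++) {
		struct gfs2_bitmap *bi = rgd->rd_bits + x;

		if (bi->bi_bh) {
			brelse(bi->bi_bh);	/* release the held reference */
			bi->bi_bh = NULL;
		}
	}
}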

The blocks should be cached between operations, so this should only be
saving the lookup of the cached blocks, with no change to the actual I/O.
Does that mean that grab_cache_page() is slow, I wonder? Or is this an
issue of going around the retry loop due to lack of memory at some stage?

How does this interact with the rgrplvb support? I'd guess that with
that turned on, this is no longer an issue, because we'd only read in
the blocks for the rgrps that we are actually going to use?

Steve.

Hi,

If you compare the two vmstat outputs in bugzilla #1154782, you'll
see no significant difference in memory usage or CPU usage. So I assume
the page lookup is the slow part; not because it's inherently slow,
but because it's done 33 times per read-reference-invalidate cycle
(33 pages to look up per rgrp).

Regards,

Bob Peterson
Red Hat File Systems

That's true; however, as I understand the problem here, the issue is not
reading in the blocks for the rgrp that is eventually selected for use,
but reading in the blocks for the rgrps that we reject, for
whatever reason (full, or congested, or whatever). So with rgrplvb
enabled, we don't read those rgrps in from disk at all in most cases
- so I was wondering whether that solves the problem without needing
this change?

Ideally I'd like to make the rgrplvb setting the default, since it is
much more efficient. The question is how we can do that and still remain
backward compatible? Not an easy one to answer :(

Also, if the page lookup is the slow thing, then we should look at using
pagevec_lookup() to get the pages in chunks rather than doing it
individually (and indeed, multiple times per page, when the block size
is less than the page size). We know that the blocks will always be
contiguous on disk, so we should be able to send down large I/Os, rather
than relying on the block stack to merge them as we do at the moment,
which should be a further improvement too.
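
Something like this, perhaps (untested sketch only; the helper name and its arguments are invented for illustration, using the pagevec API):

#include <linux/pagevec.h>
#include <linux/pagemap.h>

/*
 * Sketch: look up the cached pages backing an rgrp's bitmap blocks in
 * chunks, rather than one grab_cache_page() call per block (and per
 * buffer within the page when blocksize < pagesize).
 */
static void rgrp_lookup_bitmap_pages(struct address_space *mapping,
				     pgoff_t first, unsigned nr_pages)
{
	struct pagevec pvec;
	unsigned i, found;

	pagevec_init(&pvec, 0);
	while (nr_pages) {
		/* One lookup returns up to PAGEVEC_SIZE pages at a time. */
		found = pagevec_lookup(&pvec, mapping, first, nr_pages);
		if (!found)
			break;
		for (i = 0; i < found; i++) {
			struct page *page = pvec.pages[i];

			/* ... attach buffer_heads, build one large I/O ... */
			first = page->index + 1;
		}
		nr_pages -= found;
		pagevec_release(&pvec);	/* drops the page references */
	}
}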

Steve.

Hi,

The rgrplvb mount option only helps if the file system is using lock_dlm.
For lock_nolock, it's still just as slow because lock_nolock has no knowledge
of LVBs. Now, granted, that's an unusual case because GFS2 is normally used
with lock_dlm.
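
For reference, this is the sort of shortcut rgrplvb makes possible (illustrative only - the helper below is hypothetical, with the LVB fields as I recall them from gfs2_ondisk.h):

/*
 * Sketch: with -o rgrplvb the DLM lock value block carries a small
 * summary of the rgrp (struct gfs2_rgrp_lvb), so a node can reject a
 * full or no-alloc rgrp without reading its ~33 bitmap blocks at all.
 * lock_nolock carries no LVBs, so this shortcut never applies there.
 */
static bool rgrp_suitable_by_lvb(const struct gfs2_rgrpd *rgd,
				 u32 blocks_wanted)
{
	const struct gfs2_rgrp_lvb *rgl = rgd->rd_rgl;

	if (rgl->rl_flags & cpu_to_be32(GFS2_RGF_NOALLOC))
		return false;		/* allocation disabled in this rgrp */
	return be32_to_cpu(rgl->rl_free) >= blocks_wanted;
}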
That sounds like a bug... it should work in the same way, even with 
lock_nolock.

I like the idea of making rgrplvb the default mount option, and I don't
see a problem doing that.
The issue is that it will not be backwards compatible. We really need a 
way to at least easily detect if someone has a mixed cluster (some nodes 
with rgrplvb enabled, and some without); otherwise it might be confusing
when we get odd reports of allocations failing, even when there appears 
to be free space.


We need to treat what we put into LVBs in the same way as we treat the
on-disk format in terms of backwards compatibility.

I think the rgrplvb option should be compatible with this patch, but
I'll set up a test environment in order to test that they work together
harmoniously.

I also like the idea of using a pagevec for reading in multiple pages for
the rgrps, but that's another improvement for another day. If there's
not a bugzilla record open for that, perhaps we should open one.

Regards,

Bob Peterson
Red Hat File Systems


If we have rgrplvb, then we may not need this patch, since we will not be
looking up the rgrp's blocks as often. So we should see the benefit just
by turning that on, I think... at least it would be good to see whether
there is any performance difference there. In cases where we have nodes
competing for the rgrps, the blocks will not be cached anyway, so we will
gain no benefit from this patch, since we'll have to read the blocks
regardless; hence my thought that speeding up the lookup is the way to
go, since it will give the benefit in more cases - both when the rgrp
blocks are cached and when they are uncached,

Steve.

Re: [Cluster-devel] [GFS2 PATCH] GFS2: Don't add rgrp glocks to the lru

2015-06-15 Thread Steven Whitehouse

Hi,

On 12/06/15 19:29, Bob Peterson wrote:

Hi,

The glocks used for resource groups often come and go hundreds of
thousands of times per second. Adding them to the lru list just
adds unnecessary contention for the lru_lock spin_lock, especially
considering we're almost certainly going to re-use the glock and
take it back off the lru microseconds later. We never want the
glock shrinker to cull them anyway. This patch bypasses adding
them to the lru.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson rpete...@redhat.com

Acked-by: Steven Whitehouse swhit...@redhat.com

Steve.

---
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 0fa8062..1db06a5 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1076,7 +1076,8 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
 		    !test_bit(GLF_DEMOTE, &gl->gl_flags))
 			fast_path = 1;
 	}
-	if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl))
+	if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl) &&
+	    gl->gl_name.ln_type != LM_TYPE_RGRP)
 		gfs2_glock_add_to_lru(gl);
 
 	trace_gfs2_glock_queue(gh, 0);