On 24/01/18 17:04, Bob Peterson wrote:
Hi,
----- Original Message -----
| | My comments in relation to the previous patch still stand. This is still
| | the wrong answer to the problem. If we have lock contention here, then
| | we should be looking to fix the locking and not trying to avoid the
| | issue by choosing a different rgrp.
| This is also not too different from how we gained performance
| boosts with the Orlov algorithm to assign unique rgrps to groups
| of files, which also spread the distance of the writes.
|
| This will not increase file fragmentation because that is all
| mitigated by the multi-block allocation algorithms we have
| in place.
I've done some more analysis of the scale-out performance with and
without these two patches, with help from Barry in the performance group.
Attached to this email is an output file that compares performance results
from a stock test kernel to the same test kernel plus my previously posted
patches for intra-node congestion. You can see from this summary grep
that the scalability is vastly better with the patch:
[root@perf82 bin]# pr -m -T -w 160 stock829.log iozone16.log | grep 'Parent'

Parent sees throughput for N initial writers (stock vs. patched):
 1 writers:  543094.03 kB/sec  vs  539514.33 kB/sec
 2 writers:  621523.39 kB/sec  vs  625554.26 kB/sec
 4 writers:  244177.34 kB/sec  vs  803140.31 kB/sec
 6 writers:  196014.09 kB/sec  vs  765439.86 kB/sec
 8 writers:  157997.14 kB/sec  vs  760862.85 kB/sec
10 writers:  155714.45 kB/sec  vs  829598.76 kB/sec
12 writers:  133166.40 kB/sec  vs  744604.51 kB/sec
14 writers:  146353.44 kB/sec  vs  776214.99 kB/sec
16 writers:   84699.82 kB/sec  vs  796142.16 kB/sec
In other words, with my patches the 16-thread iozone run achieved 97.5%
parallel efficiency, compared to 24.4% on the stock kernel.
Looking at the file fragmentation data in the same file, you can see that
file fragmentation is also noticeably improved in nearly all runs with the
patch in the 16-process case:
Fragmentation Summary (extents found, stock vs. patched):
/saswork/iozonetest1:   8622 vs 8435
/saswork/iozonetest2:   8464 vs 8509
/saswork/iozonetest3:   8635 vs 8448
/saswork/iozonetest4:   8659 vs 9203
/saswork/iozonetest5:   8647 vs 8434
/saswork/iozonetest6:   8669 vs 8435
/saswork/iozonetest7:   8656 vs 8309
/saswork/iozonetest8:   8660 vs 8410
/saswork/iozonetest9:   8607 vs 8435
/saswork/iozonetest10:  8614 vs 8322
/saswork/iozonetest11:  8617 vs 8312
/saswork/iozonetest12:  8658 vs 8319
/saswork/iozonetest13:  8469 vs 8435
/saswork/iozonetest14:  8650 vs 8315
/saswork/iozonetest15:  8649 vs 8489
/saswork/iozonetest16:  8653 vs 8436
See the attachment for more details.
And as I said in my previous email, overall file _system_ fragmentation
is, IMHO, not a big concern in the light of these performance gains.
I'll remove the reference to HZ from the first patch: good call.
I still see no reason to throw away the second patch, at least while we work
on a longer-term solution.
Regards,
Bob Peterson
Red Hat File Systems
I've no doubt that this is an issue that we need to address. The
question is how to do it, rather than if. However, iozone is not a great
workload for testing this kind of thing, and if the test was run on a
brand-new filesystem, that will also have a significant effect on the
fragmentation.
Another issue is whether the (number of nodes x number of I/O threads
per node) <= (number of rgrps) since in the case of a new filesystem,
that will make a significant difference in terms of whether contention
is seen or not. The proposed new algorithm seems to assume that this
will always be the case. If it is not the case, then we will see
significant contention on the rgrps. If we resolve this by doing more of
the allocation/deallocation in parallel on a single rgrp, then we only
require (number of rgrps) > (number of nodes) which is a less onerous
requirement.
I'm not proposing that we should throw out the patch either, but we
should try to do things in a different order. Since the problem is
really a single-node issue - we want to be able to share rgrps better
between threads on a single node - it should be addressed as
such. Adding a new flag to allow us to share an exclusive cluster lock
among local processes should not be tricky to do. I've attached an
(untested) patch that should allow that. The new semantics are that
exclusive locks which have the GL_SHARED_EX flag set will be shared
locally. They will not be shared with exclusive locks that do not have
that flag, which gives us a way to introduce the change gradually:
each glock can be audited for safety individually and the flag added
once we know it is safe.
It will likely be easier to start with the deallocation paths, since the
critical paths are shorter and easier to identify. We can probably just
add a suitable lock to cover the changes to the rgrp as a wrapper, and
we might also be able to shrink the coverage of that lock over time,
too. It needs to cover the bitmap changes and the summary info changes
so that the two are still atomic with respect to other local threads
working on the same rgrp. If we are careful we might even be able to
make the lock a spinlock eventually, but it will probably need to be a
per-rgrp mutex or rwsem to start with.
There will need to be some careful audit in order to ensure that the new
per-rgrp lock covers the correct parts of the code, and that we minimise
those parts, in order to allow maximum parallelism. However, if we can
achieve that, then we should get much greater gains in performance
overall. Not least because this applies equally to allocation and
deallocation, and your proposed new algorithm only addresses the
allocation side of things.
Once we have resolved the intra-node issues, then we can go back and
take a second look at whether we can get even more gain from inter-node
changes. The points I made in the previous emails still stand though, in
that we should not be trying to solve an intra-node issue with an
inter-node solution.
The iozone test is not a very good test case for this - the worst case
would be where we had large numbers of threads on each node, all writing
out very small files. In that case we definitely want them grouped
together for future reads and not spread over the disk in many different
rgrps. Imagine someone running a find over such a filesystem, for
example. We don't want that to degenerate into random reads, but to
retain at least some locality. There will never be an algorithm that
will be able to cope with every workload, but we should be able to get
the same result (or better!) in terms of performance without having to
change the allocation algorithm so drastically.
Steve.
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 11066d8647d2..02a4137fcdfe 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -280,9 +280,12 @@ void gfs2_glock_put(struct gfs2_glock *gl)
 static inline int may_grant(const struct gfs2_glock *gl, const struct gfs2_holder *gh)
 {
         const struct gfs2_holder *gh_head = list_entry(gl->gl_holders.next, const struct gfs2_holder, gh_list);
+
         if ((gh->gh_state == LM_ST_EXCLUSIVE ||
-             gh_head->gh_state == LM_ST_EXCLUSIVE) && gh != gh_head)
-                return 0;
+             gh_head->gh_state == LM_ST_EXCLUSIVE) && (gh != gh_head) &&
+            ((gh->gh_flags & gh_head->gh_flags & GL_SHARED_EX) == 0)) {
+                return 0;
+        }
         if (gl->gl_state == gh->gh_state)
                 return 1;
         if (gh->gh_flags & GL_EXACT)
@@ -1689,6 +1692,8 @@ static const char *hflags2str(char *buf, u16 flags, unsigned long iflags)
                 *p++ = 'E';
         if (flags & GL_NOCACHE)
                 *p++ = 'c';
+        if (flags & GL_SHARED_EX)
+                *p++ = 's';
         if (test_bit(HIF_HOLDER, &iflags))
                 *p++ = 'H';
         if (test_bit(HIF_WAIT, &iflags))
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 5e12220cc0c2..0841915490e3 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -89,6 +89,7 @@ enum {
 #define GL_EXACT                0x0080
 #define GL_SKIP                 0x0100
 #define GL_NOCACHE              0x0400
+#define GL_SHARED_EX            0x0800
 
 /*
  * lm_async_cb return flags