On 24/01/18 17:04, Bob Peterson wrote:
Hi,
----- Original Message -----
| | My comments in relation to the previous patch still stand. This is still
| | the wrong answer to the problem. If we have lock contention here, then
| | we should be looking to fix the locking and not trying to avoid the
| | issue by choosing a different rgrp.
| This is also not too different from how we gained performance
| boosts with the Orlov algorithm to assign unique rgrps to groups
| of files, which also spread the distance of the writes.
|
| This will not increase file fragmentation because that is all
| mitigated by the multi-block allocation algorithms we have
| in place.
I've done some more analysis of the scale-out performance with and
without these two patches, with help from Barry in the performance group.
Attached to this email is an output file that compares performance results
from a stock test kernel to the same test kernel plus my previously posted
patches for intra-node congestion. You can see from this summary grep
that the scalability is vastly better with the patch:
[root@perf82 bin]# pr -m -T -w 160 stock829.log iozone16.log | grep 'Parent'

Parent sees throughput for N initial writers (stock vs. patched):
 1 writers:  543094.03 kB/sec  vs  539514.33 kB/sec
 2 writers:  621523.39 kB/sec  vs  625554.26 kB/sec
 4 writers:  244177.34 kB/sec  vs  803140.31 kB/sec
 6 writers:  196014.09 kB/sec  vs  765439.86 kB/sec
 8 writers:  157997.14 kB/sec  vs  760862.85 kB/sec
10 writers:  155714.45 kB/sec  vs  829598.76 kB/sec
12 writers:  133166.40 kB/sec  vs  744604.51 kB/sec
14 writers:  146353.44 kB/sec  vs  776214.99 kB/sec
16 writers:   84699.82 kB/sec  vs  796142.16 kB/sec
In other words, with my patches the 16-thread iozone run achieved 97.5%
parallel efficiency, compared to 24.4% on the stock kernel.
Looking at the file fragmentation data in the same file, you can see that
file fragmentation is also noticeably improved in nearly all runs with the
patch in the 16-process case:
Fragmentation Summary (extents found, stock vs. patched):
/saswork/iozonetest1:   8622 vs 8435
/saswork/iozonetest2:   8464 vs 8509
/saswork/iozonetest3:   8635 vs 8448
/saswork/iozonetest4:   8659 vs 9203
/saswork/iozonetest5:   8647 vs 8434
/saswork/iozonetest6:   8669 vs 8435
/saswork/iozonetest7:   8656 vs 8309
/saswork/iozonetest8:   8660 vs 8410
/saswork/iozonetest9:   8607 vs 8435
/saswork/iozonetest10:  8614 vs 8322
/saswork/iozonetest11:  8617 vs 8312
/saswork/iozonetest12:  8658 vs 8319
/saswork/iozonetest13:  8469 vs 8435
/saswork/iozonetest14:  8650 vs 8315
/saswork/iozonetest15:  8649 vs 8489
/saswork/iozonetest16:  8653 vs 8436
See the attachment for more details.
And as I said in my previous email, overall file _system_ fragmentation
is, IMHO, not a big concern in the light of these performance gains.
I'll remove the reference to HZ from the first patch: good call.
I still see no reason to throw away the second patch, at least while we work
on a longer-term solution.
Regards,
Bob Peterson
Red Hat File Systems
I've no doubt that this is an issue that we need to address. The
question is how to do it, rather than if. However, iozone is not a great
workload for testing this kind of thing, and if the test was run on a
brand-new filesystem, that will also have a significant effect on the
fragmentation.
Another issue is whether the (number of nodes x number of I/O threads
per node) <= (number of rgrps) since in the case of a new filesystem,
that will make a significant difference in terms of whether contention
is seen or not. The proposed new algorithm seems to assume that this
will always be the case. If it is not the case, then we will see
significant contention on the rgrps. If we resolve this by doing more of
the allocation/deallocation in parallel on a single rgrp, then we only
require (number of rgrps) > (number of nodes) which is a less onerous
requirement.
I'm not proposing that we should throw out the patch either, but we
should try to do things in a different order. Since the problem is
really a single-node issue - we want to be able to share rgrps better
between threads on a single node - it should be addressed as
such. Adding a new flag to allow us to share an exclusive cluster lock
among local processes should not be tricky to do. I've attached an
(untested) patch that should allow that. The new semantics are that
exclusive locks which have the GL_SHARED_EX flag set will be shared
locally. They will not be shared with exclusive locks that do not have
that flag, which gives us a way to introduce the change gradually:
each glock can be audited for safety individually and the flag added
once we know it is safe.
It will likely be easier to start with the deallocation paths, since the
critical paths are shorter and easier to identify. We can probably just
add a suitable lock to cover the changes to the rgrp as a wrapper, and
we might also be able to shrink the coverage of that lock over time,
too. It needs to cover the bitmap changes and the summary info changes
so that the two are still atomic with respect to other local threads
working on the same rgrp. If we are careful we might even be able to
make the lock a spinlock eventually, but it will probably need to be a
per-rgrp mutex or rwsem to start with.
There will need to be some careful audit in order to ensure that the new
per-rgrp lock covers the correct parts of the code, and that we minimise
those parts, in order to allow maximum parallelism. However, if we can
achieve that, then we should get much greater gains in performance
overall. Not least because this applies equally to allocation and
deallocation, and your proposed new algorithm only addresses the
allocation side of things.
Once we have resolved the intra-node issues, then we can go back and
take a second look at whether we can get even more gain from inter-node
changes. The points I made in the previous emails still stand though, in
that we should not be trying to solve an intra-node issue with an
inter-node solution.
The iozone test is not a very good test case for this - the worst case
would be where we had large numbers of threads on each node, all writing
out very small files. In that case we definitely want them grouped
together for future reads and not spread over the disk in many different
rgrps. Imagine someone running a find over such a filesystem, for
example. We don't want that to degenerate into random reads, but to
retain at least some locality. There will never be an algorithm that
will be able to cope with every workload, but we should be able to get
the same result (or better!) in terms of performance without having to
change the allocation algorithm so drastically.
Steve.
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 11066d8647d2..02a4137fcdfe 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -280,9 +280,12 @@ void gfs2_glock_put(struct gfs2_glock *gl)
 static inline int may_grant(const struct gfs2_glock *gl, const struct gfs2_holder *gh)
 {
         const struct gfs2_holder *gh_head = list_entry(gl->gl_holders.next, const struct gfs2_holder, gh_list);
+
         if ((gh->gh_state == LM_ST_EXCLUSIVE ||
-             gh_head->gh_state == LM_ST_EXCLUSIVE) && gh != gh_head)
-                return 0;
+             gh_head->gh_state == LM_ST_EXCLUSIVE) && (gh != gh_head) &&
+            ((gh->gh_flags & gh_head->gh_flags & GL_SHARED_EX) == 0)) {
+                return 0;
+        }
         if (gl->gl_state == gh->gh_state)
                 return 1;
         if (gh->gh_flags & GL_EXACT)
@@ -1689,6 +1692,8 @@ static const char *hflags2str(char *buf, u16 flags, unsigned long iflags)
                 *p++ = 'E';
         if (flags & GL_NOCACHE)
                 *p++ = 'c';
+        if (flags & GL_SHARED_EX)
+                *p++ = 's';
         if (test_bit(HIF_HOLDER, &iflags))
                 *p++ = 'H';
         if (test_bit(HIF_WAIT, &iflags))
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 5e12220cc0c2..0841915490e3 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -89,6 +89,7 @@ enum {
 #define GL_EXACT                0x0080
 #define GL_SKIP                 0x0100
 #define GL_NOCACHE              0x0400
+#define GL_SHARED_EX            0x0800
 
 /*
  * lm_async_cb return flags