I've gone through a few iterations of this patch set now with mixed reviews,
the previous attempt being a resend on 10 January.

Based on feedback, I've greatly simplified the patch set and broken
it down into the two simple patches given here. The first patch reworks
gfs2_rgrp_used_recently() to simplify it and use better calculations.
The second patch adds an "intra-node" congestion check to the existing
function, which previously checked only inter-node (DLM) congestion.
Ignoring intra-node congestion severely hurt single-node scalability.

This patch set greatly improves the scalability of the GFS2 block
allocator in cases where more than one process wants to allocate blocks.
Here are the results of a test I ran on a test machine using iozone with
a growing number of concurrent processes:

Before the two patches were applied:
        Children see throughput for  1 initial writers  =  543094.50 kB/sec
        Children see throughput for  2 initial writers  =  631279.31 kB/sec
        Children see throughput for  4 initial writers  =  618569.31 kB/sec
        Children see throughput for  6 initial writers  =  672926.77 kB/sec
        Children see throughput for  8 initial writers  =  620530.25 kB/sec
        Children see throughput for 10 initial writers  =  637743.89 kB/sec
        Children see throughput for 12 initial writers  =  625197.03 kB/sec
        Children see throughput for 14 initial writers  =  627233.04 kB/sec
        Children see throughput for 16 initial writers  =  346880.52 kB/sec

After the two patches were applied:
        Children see throughput for  1 initial writers  =  539514.88 kB/sec
        Children see throughput for  2 initial writers  =  630325.97 kB/sec
        Children see throughput for  4 initial writers  =  820960.05 kB/sec
        Children see throughput for  6 initial writers  =  773291.00 kB/sec
        Children see throughput for  8 initial writers  =  764553.85 kB/sec
        Children see throughput for 10 initial writers  =  837788.38 kB/sec
        Children see throughput for 12 initial writers  =  752443.34 kB/sec
        Children see throughput for 14 initial writers  =  781917.29 kB/sec
        Children see throughput for 16 initial writers  =  816540.99 kB/sec

With one or two processes running, the difference amounts to noise.
But even with only 4 concurrent processes, the block allocator has
roughly 33 percent better throughput with the patches (618569.31 vs.
820960.05 kB/sec). At 16 concurrent processes, the overall throughput
is more than double, although that number may be skewed by the fact
that the test machine has 2 sockets, each of which has 8 cores, and
some of the cores are used by the driver and its monitoring.
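
For anyone who wants to recompute the per-row change from the two tables
above, here is a minimal sketch in Python. The figures are copied from the
iozone output; the helper name percent_change and the choice to measure
the change relative to the "before" number are assumptions made for this
illustration, not part of the patches.

```python
# Throughput in kB/sec, keyed by number of concurrent iozone writers.
# Values are copied verbatim from the "before" and "after" tables.
before = {
    1: 543094.50, 2: 631279.31, 4: 618569.31, 6: 672926.77, 8: 620530.25,
    10: 637743.89, 12: 625197.03, 14: 627233.04, 16: 346880.52,
}
after = {
    1: 539514.88, 2: 630325.97, 4: 820960.05, 6: 773291.00, 8: 764553.85,
    10: 837788.38, 12: 752443.34, 14: 781917.29, 16: 816540.99,
}


def percent_change(old, new):
    """Change of new relative to old, in percent (hypothetical helper)."""
    return (new - old) / old * 100.0


# Per-writer-count change, measured against the unpatched throughput.
changes = {n: percent_change(before[n], after[n]) for n in before}

for n in sorted(changes):
    print(f"{n:2d} writers: {changes[n]:+7.1f}%")
```

Measuring against the "before" number answers "how much faster is the
patched kernel"; dividing by the "after" number instead would give a
smaller percentage for the same pair of results.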
---
Bob Peterson (2):
  GFS2: Simplify gfs2_rgrp_used_recently
  GFS2: Split gfs2_rgrp_congested into inter-node and intra-node cases

 fs/gfs2/rgrp.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 69 insertions(+), 14 deletions(-)

-- 
2.14.3
