Resending - previously, I forgot to send the message in plain text
mode. I'm sorry.

Hi Mel,

Thanks a lot for looking into this!

Our results are mostly in line with what you see. We observe big gains
(20-50%) when the system is loaded to about one third of its maximum
capacity, and mixed results at full load - some workloads benefit from
the patch, others do not, but the changes at full load are mostly
within the noise of the results (+/-5%). Overall, we think this patch
is helpful.

Jirka


On Mon, Sep 21, 2020 at 1:02 PM Mel Gorman <mgor...@techsingularity.net> wrote:
>
> On Thu, Sep 10, 2020 at 11:50:04PM +0200, Jirka Hladky wrote:
> > 1) Thread stability has improved a lot. We see far fewer thread
> > migrations. Check, for example, these heatmaps based on the mpstat
> > results collected while running the sp.C test from the NAS benchmark.
> > sp.C was run with 16 threads on a dual-socket AMD 7351 server with 8
> > NUMA nodes:
> > 5.9 vanilla 
> > https://drive.google.com/file/d/10rojhTSQUSu-1aGQi-srr99SmVQ3Llgo/view?usp=sharing
> > 5.9 with the patch
> > https://drive.google.com/file/d/1eZdTBWvWMNvdvXqitwAlcKZ7gb-ySQUl/view?usp=sharing
> >
> > The heatmaps are generated from the mpstat output (there are 5
> > different runs in one picture). We collect CPU utilization every 5
> > seconds. A lighter color corresponds to lower CPU utilization and
> > means that the thread may have run on different CPUs during that
> > 5-second interval. Solid straight lines, on the other hand, mean that
> > the thread was running on the same CPU the whole time. The difference
> > is striking.
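> >
> > For reference, a minimal sketch (not the exact script we used) of how
> > such a heatmap can be built from "mpstat -P ALL 5" logs with Python,
> > numpy and matplotlib. The field handling and the gray_r colour map
> > (lighter = lower utilization) are assumptions about a typical sysstat
> > layout, not a description of our tooling:
> >
> > import sys
> > import numpy as np
> > import matplotlib.pyplot as plt
> >
> > samples = {}      # cpu id -> list of %busy values, one per 5s sample
> > for line in open(sys.argv[1]):
> >     fields = line.split()
> >     if not fields or fields[0].startswith("Average"):
> >         continue                 # skip blanks and the summary block
> >     # the CPU number is the only all-digit field on a per-CPU data line
> >     digits = [i for i, f in enumerate(fields) if f.isdigit()]
> >     if not digits:
> >         continue                 # header lines and the "all" rows
> >     cpu = int(fields[digits[0]])
> >     busy = 100.0 - float(fields[-1])   # %idle is the last column
> >     samples.setdefault(cpu, []).append(busy)
> >
> > ncols = min(len(v) for v in samples.values())
> > grid = np.array([samples[c][:ncols] for c in sorted(samples)])
> >
> > plt.imshow(grid, aspect="auto", cmap="gray_r", vmin=0, vmax=100)
> > plt.xlabel("5-second sample")
> > plt.ylabel("CPU")
> > plt.colorbar(label="% busy")
> > plt.savefig("cpu-heatmap.png")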
> >
> > We see significantly fewer thread migrations across many different
> > tests - NAS, SPECjbb2005 and SPECjvm2008.
> >
>
> For all three, I see that the point where it's better or worse depends
> on overall activity. I looked at heatmaps for a variety of workloads
> and, visually, the bigger differences tend to be when utilisation is
> relatively low (e.g. one third of CPUs active).
>
> > 2) We also see a performance improvement in terms of runtime,
> > especially in low-load scenarios (number of threads roughly equal to
> > 2x the number of NUMA nodes). It fixes the performance drop we have
> > seen since the 5.7 kernel, discussed for example here:
> > https://lore.kernel.org/lkml/CAE4VaGB7+sR1nf3Ux8W=hgn46gnxryr0ubwju0oynk7h00y...@mail.gmail.com/
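> >
> > For clarity, "low load" here means roughly two runnable threads per
> > NUMA node. A tiny illustration (not taken from our scripts) of how
> > that thread count can be derived from sysfs; on the 8-node AMD 7351
> > server above it gives 16 threads:
> >
> > import glob
> >
> > numa_nodes = len(glob.glob("/sys/devices/system/node/node[0-9]*"))
> > low_load_threads = 2 * numa_nodes    # ~2 threads per NUMA node
> > print(f"low-load runs use about {low_load_threads} threads")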
> >
>
> This would make some sense given the original intent about allowing
> imbalances, which was later revised significantly. It depends on
> whether memory throughput is more important, so the impact varies with
> the level of utilisation. For example, on a 2-socket machine running
> SPECjbb2005 I see
>
> specjbb
>                                5.9.0-rc4              5.9.0-rc4
>                                  vanilla        dstbalance-v1r1
> Hmean     tput-1     46425.00 (   0.00%)    43394.00 *  -6.53%*
> Hmean     tput-2     98416.00 (   0.00%)    96031.00 *  -2.42%*
> Hmean     tput-3    150184.00 (   0.00%)   148783.00 *  -0.93%*
> Hmean     tput-4    200683.00 (   0.00%)   197906.00 *  -1.38%*
> Hmean     tput-5    236305.00 (   0.00%)   245549.00 *   3.91%*
> Hmean     tput-6    281559.00 (   0.00%)   285692.00 *   1.47%*
> Hmean     tput-7    338558.00 (   0.00%)   334467.00 *  -1.21%*
> Hmean     tput-8    340745.00 (   0.00%)   372501.00 *   9.32%*
> Hmean     tput-9    424343.00 (   0.00%)   413006.00 *  -2.67%*
> Hmean     tput-10   421854.00 (   0.00%)   434261.00 *   2.94%*
> Hmean     tput-11   493256.00 (   0.00%)   485330.00 *  -1.61%*
> Hmean     tput-12   549573.00 (   0.00%)   529959.00 *  -3.57%*
> Hmean     tput-13   593183.00 (   0.00%)   555010.00 *  -6.44%*
> Hmean     tput-14   588252.00 (   0.00%)   599166.00 *   1.86%*
> Hmean     tput-15   623065.00 (   0.00%)   642713.00 *   3.15%*
> Hmean     tput-16   703924.00 (   0.00%)   660758.00 *  -6.13%*
> Hmean     tput-17   666023.00 (   0.00%)   697675.00 *   4.75%*
> Hmean     tput-18   761502.00 (   0.00%)   758360.00 *  -0.41%*
> Hmean     tput-19   796088.00 (   0.00%)   798368.00 *   0.29%*
> Hmean     tput-20   733564.00 (   0.00%)   823086.00 *  12.20%*
> Hmean     tput-21   840980.00 (   0.00%)   856711.00 *   1.87%*
> Hmean     tput-22   804285.00 (   0.00%)   872238.00 *   8.45%*
> Hmean     tput-23   795208.00 (   0.00%)   889374.00 *  11.84%*
> Hmean     tput-24   848619.00 (   0.00%)   966783.00 *  13.92%*
> Hmean     tput-25   750848.00 (   0.00%)   903790.00 *  20.37%*
> Hmean     tput-26   780523.00 (   0.00%)   962254.00 *  23.28%*
> Hmean     tput-27  1042245.00 (   0.00%)   991544.00 *  -4.86%*
> Hmean     tput-28  1090580.00 (   0.00%)  1035926.00 *  -5.01%*
> Hmean     tput-29   999483.00 (   0.00%)  1082948.00 *   8.35%*
> Hmean     tput-30  1098663.00 (   0.00%)  1113427.00 *   1.34%*
> Hmean     tput-31  1125671.00 (   0.00%)  1134175.00 *   0.76%*
> Hmean     tput-32   968167.00 (   0.00%)  1250286.00 *  29.14%*
> Hmean     tput-33  1077676.00 (   0.00%)  1060893.00 *  -1.56%*
> Hmean     tput-34  1090538.00 (   0.00%)  1090933.00 *   0.04%*
> Hmean     tput-35   967058.00 (   0.00%)  1107421.00 *  14.51%*
> Hmean     tput-36  1051745.00 (   0.00%)  1210663.00 *  15.11%*
> Hmean     tput-37  1019465.00 (   0.00%)  1351446.00 *  32.56%*
> Hmean     tput-38  1083102.00 (   0.00%)  1064541.00 *  -1.71%*
> Hmean     tput-39  1232990.00 (   0.00%)  1303623.00 *   5.73%*
> Hmean     tput-40  1175542.00 (   0.00%)  1340943.00 *  14.07%*
> Hmean     tput-41  1127826.00 (   0.00%)  1339492.00 *  18.77%*
> Hmean     tput-42  1198313.00 (   0.00%)  1411023.00 *  17.75%*
> Hmean     tput-43  1163733.00 (   0.00%)  1228253.00 *   5.54%*
> Hmean     tput-44  1305562.00 (   0.00%)  1357886.00 *   4.01%*
> Hmean     tput-45  1326752.00 (   0.00%)  1406061.00 *   5.98%*
> Hmean     tput-46  1339424.00 (   0.00%)  1418451.00 *   5.90%*
> Hmean     tput-47  1415057.00 (   0.00%)  1381570.00 *  -2.37%*
> Hmean     tput-48  1392003.00 (   0.00%)  1421167.00 *   2.10%*
> Hmean     tput-49  1408374.00 (   0.00%)  1418659.00 *   0.73%*
> Hmean     tput-50  1359822.00 (   0.00%)  1391070.00 *   2.30%*
> Hmean     tput-51  1414246.00 (   0.00%)  1392679.00 *  -1.52%*
> Hmean     tput-52  1432352.00 (   0.00%)  1354020.00 *  -5.47%*
> Hmean     tput-53  1387563.00 (   0.00%)  1409563.00 *   1.59%*
> Hmean     tput-54  1406420.00 (   0.00%)  1388711.00 *  -1.26%*
> Hmean     tput-55  1438804.00 (   0.00%)  1387472.00 *  -3.57%*
> Hmean     tput-56  1399465.00 (   0.00%)  1400296.00 *   0.06%*
> Hmean     tput-57  1428132.00 (   0.00%)  1396399.00 *  -2.22%*
> Hmean     tput-58  1432385.00 (   0.00%)  1386253.00 *  -3.22%*
> Hmean     tput-59  1421612.00 (   0.00%)  1371416.00 *  -3.53%*
> Hmean     tput-60  1429423.00 (   0.00%)  1389412.00 *  -2.80%*
> Hmean     tput-61  1396230.00 (   0.00%)  1351122.00 *  -3.23%*
> Hmean     tput-62  1418396.00 (   0.00%)  1383098.00 *  -2.49%*
> Hmean     tput-63  1409918.00 (   0.00%)  1374662.00 *  -2.50%*
> Hmean     tput-64  1410236.00 (   0.00%)  1376216.00 *  -2.41%*
> Hmean     tput-65  1396405.00 (   0.00%)  1364418.00 *  -2.29%*
> Hmean     tput-66  1395975.00 (   0.00%)  1357326.00 *  -2.77%*
> Hmean     tput-67  1392986.00 (   0.00%)  1349642.00 *  -3.11%*
> Hmean     tput-68  1386541.00 (   0.00%)  1343261.00 *  -3.12%*
> Hmean     tput-69  1374407.00 (   0.00%)  1342588.00 *  -2.32%*
> Hmean     tput-70  1377513.00 (   0.00%)  1334654.00 *  -3.11%*
> Hmean     tput-71  1369319.00 (   0.00%)  1334952.00 *  -2.51%*
> Hmean     tput-72  1354635.00 (   0.00%)  1329005.00 *  -1.89%*
> Hmean     tput-73  1350933.00 (   0.00%)  1318942.00 *  -2.37%*
> Hmean     tput-74  1351714.00 (   0.00%)  1316347.00 *  -2.62%*
> Hmean     tput-75  1352198.00 (   0.00%)  1309974.00 *  -3.12%*
> Hmean     tput-76  1349490.00 (   0.00%)  1286064.00 *  -4.70%*
> Hmean     tput-77  1336131.00 (   0.00%)  1303684.00 *  -2.43%*
> Hmean     tput-78  1308896.00 (   0.00%)  1271024.00 *  -2.89%*
> Hmean     tput-79  1326703.00 (   0.00%)  1290862.00 *  -2.70%*
> Hmean     tput-80  1336199.00 (   0.00%)  1291629.00 *  -3.34%*
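>
> To make the table above easier to read: each Hmean row is the harmonic
> mean of the per-iteration throughput for that client count, and the
> right-hand percentage is the relative change against the vanilla
> kernel. A small Python sketch of those two calculations (the sample
> iteration values are made up; only the tput-1 pair is from the table):
>
> from statistics import harmonic_mean
>
> def relative_change(baseline, patched):
>     # percentage printed in the rightmost column
>     return (patched - baseline) / baseline * 100.0
>
> print(round(relative_change(46425.00, 43394.00), 2))   # -6.53 (tput-1)
>
> iterations = [46100.0, 46500.0, 46700.0]    # hypothetical samples
> print(round(harmonic_mean(iterations), 2))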
>
> Note that as utilisation reaches roughly half the CPUs, performance is
> much better, but it then degrades at higher thread counts. Part of this
> is that NUMA balancing activity is higher with the patch once roughly a
> quarter of the CPUs are in use.
>
> NAS is all over the map depending on the compute kernel and the degree
> of parallelisation. Sometimes it is much faster depending on the
> machine, other times slower. Generally, the level of NUMA balancing
> activity is variable -- sometimes better, sometimes worse -- and
> locality is generally lower (which is not a solid conclusion as it's
> usually correlated with differences in hinting faults).
>
> There are counter-examples. On EPYC 2 running hackbench, locality and
> scan activity are much worse with the patch. With hackbench using
> processes and sockets, it is about 8% slower for small numbers of
> groups and 28% slower for larger numbers of groups. On the same
> workload, NUMA activity is much higher and locality drops from 50% to
> 1.5%. Bizarrely, with pipes the pattern is reversed and the patch is
> much better.
>
> For something like dbench, it depends on the machine. All but one
> machine showed an improvement. On EPYC 2, it's much worse.
>
> > The biggest improvements are for the NAS and SPECjvm2008 benchmarks
> > (typically between 20-50%). SPECjbb2005 also shows improvements,
> > around 10%. This is again on NUMA servers at low utilization. You can
> > find snapshots of some results here:
> > https://drive.google.com/drive/folders/1k3Gb-vlI7UjPZZcLkoL2W2VB_zqxIJ3_?usp=sharing
> >
> > @Mel - could you please test the proposed patch? I know you have good
> > coverage of the inter-process communication benchmarks, which may
> > show different behavior than the benchmarks we use. I hope that fewer
> > thread migrations might also have positive effects on these tests.
> > Please give it a try.
> >
>
> Basically, I see a mixed bag depending on the workload, the machine and
> the overall level of utilisation. Sometimes it's better (sometimes much
> better), other times it is worse (sometimes much worse). Given that
> there isn't a universally good decision in this area and that more
> people seem to prefer the patch, it may be best to keep the LB
> decisions consistent and revisit imbalance handling once the load
> balancer code changes settle down, so
>
> Acked-by: Mel Gorman <mgor...@techsingularity.net>
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka
