Thanks, Jed,

This is fascinating. I will check whether I can get this kind of improvement as well.

Thanks,

Fande

On Fri, Jun 12, 2020 at 7:43 PM Jed Brown <[email protected]> wrote:

> Jed Brown <[email protected]> writes:
>
> > Fande Kong <[email protected]> writes:
> >
> >>> There's a lot more to AMG setup than memory bandwidth (architecture
> >>> matters a lot, even between different generation CPUs).
> >>
> >>
> >> Could you elaborate a bit more on this? From my understanding, one big part
> >> of AMG SetUp is RAP that should be pretty much bandwidth.
> >
> > The RAP isn't "pretty much bandwidth". See below for some
> > Skylake/POWER9/EPYC results and analysis (copied from an off-list
> > thread). I'll leave in some other bandwidth comments that may or may
> > not be relevant to you. The short story is that Skylake and EPYC are
> > both much better than POWER9 at MatPtAP despite POWER9 having similar
> > bandwidth as EPYC and thus being significantly faster than Skylake for
> > MatMult/smoothing.
> >
> >
> > Jed Brown <[email protected]> writes:
> >
> >> I'm attaching a log from my machine (Noether), which is 2-socket EPYC
> >> 7452 (32 cores each). Each socket has 8xDDR4-3200 and 128 MB of L3
> >> cache. This is the same node architecture as the new BER/E3SM machine
> >> being installed at Argonne (though that one will probably have
> >> higher-clocked and/or more cores per socket). Note that these CPUs are
> >> about $2k each while Skylake 8180 are about $10k.
> >>
> >> Some excerpts/comments below.
> >>
> >
> > [...]
> >
> > In addition to the notes below, I'd like to call out how important
> > streaming stores are on EPYC.
> > With vanilla code or _mm256_store_pd, we
> > get the following performance
> >
> > $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> > Copy 162609.2392 Scale 159119.8259 Add 174687.6250 Triad 175840.1587
> >
> > but replacing _mm256_store_pd with _mm256_stream_pd gives this
> >
> > $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> > Copy 259951.9936 Scale 259381.0589 Add 250216.3389 Triad 249292.9701
>
> I turned on NPS4 (a BIOS setting that creates a NUMA node for each pair
> of memory channels) and get a modest performance boost.
>
> $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> Copy 289645.3776 Scale 289186.2783 Add 273220.0133 Triad 272911.2263
>
> On this architecture, best performance comes from one process per 4-core
> CCX (shared L3).
>
> $ mpiexec -n 16 --bind-to core --map-by core:4 src/benchmarks/streams/MPIVersion
> Copy 300704.8859 Scale 304556.3380 Add 295970.1132 Triad 298891.3821
>
> > This is just preposterously huge, but very repeatable using gcc and
> > clang, and inspecting the assembly. This suggests that it would be
> > useful for vector kernels to have streaming and non-streaming variants.
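
One way to read the first two STREAM results above is through write-allocate traffic: a regular store to cacheable memory normally reads the destination cache line before overwriting it, while a streaming (non-temporal) store skips that read. The sketch below compares the measured ratios with that simple model. The model and the names `modeled_speedup` and `measured_speedup` are illustrative assumptions and are not part of the thread; only the rates are copied from it.

```python
# STREAM rates (MB/s) copied from the 64-rank runs quoted above.
STORE = {"Copy": 162609.2392, "Scale": 159119.8259,
         "Add": 174687.6250, "Triad": 175840.1587}    # _mm256_store_pd
STREAM = {"Copy": 259951.9936, "Scale": 259381.0589,
          "Add": 250216.3389, "Triad": 249292.9701}   # _mm256_stream_pd

# Arrays that STREAM counts per kernel as (reads, writes).
COUNTED = {"Copy": (1, 1), "Scale": (1, 1), "Add": (2, 1), "Triad": (2, 1)}


def modeled_speedup(reads, writes):
    """Assumed model: each regular store also reads its destination line,
    so real traffic is reads + 2*writes while STREAM counts reads + writes."""
    return (reads + 2 * writes) / (reads + writes)


def measured_speedup(kernel):
    """Reported streaming-store rate divided by the regular-store rate."""
    return STREAM[kernel] / STORE[kernel]


for kernel, (reads, writes) in COUNTED.items():
    print(f"{kernel:5s} measured {measured_speedup(kernel):.2f}x  "
          f"modeled {modeled_speedup(reads, writes):.2f}x")
```

Under this assumption the model gives 1.5x for Copy/Scale and about 1.33x for Add/Triad; the measured ratios are somewhat higher (roughly 1.6x and 1.4x), so write-allocate traffic likely explains most but not all of the gap.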
> > That is, if I drop the vector length by a factor of 20 (so the working set is 2.3
> > MB/core instead of 46 MB in the default version), then we get 2.4 TB/s
> > Triad with _mm256_store_pd:
> >
> > $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> > Copy 2159915.7058 Scale 2212671.7087 Add 2414758.2757 Triad 2402671.1178
> >
> > and a thoroughly embarrassing 353 GB/s with _mm256_stream_pd:
> >
> > $ mpiexec -n 64 --bind-to core --map-by core:1 src/benchmarks/streams/MPIVersion
> > Copy 235934.6653 Scale 237446.8507 Add 352805.7288 Triad 352992.9692
> >
> >
> > I don't know a good way to automatically determine whether to expect the
> > memory to be in cache, but we could make it a global (or per-object)
> > run-time selection.
> >
> >> Jed Brown <[email protected]> writes:
> >>
> >>> "Smith, Barry F." <[email protected]> writes:
> >>>
> >>>> Thanks. The PowerPC is pretty crappy compared to Skylake.
> >>>
> >>> Compare the MGSmooth times. The POWER9 is faster than the Skylake
> >>> because it has more memory bandwidth.
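
On the run-time selection between streaming and non-streaming kernels mentioned above, here is a minimal sketch of what such a switch could look like. It assumes a naive rule (use streaming stores only when the per-core working set exceeds the per-core share of L3) plus an explicit override standing in for the "global (or per-object)" setting. The function names and the rule itself are hypothetical; the thread does not propose a specific heuristic, and this is not PETSc API.

```python
MB = 2**20

# Figures taken from the thread: 128 MB of L3 and 32 cores per socket.
L3_BYTES_PER_SOCKET = 128 * MB
CORES_PER_SOCKET = 32


def l3_share_per_core(l3_bytes=L3_BYTES_PER_SOCKET, cores=CORES_PER_SOCKET):
    """Per-core share of L3 when every core is busy (4 MB on this machine)."""
    return l3_bytes // cores


def use_streaming_stores(working_set_bytes, cache_bytes, override=None):
    """Hypothetical selection rule.

    override: True/False forces the choice (the global or per-object
              run-time option); None falls back to the size heuristic.
    """
    if override is not None:
        return override
    # Data that cannot stay in cache gains from skipping write-allocate;
    # data that fits in cache is better served by regular stores.
    return working_set_bytes > cache_bytes


cache = l3_share_per_core()
print(use_streaming_stores(46 * MB, cache))         # default STREAM size
print(use_streaming_stores(int(2.3 * MB), cache))   # reduced vector length
```

With the two working-set sizes quoted in the thread, this rule picks streaming stores for the 46 MB case and regular stores for the 2.3 MB case, which matches the direction of the measurements. A real implementation would also need to know whether the data was recently touched, which is the part the thread says is hard to determine automatically.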
> >>>
> >>> $ rg 'MGInterp Level 4|MGSmooth Level 4' ex56*
> >>> ex56-JLSE-skylake-56ranks-converged.txt
> >>> 254:MGSmooth Level 4 68 1.0 1.8808e+00 1.2 7.93e+08 1.3 3.6e+04 1.9e+04 3.4e+01 8 29 10 16 3 62 60 18 54 25 22391
> >>> 256:MGInterp Level 4 68 1.0 4.0043e-01 1.8 1.45e+08 1.3 2.2e+04 2.5e+03 0.0e+00 1 5 6 1 0 9 11 11 4 0 19109
> >>>
> >>> ex56-summit-cpu-36ranks-converged.txt
> >>> 265:MGSmooth Level 4 68 1.0 1.1531e+00 1.1 1.22e+09 1.2 2.3e+04 2.6e+04 3.4e+01 3 29 7 13 3 61 60 12 54 25 36519 0 0 0.00e+00 0 0.00e+00 0
> >>> 267:MGInterp Level 4 68 1.0 2.0749e-01 1.1 2.23e+08 1.2 1.4e+04 3.4e+03 0.0e+00 0 5 4 1 0 11 11 7 4 0 36925 0 0 0.00e+00 0 0.00e+00 0
> >>>
> >>> ex56-summit-gpu-24ranks-converged.txt
> >>> 275:MGSmooth Level 4 68 1.0 1.4499e-01 1.2 1.85e+09 1.2 1.0e+04 5.3e+04 3.4e+01 0 29 7 13 3 26 60 12 55 25 299156 940881 115 2.46e+01 116 8.64e+01 100
> >>> 277:MGInterp Level 4 68 1.0 1.7674e-01 1.0 3.23e+08 1.2 6.1e+03 6.7e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 42715 621223 36 2.98e+01 136 3.95e+00 100
> >>>
> >>> ex56-summit-gpu-36ranks-converged.txt
> >>> 275:MGSmooth Level 4 68 1.0 1.4877e-01 1.2 1.25e+09 1.2 2.3e+04 2.6e+04 3.4e+01 0 29 7 13 3 19 60 12 54 25 291548 719522 115 1.83e+01 116 5.80e+01 100
> >>> 277:MGInterp Level 4 68 1.0 2.4317e-01 1.0 2.20e+08 1.2 1.4e+04 3.4e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 31062 586044 36 1.99e+01 136 2.82e+00 100
> >>
> >> 258:MGSmooth Level 4 68 1.0 9.6950e-01 1.3 6.15e+08 1.3 4.0e+04 1.4e+04 2.0e+00 6 28 10 15 0 59 59 18 54 25 39423
> >> 260:MGInterp Level 4 68 1.0 2.5707e-01 1.5 1.23e+08 1.2 2.7e+04 1.9e+03 0.0e+00 1 5 7 1 0 13 12 12 5 0 29294
> >>
> >> Epyc is faster than Power9 is faster than Skylake.
> >>
> >>>
> >>> The Skylake is a lot faster at PtAP. It'd be interesting to better
> >>> understand why. Perhaps it has to do with caching or aggressiveness of
> >>> out-of-order execution.
> >>>
> >>> $ rg 'PtAP' ex56*
> >>> ex56-JLSE-skylake-56ranks-converged.txt
> >>> 164:MatPtAP 4 1.0 1.4214e+00 1.0 3.94e+08 1.5 1.1e+04 7.4e+04 4.4e+01 6 13 3 20 4 8 28 8 39 5 13754
> >>> 165:MatPtAPSymbolic 4 1.0 8.3981e-01 1.0 0.00e+00 0.0 6.5e+03 7.3e+04 2.8e+01 4 0 2 12 2 5 0 5 23 3 0
> >>> 166:MatPtAPNumeric 4 1.0 5.8402e-01 1.0 3.94e+08 1.5 4.5e+03 7.5e+04 1.6e+01 2 13 1 8 1 3 28 3 16 2 33474
> >>>
> >>> ex56-summit-cpu-36ranks-converged.txt
> >>> 164:MatPtAP 4 1.0 3.9077e+00 1.0 5.89e+08 1.4 1.6e+04 7.4e+04 4.4e+01 9 13 5 26 4 11 28 12 46 5 4991 0 0 0.00e+00 0 0.00e+00 0
> >>> 165:MatPtAPSymbolic 4 1.0 1.9525e+00 1.0 0.00e+00 0.0 1.2e+04 7.3e+04 2.8e+01 5 0 4 19 3 5 0 9 34 3 0 0 0 0.00e+00 0 0.00e+00 0
> >>> 166:MatPtAPNumeric 4 1.0 1.9621e+00 1.0 5.89e+08 1.4 4.0e+03 7.5e+04 1.6e+01 5 13 1 7 1 5 28 3 12 2 9940 0 0 0.00e+00 0 0.00e+00 0
> >>>
> >>> ex56-summit-gpu-24ranks-converged.txt
> >>> 167:MatPtAP 4 1.0 5.7210e+00 1.0 8.48e+08 1.3 7.5e+03 1.3e+05 4.4e+01 8 13 5 25 4 11 28 12 46 5 3415 0 16 3.36e+01 4 6.30e-02 0
> >>> 168:MatPtAPSymbolic 4 1.0 2.8717e+00 1.0 0.00e+00 0.0 5.5e+03 1.3e+05 2.8e+01 4 0 4 19 3 5 0 9 34 3 0 0 0 0.00e+00 0 0.00e+00 0
> >>> 169:MatPtAPNumeric 4 1.0 2.8537e+00 1.0 8.48e+08 1.3 2.0e+03 1.3e+05 1.6e+01 4 13 1 7 1 5 28 3 12 2 6846 0 16 3.36e+01 4 6.30e-02 0
> >>>
> >>> ex56-summit-gpu-36ranks-converged.txt
> >>> 167:MatPtAP 4 1.0 4.0340e+00 1.0 5.89e+08 1.4 1.6e+04 7.4e+04 4.4e+01 8 13 5 26 4 11 28 12 46 5 4835 0 16 2.30e+01 4 5.18e-02 0
> >>> 168:MatPtAPSymbolic 4 1.0 2.0355e+00 1.0 0.00e+00 0.0 1.2e+04 7.3e+04 2.8e+01 4 0 4 19 3 5 0 9 34 3 0 0 0 0.00e+00 0 0.00e+00 0
> >>> 169:MatPtAPNumeric 4 1.0 2.0050e+00 1.0 5.89e+08 1.4 4.0e+03 7.5e+04 1.6e+01 4 13 1 7 1 5 28 3 12 2 9728 0 16 2.30e+01 4 5.18e-02 0
> >>
> >> 153:MatPtAPSymbolic 4 1.0 7.6053e-01 1.0 0.00e+00 0.0 7.6e+03 5.8e+04 2.8e+01 5 0 2 12 2 6 0 5 22 3 0
> >> 154:MatPtAPNumeric 4 1.0 6.5172e-01 1.0 3.21e+08 1.4 6.4e+03 4.8e+04 2.4e+01 4 14 2 8 2 5 27 4 16 2 28861
> >>
> >> Epyc similar to Skylake here.
> >>
> >>> I'd really like to compare an EPYC for these operations. I bet it's
> >>> pretty good. (More bandwidth than Skylake, bigger caches, but no
> >>> AVX512.)
> >>>
> >>>> So the biggest consumer is MatPtAP; I guess that should be done first.
> >>>>
> >>>> It would be good to have these results exclude the Jacobian and Function evaluation which really dominate the time and add clutter making it difficult to see the problems with the rest of SNESSolve.
> >>>>
> >>>>
> >>>> Did you notice:
> >>>>
> >>>> MGInterp Level 4 68 1.0 1.7674e-01 1.0 3.23e+08 1.2 6.1e+03 6.7e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 42715 621223 36 2.98e+01 136 3.95e+00 100
> >>>>
> >>>> it is terrible! Well over half of the KSPSolve time is in this one relatively minor routine. All of the interps are terribly slow. Is it related to the transpose multiply or something?
> >>>
> >>> Yes, it's definitely the MatMultTranspose, which must be about 3x more
> >>> expensive than restriction even on the CPU. PCMG/PCGAMG should
> >>> explicitly transpose (unless the user sets an option to aggressively
> >>> minimize memory usage).
> >>>
> >>> $ rg 'MGInterp|MultTrans' ex56*
> >>> ex56-JLSE-skylake-56ranks-converged.txt
> >>> 222:MatMultTranspose 136 1.0 3.5105e-01 3.7 7.91e+07 1.3 2.5e+04 1.3e+03 0.0e+00 1 3 7 1 0 5 6 13 3 0 11755
> >>> 247:MGInterp Level 1 68 1.0 3.3894e-04 2.2 2.35e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 693
> >>> 250:MGInterp Level 2 68 1.0 1.1212e-0278.0 1.17e+06 0.0 1.8e+03 7.7e+02 0.0e+00 0 0 1 0 0 0 0 1 0 0 2172
> >>> 253:MGInterp Level 3 68 1.0 6.7105e-02 5.3 1.23e+07 1.8 2.7e+04 4.2e+02 0.0e+00 0 0 8 0 0 1 1 14 1 0 8594
> >>> 256:MGInterp Level 4 68 1.0 4.0043e-01 1.8 1.45e+08 1.3 2.2e+04 2.5e+03 0.0e+00 1 5 6 1 0 9 11 11 4 0 19109
> >>>
> >>> ex56-summit-cpu-36ranks-converged.txt
> >>> 229:MatMultTranspose 136 1.0 1.4832e-01 1.4 1.21e+08 1.2 1.9e+04 1.5e+03 0.0e+00 0 3 6 1 0 6 6 10 3 0 27842 0 0 0.00e+00 0 0.00e+00 0
> >>> 258:MGInterp Level 1 68 1.0 2.9145e-04 1.5 1.08e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 370 0 0 0.00e+00 0 0.00e+00 0
> >>> 261:MGInterp Level 2 68 1.0 5.7095e-03 1.5 9.16e+05 2.5 2.4e+03 7.1e+02 0.0e+00 0 0 1 0 0 0 0 1 0 0 4093 0 0 0.00e+00 0 0.00e+00 0
> >>> 264:MGInterp Level 3 68 1.0 3.5654e-02 2.8 1.77e+07 1.5 2.3e+04 3.9e+02 0.0e+00 0 0 7 0 0 1 1 12 1 0 16095 0 0 0.00e+00 0 0.00e+00 0
> >>> 267:MGInterp Level 4 68 1.0 2.0749e-01 1.1 2.23e+08 1.2 1.4e+04 3.4e+03 0.0e+00 0 5 4 1 0 11 11 7 4 0 36925 0 0 0.00e+00 0 0.00e+00 0
> >>>
> >>> ex56-summit-gpu-24ranks-converged.txt
> >>> 236:MatMultTranspose 136 1.0 2.1445e-01 1.0 1.72e+08 1.2 9.5e+03 2.6e+03 0.0e+00 0 3 6 1 0 39 6 11 3 0 18719 451131 8 3.11e+01 272 2.19e+00 100
> >>> 268:MGInterp Level 1 68 1.0 4.0388e-03 2.8 1.08e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 27 79 37 5.84e-04 68 6.80e-05 100
> >>> 271:MGInterp Level 2 68 1.0 2.9033e-02 2.9 1.25e+06 1.9 1.6e+03 7.8e+02 0.0e+00 0 0 1 0 0 5 0 2 0 0 812 11539 36 1.14e-01 136 5.41e-02 100
> >>> 274:MGInterp Level 3 68 1.0 4.9503e-02 1.1 2.50e+07 1.4 1.1e+04 6.3e+02 0.0e+00 0 0 7 0 0 9 1 13 1 0 11476 100889 36 2.29e+00 136 3.74e-01 100
> >>> 277:MGInterp Level 4 68 1.0 1.7674e-01 1.0 3.23e+08 1.2 6.1e+03 6.7e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 42715 621223 36 2.98e+01 136 3.95e+00 100
> >>>
> >>> ex56-summit-gpu-36ranks-converged.txt
> >>> 236:MatMultTranspose 136 1.0 2.9692e-01 1.0 1.17e+08 1.2 1.9e+04 1.5e+03 0.0e+00 1 3 6 1 0 40 6 10 3 0 13521 336701 8 2.08e+01 272 1.59e+00 100
> >>> 268:MGInterp Level 1 68 1.0 3.8752e-03 2.5 1.03e+05 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 27 79 37 3.95e-04 68 4.53e-05 100
> >>> 271:MGInterp Level 2 68 1.0 3.5465e-02 2.2 9.12e+05 2.5 2.4e+03 7.1e+02 0.0e+00 0 0 1 0 0 4 0 1 0 0 655 5989 36 8.16e-02 136 4.89e-02 100
> >>> 274:MGInterp Level 3 68 1.0 6.7101e-02 1.1 1.75e+07 1.5 2.3e+04 3.9e+02 0.0e+00 0 0 7 0 0 9 1 12 1 0 8455 56175 36 1.55e+00 136 3.03e-01 100
> >>> 277:MGInterp Level 4 68 1.0 2.4317e-01 1.0 2.20e+08 1.2 1.4e+04 3.4e+03 0.0e+00 0 5 4 1 0 33 11 7 4 0 31062 586044 36 1.99e+01 136 2.82e+00 100
> >>
> >> 223:MatMultTranspose 136 1.0 2.0702e-01 2.9 6.59e+07 1.2 2.7e+04 1.1e+03 0.0e+00 1 3 7 1 0 7 6 12 3 0 19553
> >> 251:MGInterp Level 1 68 1.0 2.8062e-04 1.5 9.79e+04 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 349
> >> 254:MGInterp Level 2 68 1.0 6.2506e-0331.9 9.69e+05 0.0 2.1e+03 6.3e+02 0.0e+00 0 0 1 0 0 0 0 1 0 0 3458
> >> 257:MGInterp Level 3 68 1.0 4.8159e-02 6.5 9.62e+06 1.5 2.5e+04 4.2e+02 0.0e+00 0 0 6 0 0 1 1 11 1 0 11199
> >> 260:MGInterp Level 4 68 1.0 2.5707e-01 1.5 1.23e+08 1.2 2.7e+04 1.9e+03 0.0e+00 1 5 7 1 0 13 12 12 5 0 29294
> >>
> >> Power9 still has an edge here.
>
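
The machine-to-machine comparisons in the quoted logs can be checked by pulling the max-time column out of each event line. The sketch below is a small hand-rolled parser, not a PETSc tool: the regular expression and the names `EVENT_RE` and `event_time` are assumptions made for illustration, and the sample lines are copied from the logs above (the EPYC line comes from the unlabeled output, which the thread attributes to the EPYC machine).

```python
import re

# Optional "NNN:" prefix from rg, then event name, call count, count ratio,
# and the max time in d.dddde+dd form. The exponent is limited to two digits
# because the log sometimes fuses the next column onto it
# (e.g. "1.1212e-0278.0").
EVENT_RE = re.compile(
    r"^(?:\d+:)?(?P<name>.+?)\s+(?P<count>\d+)\s+(?P<ratio>\d+\.\d+)"
    r"\s+(?P<time>\d\.\d+e[+-]\d{2})"
)


def event_time(line):
    """Return (event name, max time in seconds) for one log line."""
    match = EVENT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not an event line: {line!r}")
    return match.group("name"), float(match.group("time"))


LINES = {
    "skylake": "166:MatPtAPNumeric 4 1.0 5.8402e-01 1.0 3.94e+08 1.5 4.5e+03 7.5e+04 1.6e+01",
    "power9": "166:MatPtAPNumeric 4 1.0 1.9621e+00 1.0 5.89e+08 1.4 4.0e+03 7.5e+04 1.6e+01",
    "epyc": "154:MatPtAPNumeric 4 1.0 6.5172e-01 1.0 3.21e+08 1.4 6.4e+03 4.8e+04 2.4e+01",
}

times = {machine: event_time(line)[1] for machine, line in LINES.items()}
for machine, seconds in times.items():
    print(f"{machine:8s} MatPtAPNumeric {seconds:.4f} s "
          f"({seconds / times['skylake']:.2f}x Skylake)")
```

Note that the runs use different rank counts (56, 36, and an unstated count for EPYC), so these ratios compare whole-node results rather than per-core speed.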
