On Wed, Apr 5, 2017 at 12:23 PM, Justin Chang <[email protected]> wrote:
> I simply ran these KNL simulations in flat mode with the following options:
>
> srun -n 64 -c 4 --cpu_bind=cores numactl -p 1 ./ex48 ....
>
> Basically I told it that MCDRAM usage in NUMA domain 1 is preferred. I
> followed the last example:
> http://www.nersc.gov/users/computational-systems/cori/configuration/knl-processor-modes/
>

Right. I think, from the prior discussion, that -m 1 causes the run to fail
if you spill out of MCDRAM. I think that is usually what we want since it
makes things easier to interpret and running MKL from DRAM is like towing
your McLaren with your Toyota.

   Matt

> On Wed, Apr 5, 2017 at 12:00 PM, Matthew Knepley <[email protected]> wrote:
>
>> On Wed, Apr 5, 2017 at 11:54 AM, Zhang, Hong <[email protected]> wrote:
>>
>>> > On Apr 5, 2017, at 10:53 AM, Jed Brown <[email protected]> wrote:
>>> >
>>> > "Zhang, Hong" <[email protected]> writes:
>>> >
>>> >> On Apr 4, 2017, at 10:45 PM, Justin Chang <[email protected]> wrote:
>>> >>
>>> >> So I tried the following options:
>>> >>
>>> >> -M 40
>>> >> -N 40
>>> >> -P 5
>>> >> -da_refine 1/2/3/4
>>> >> -log_view
>>> >> -mg_coarse_pc_type gamg
>>> >> -mg_levels_0_pc_type gamg
>>> >> -mg_levels_1_sub_pc_type cholesky
>>> >> -pc_type mg
>>> >> -thi_mat_type baij
>>> >>
>>> >> Performance improved dramatically. However, Haswell still beats out
>>> >> KNL but only by a little. Now it seems like MatSOR is taking some time
>>> >> (though I can't really judge whether it's significant or not). Attached
>>> >> are the log files.
>>> >>
>>> >> MatSOR takes only 3% of the total time. Most of the time is spent on
>>> >> PCSetUp (~30%) and PCApply (~11%).
>>> >
>>> > I don't see any of your conclusions in the actual data, unless you only
>>> > looked at the smallest size that Justin tested. For example, from the
>>> > largest problem size in Justin's logs:
>>>
>>> My mistake. I did not see the results for the large problem sizes.
>>> I was talking about the data for the smallest case.
>>>
>>> Now I am very surprised by the performance of MatSOR:
>>>
>>> -da_refine 1 ~2x slower on KNL
>>> -da_refine 2 ~2x faster on KNL
>>> -da_refine 3 ~2x faster on KNL
>>> -da_refine 4 almost the same
>>>
>>> KNL
>>>
>>> -da_refine 1  MatSOR  1185 1.0 2.8965e-01 1.1 7.01e+07 1.0 0.0e+00 0.0e+00 0.0e+00   3  41   0   0   0   3  41   0   0   0 15231
>>> -da_refine 2  MatSOR  1556 1.0 1.6883e+00 1.0 5.82e+08 1.0 0.0e+00 0.0e+00 0.0e+00  11  44   0   0   0  11  44   0   0   0 22019
>>> -da_refine 3  MatSOR  2240 1.0 1.4959e+01 1.0 5.51e+09 1.0 0.0e+00 0.0e+00 0.0e+00  22  45   0   0   0  22  45   0   0   0 23571
>>> -da_refine 4  MatSOR  2688 1.0 2.3942e+02 1.1 4.47e+10 1.0 0.0e+00 0.0e+00 0.0e+00  36  45   0   0   0  36  45   0   0   0 11946
>>>
>>> Haswell
>>>
>>> -da_refine 1  MatSOR  1167 1.0 1.4839e-01 1.1 1.42e+08 1.0 0.0e+00 0.0e+00 0.0e+00   3  42   0   0   0   3  42   0   0   0 30450
>>> -da_refine 2  MatSOR  1532 1.0 2.9772e+00 1.0 1.17e+09 1.0 0.0e+00 0.0e+00 0.0e+00  28  44   0   0   0  28  44   0   0   0 12539
>>> -da_refine 3  MatSOR  1915 1.0 2.7142e+01 1.1 9.51e+09 1.0 0.0e+00 0.0e+00 0.0e+00  45  45   0   0   0  45  45   0   0   0 11216
>>> -da_refine 4  MatSOR  2262 1.0 2.2116e+02 1.1 7.56e+10 1.0 0.0e+00 0.0e+00 0.0e+00  48  45   0   0   0  48  45   0   0   0 10936
>>
>> SOR should track memory bandwidth, so it seems to me either
>>
>> a) We fell out of MCDRAM
>>
>> or
>>
>> b) We saturated the KNL node, but not the Haswell configuration
>>
>> I think these are all runs with identical parallelism, so it's not b).
>> Justin, did you tell it to fall back to DRAM, or fail?
>>
>> Thanks,
>>
>> Matt
>>
>>> Hong (Mr.)
>>>
>>> > KNL:
>>> > MatSOR            2688 1.0 2.3942e+02 1.1 4.47e+10 1.0 0.0e+00 0.0e+00 0.0e+00  36  45   0   0   0  36  45   0   0   0 11946
>>> > KSPSolve             8 1.0 4.3837e+02 1.0 9.87e+10 1.0 1.5e+06 8.8e+03 5.0e+03  68  99  98  61  98  68  99  98  61  98 14409
>>> > SNESSolve            1 1.0 6.1583e+02 1.0 9.95e+10 1.0 1.6e+06 1.4e+04 5.1e+03  96 100 100 100  99  96 100 100 100  99 10338
>>> > SNESFunctionEval     9 1.0 3.8730e+01 1.0 0.00e+00 0.0 9.2e+03 3.2e+04 0.0e+00   6   0   1   1   0   6   0   1   1   0     0
>>> > SNESJacobianEval    40 1.0 1.5628e+02 1.0 0.00e+00 0.0 4.4e+04 2.5e+05 1.4e+02  24   0   3  49   3  24   0   3  49   3     0
>>> > PCSetUp             16 1.0 3.4525e+01 1.0 6.52e+07 1.0 2.8e+05 1.0e+04 3.8e+03   5   0  18  13  74   5   0  18  13  74   119
>>> > PCSetUpOnBlocks     60 1.0 9.5716e-01 1.1 1.41e+05 0.0 0.0e+00 0.0e+00 0.0e+00   0   0   0   0   0   0   0   0   0   0     0
>>> > PCApply             60 1.0 3.8705e+02 1.0 9.32e+10 1.0 1.2e+06 8.0e+03 1.1e+03  60  94  79  45  21  60  94  79  45  21 15407
>>> > MatMult           2860 1.0 1.4578e+02 1.1 4.92e+10 1.0 1.2e+06 8.8e+03 0.0e+00  21  49  77  48   0  21  49  77  48   0 21579
>>> >
>>> > Haswell:
>>> > MatSOR            2262 1.0 2.2116e+02 1.1 7.56e+10 1.0 0.0e+00 0.0e+00 0.0e+00  48  45   0   0   0  48  45   0   0   0 10936
>>> > KSPSolve             7 1.0 3.5937e+02 1.0 1.67e+11 1.0 6.7e+05 1.3e+04 4.5e+03  81  99  98  60  98  81  99  98  60  98 14828
>>> > SNESSolve            1 1.0 4.3749e+02 1.0 1.68e+11 1.0 6.8e+05 2.1e+04 4.5e+03  99 100 100 100  99  99 100 100 100  99 12280
>>> > SNESFunctionEval     8 1.0 1.5460e+01 1.0 0.00e+00 0.0 4.1e+03 4.7e+04 0.0e+00   3   0   1   1   0   3   0   1   1   0     0
>>> > SNESJacobianEval    35 1.0 6.8994e+01 1.0 0.00e+00 0.0 1.9e+04 3.8e+05 1.3e+02  16   0   3  50   3  16   0   3  50   3     0
>>> > PCSetUp             14 1.0 1.0860e+01 1.0 1.15e+08 1.0 1.3e+05 1.4e+04 3.4e+03   2   0  19  13  74   2   0  19  13  74   335
>>> > PCSetUpOnBlocks     50 1.0 4.5601e-02 1.6 2.89e+05 0.0 0.0e+00 0.0e+00 0.0e+00   0   0   0   0   0   0   0   0   0   0     6
>>> > PCApply             50 1.0 3.3545e+02 1.0 1.57e+11 1.0 5.3e+05 1.2e+04 9.7e+02  75  94  77  44  21  75  94  77  44  21 15017
>>> > MatMult           2410 1.0 1.2050e+02 1.1 8.28e+10 1.0 5.1e+05 1.3e+04 0.0e+00  27  49  75  46   0  27  49  75  46   0 21983
>>> >
>>> >> If ex48 has SSE2 intrinsics, does that mean Haswell would almost
>>> >> always be better?
>>> >>
>>> >> The Jacobian evaluation (which has SSE2 intrinsics) on Haswell is
>>> >> about two times as fast as on KNL, but it eats only 3%-4% of the
>>> >> total time.
>>> >
>>> > SNESJacobianEval alone accounts for 90 seconds of the 180 second
>>> > difference between KNL and Haswell.
>>> >
>>> >> According to your logs, the compute-intensive kernels such as MatMult,
>>> >> MatSOR, PCApply run faster (~2X) on Haswell.
>>> >
>>> > They run almost the same speed.
>>> >
>>> >> But since the setup time dominates in this test,
>>> >
>>> > It doesn't dominate on the larger sizes.
>>> >
>>> >> Haswell would not show much benefit. If you increase the problem size,
>>> >> it could be expected that the performance gap would also increase.
>>> >
>>> > Backwards. Haswell is great for low latency on small problem sizes
>>> > while KNL offers higher theoretical throughput (often not realized due
>>> > to lack of vectorization) for sufficiently large problem sizes
>>> > (especially if they don't fit in Haswell L3 cache but do fit in
>>> > MCDRAM).
>>>
>>
>

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
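
The timing comparisons in this thread can be checked with a few lines of
arithmetic. The following is only a minimal Python sketch: the times are
copied from the MatSOR, SNESSolve and SNESJacobianEval rows of the logs
quoted above, while the dictionary and function names are illustrative
assumptions and not part of PETSc.

```python
# Wall-clock times in seconds, copied from the -log_view rows quoted in
# this thread. Every entry is a (KNL, Haswell) pair. The names used here
# are illustrative only; nothing below is a PETSc API.
MATSOR_TIME = {
    1: (2.8965e-01, 1.4839e-01),  # -da_refine 1
    2: (1.6883e+00, 2.9772e+00),  # -da_refine 2
    3: (1.4959e+01, 2.7142e+01),  # -da_refine 3
    4: (2.3942e+02, 2.2116e+02),  # -da_refine 4
}

# Largest problem size (-da_refine 4).
LARGEST_CASE = {
    "SNESSolve": (6.1583e+02, 4.3749e+02),
    "SNESJacobianEval": (1.5628e+02, 6.8994e+01),
}


def knl_to_haswell_ratio(times):
    """Return KNL time divided by Haswell time (> 1 means KNL is slower)."""
    knl, haswell = times
    return knl / haswell


def knl_minus_haswell(times):
    """Return how many more seconds the event took on KNL than on Haswell."""
    knl, haswell = times
    return knl - haswell


if __name__ == "__main__":
    for level, times in sorted(MATSOR_TIME.items()):
        ratio = knl_to_haswell_ratio(times)
        print(f"-da_refine {level}: MatSOR KNL/Haswell time = {ratio:.2f}")

    total_gap = knl_minus_haswell(LARGEST_CASE["SNESSolve"])
    jacobian_gap = knl_minus_haswell(LARGEST_CASE["SNESJacobianEval"])
    print(f"SNESSolve gap:        {total_gap:.1f} s")
    print(f"SNESJacobianEval gap: {jacobian_gap:.1f} s "
          f"({100 * jacobian_gap / total_gap:.0f}% of the total gap)")
```

With these inputs the Jacobian evaluation accounts for about 87 s of the
roughly 178 s SNESSolve gap, which matches the "90 seconds of the 180 second
difference" figure above, and MatSOR at -da_refine 4 takes about 8% longer
on KNL than on Haswell.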
