On Wed, Aug 14, 2019 at 2:19 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
> Mark,
>
> This is great, we can study these for months.
>
> 1) At the top of the plots you say SNES but that can't be right, there is
> no way it is getting such speed ups for the entire SNES solve since the
> Jacobians are CPUs and take much of the time. Do you mean the KSP part of
> the SNES solve?
>

It uses KSPONLY. And solve times are KSPSolve with KSPSetUp called before.

>
> 2) For the case of a bit more than 1000 processes the speedup with GPUs is
> fantastic, more than 6?
>

I did not see that one, but it is plausible and there is some noise in this data. The largest solve had a speedup of about 4x.

>
> 3) People will ask about runs using all 48 CPUs, since they are there it
> is a little unfair to only compare 24 CPUs with the GPUs. Presumably due to
> memory bandwidth limits 48 won't be much better than 24 but you need it in
> your back pocket for completeness.
>

Ah, good point. I just cut and paste but I should run a little test and see where it saturates.

> 4) From the table
>
> KSPSolve 1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02 8.3e+01 0 0 4 0 3 10 57 97 52 81 1911 3494 114 3.06e-01 129 1.38e-01 84
> PCApply 17 1.0 4.5053e-02 1.0 9.22e+06 8.5 1.1e+04 5.6e+02 3.4e+01 0 0 3 0 1 8 49 81 44 33 1968 4007 98 2.58e-01 113 1.19e-01 81
>
> only 84 percent of the total flops in the KSPSolve are on the GPU and only
> 81 for the PCApply() where are the rest? MatMult() etc are doing 100
> percent on the GPU, MatSolve on the coarsest level should be tiny and not
> taking 19 percent of the flops?
>

That is the smallest test with 3465 equations on 24 cores. The R and P and coarse grid are on the CPU. Look at larger tests.

> Thanks
>
> Barry
>
>
> > On Aug 14, 2019, at 12:45 PM, Mark Adams <mfad...@lbl.gov> wrote:
> >
> > FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU
> > speedup with 98K dof/proc (3D Q2 elasticity).
> >
> > This is weak scaling of a solve. There is growth in iteration count
> > folded in here. I should put rtol in the title and/or run a fixed number of
> > iterations and make it clear in the title.
> >
> > Comments welcome.
> >
> > <out_cpu_012288><out_cpu_001536><out_cuda_012288><out_cpu_000024><out_cpu_000192><out_cuda_001536><out_cuda_000192><out_cuda_000024><weak_scaling_cpu.png><weak_scaling_cuda.png>
>
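
A minimal sketch for checking the GPU flop share discussed in item 4. It rejoins the two wrapped log rows quoted there and reads the last field of each row as the GPU flop percentage, which is how the 84 and 81 figures are read in that item. The helper name `gpu_flop_percent` is hypothetical, and the assumption that the last field is always the GPU percentage is taken only from those two rows.

```python
# Rows copied from the quoted table in item 4, with the wrapped lines rejoined.
ROWS = """\
KSPSolve 1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02 8.3e+01 0 0 4 0 3 10 57 97 52 81 1911 3494 114 3.06e-01 129 1.38e-01 84
PCApply 17 1.0 4.5053e-02 1.0 9.22e+06 8.5 1.1e+04 5.6e+02 3.4e+01 0 0 3 0 1 8 49 81 44 33 1968 4007 98 2.58e-01 113 1.19e-01 81
"""


def gpu_flop_percent(rows):
    """Map event name -> (GPU flop %, remaining %).

    Assumption: the last whitespace-separated field of each row is the
    percentage of that event's flops done on the GPU.
    """
    result = {}
    for line in rows.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        gpu = int(fields[-1])
        result[fields[0]] = (gpu, 100 - gpu)
    return result


if __name__ == "__main__":
    for event, (gpu, rest) in gpu_flop_percent(ROWS).items():
        print(f"{event}: {gpu}% of flops on GPU, {rest}% elsewhere")
```

For these two rows the remainder is 16 percent for KSPSolve and 19 percent for PCApply, the latter being the 19 percent asked about in item 4.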