Mark,

    This is great; we can study these for months.

1) At the top of the plots you say SNES, but that can't be right; there is no 
way it is getting such speedups for the entire SNES solve, since the Jacobians 
are computed on the CPUs and take much of the time. Do you mean the KSP part of 
the SNES solve?

2) For the case of a bit more than 1000 processes, the speedup with GPUs is 
fantastic: more than 6x?

3) People will ask about runs using all 48 CPUs; since they are there, it is a 
little unfair to compare only 24 CPUs with the GPUs. Presumably, due to memory 
bandwidth limits, 48 won't be much better than 24, but you need that data in 
your back pocket for completeness.

4) From the table

KSPSolve               1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02 8.3e+01  0  0  4  0  3  10 57 97 52 81  1911    3494    114 3.06e-01  129 1.38e-01 84
PCApply               17 1.0 4.5053e-02 1.0 9.22e+06 8.5 1.1e+04 5.6e+02 3.4e+01  0  0  3  0  1   8 49 81 44 33  1968    4007     98 2.58e-01  113 1.19e-01 81

only 84 percent of the total flops in the KSPSolve are on the GPU, and only 81 
percent for the PCApply(). Where are the rest? MatMult() etc. are doing 100 
percent on the GPU, and the MatSolve on the coarsest level should be tiny, not 
accounting for 19 percent of the flops?
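(For reference, the GPU flop percentage is the trailing column of those rows; a throwaway parser, assuming the column order of the -log_view rows quoted above, just to show where the 84 comes from:)

```python
def parse_gpu_flop_pct(row: str) -> tuple:
    """Pull the event name (first token) and the trailing 'GPU %F' column
    (last token) from a -log_view row, per the layout quoted above."""
    toks = row.split()
    return toks[0], float(toks[-1])

# The KSPSolve row from the table, rejoined onto one line.
row = ("KSPSolve 1 1.0 5.4191e-02 1.0 9.35e+06 7.3 1.3e+04 5.6e+02 "
       "8.3e+01 0 0 4 0 3 10 57 97 52 81 1911 3494 114 3.06e-01 129 "
       "1.38e-01 84")
name, pct = parse_gpu_flop_pct(row)
print(name, pct)  # KSPSolve 84.0 -> 16 percent of the flops ran on the CPU
```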

  Thanks

   Barry


> On Aug 14, 2019, at 12:45 PM, Mark Adams <mfad...@lbl.gov> wrote:
> 
> FYI, Here is some scaling data of GAMG on SUMMIT. Getting about 4x GPU 
> speedup with 98K dof/proc (3D Q2 elasticity).
> 
> This is weak scaling of a solve. There is growth in iteration count folded in 
> here. I should put rtol in the title and/or run a fixed number of iterations 
> and make it clear in the title.
> 
> Comments welcome.
> [Attachments: out_cpu_012288, out_cpu_001536, out_cuda_012288, out_cpu_000024, out_cpu_000192, out_cuda_001536, out_cuda_000192, out_cuda_000024, weak_scaling_cpu.png, weak_scaling_cuda.png]
