Interesting. Is this with all native Kokkos kernels, or do some Kokkos kernels use ROCm?
I ask because the VecNorm GPU flop rate is about 4 times higher than that of VecTDot, which I would not expect, and VecAXPY runs at less than 1/4 the rate of VecAYPX (I would not expect that either).

MatMult          400 1.0 1.0288e+00 1.0 1.02e+11 1.0 0.0e+00 0.0e+00 0.0e+00 0 54 0 0 0 43 91 0 0 0 98964 0 0 0.00e+00 0 0.00e+00 100
MatView          2 1.0 3.3745e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
KSPSolve         2 1.0 2.3989e+00 1.0 1.12e+11 1.0 0.0e+00 0.0e+00 0.0e+00 1 60 0 0 0 100100 0 0 0 46887 220,001 0 0.00e+00 0 0.00e+00 100
VecTDot          802 1.0 4.7745e-01 1.0 3.29e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 20 3 0 0 0 6882 15,426 0 0.00e+00 0 0.00e+00 100
VecNorm          402 1.0 1.1532e-01 1.0 1.65e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 5 1 0 0 0 14281 62,757 0 0.00e+00 0 0.00e+00 100
VecCopy          4 1.0 2.1859e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecSet           4 1.0 2.1910e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecAXPY          800 1.0 5.5739e-01 1.0 3.28e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 2 0 0 0 23 3 0 0 0 5880 14,666 0 0.00e+00 0 0.00e+00 100
VecAYPX          398 1.0 1.0668e-01 1.0 1.63e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 4 1 0 0 0 15284 71,218 0 0.00e+00 0 0.00e+00 100
VecPointwiseMult 402 1.0 1.0930e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 5 1 0 0 0 7534 33,579 0 0.00e+00 0 0.00e+00 100
PCApply          402 1.0 1.0940e-01 1.0 8.23e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 5 1 0 0 0 7527 33,579 0 0.00e+00 0 0.00e+00 100

> On Jan 21, 2022, at 9:46 PM, Mark Adams <mfad...@lbl.gov> wrote:
>
> But in particular look at the VecTDot and VecNorm CPU flop rates compared to the GPU, much lower, this tells me the MPI_Allreduce is likely hurting performance in there also a great deal. It would be good to see a single MPI rank job to compare to see performance without the MPI overhead.
>
> Here are two single processor runs, with a whole GPU. It's not clear if --ntasks-per-gpu=1 refers to the GPU socket (4 of them) or the GPUs (8).
>
> <jac_out_001_kokkos_Crusher_3_1_gpuawarempi.txt><jac_out_001_kokkos_Crusher_4_1_gpuawarempi.txt>
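
For reference, the two ratios I mention at the top can be checked with a few lines of Python. This is only a sketch: the rates are copied by hand from the log rows above, taken from what I assume is the GPU Mflop/s column (the comma-formatted one). The names `gpu_mflops` and `rate_ratio` are made up for illustration and are not part of PETSc.

```python
# GPU flop rates in Mflop/s, copied by hand from the log rows above.
# Assumption: the comma-formatted column is the GPU Mflop/s column.
gpu_mflops = {
    "VecTDot": 15426,
    "VecNorm": 62757,
    "VecAXPY": 14666,
    "VecAYPX": 71218,
}


def rate_ratio(event_a, event_b, rates=gpu_mflops):
    """Return the flop rate of event_a divided by the flop rate of event_b."""
    return rates[event_a] / rates[event_b]


# VecNorm and VecTDot are both reductions, so similar rates would be expected.
norm_vs_tdot = rate_ratio("VecNorm", "VecTDot")

# VecAXPY and VecAYPX are both streaming updates of comparable cost.
axpy_vs_aypx = rate_ratio("VecAXPY", "VecAYPX")

print(f"VecNorm / VecTDot = {norm_vs_tdot:.2f}")
print(f"VecAXPY / VecAYPX = {axpy_vs_aypx:.2f}")
```

The first ratio comes out slightly above 4 and the second just above 0.2, which is what prompted the question.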