On Tue, Apr 20, 2021 at 9:06 PM Sreepathi, Sarat <sa...@ornl.gov> wrote:
> Already tried those but it didn't help. I have been trying to experiment
> with 48x1, 24x2 etc. and performance degraded for the climate workload.

I have problems even using all 48 cores on both my Kokkos Landau code and KK matrix-vector products (basically) in algebraic multigrid (AMG).

For AMG, using 8 (threads) x 4 (MPI) was best, and thread speedup was moderate. I don't know how well KK vectorizes, but in principle they should be able to make that work (they can write any code they want in KK).

For Landau, I get great thread speedup. This code is MPI serial. I get the same throughput with 32x1, 16x2, 8x4 and 4x8. It looks like I am not getting any vectorization. With the large (10 species) problem that I use as my test case, it runs very slowly when I use all 48 cores in any configuration. With 2 species it does not die; it is just not great, but I have not looked at this in any detail.

Let us know if you find anything.

Thanks,
Mark