I want to note that Crusher is early-access hardware, so we should expect 
performance not to be great right now.  The best course is to do what we can to 
help identify the performance issues and keep OLCF informed.

Note that we cannot make any of the preliminary results publicly available 
without explicit permission from OLCF; all of the results have to be considered 
preliminary, and the software stack will undergo rapid churn.

All the best, Todd.

From: petsc-dev <petsc-dev-boun...@mcs.anl.gov> on behalf of Barry Smith 
<bsm...@petsc.dev>
Date: Monday, January 24, 2022 at 2:24 PM
To: Justin Chang <jychan...@gmail.com>
Cc: "petsc-dev@mcs.anl.gov" <petsc-dev@mcs.anl.gov>
Subject: Re: [petsc-dev] Kokkos/Crusher perforance


  For this, to start, someone can run

src/vec/vec/tutorials/performance.c


and compare the performance to that in the technical report "Evaluation of 
PETSc on a Heterogeneous Architecture, the OLCF Summit System, Part I: Vector 
Node Performance". Google to find. One does not have to, and shouldn't, do an 
extensive study right now that compares everything; instead one should run a 
very small number of different size problems (make them big) and compare those 
sizes with what Summit gives. Note you will need to make sure that 
performance.c uses the Kokkos backend.

  One hopes for better performance than Summit; if one gets tons worse we know 
something is very wrong somewhere. I'd love to see some comparisons.
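
A minimal sketch of the comparison described above, assuming one flop rate per 
problem size has already been collected on each machine. The function name, 
the 0.5 alarm threshold, and every number below are hypothetical placeholders, 
not measurements from Crusher or Summit.

```python
def compare_rates(crusher, summit, alarm=0.5):
    """Return {size: (ratio, flagged)} for sizes measured on both machines.

    ratio is the Crusher rate divided by the Summit rate; flagged is True
    when the ratio falls below `alarm`, a hypothetical "tons worse" threshold.
    """
    report = {}
    for size in sorted(set(crusher) & set(summit)):
        ratio = crusher[size] / summit[size]
        report[size] = (ratio, ratio < alarm)
    return report


# Placeholder numbers only, NOT measurements from either machine.
summit_rates = {10_000_000: 100.0, 100_000_000: 100.0}
crusher_rates = {10_000_000: 150.0, 100_000_000: 40.0}

report = compare_rates(crusher_rates, summit_rates)
for size, (ratio, flagged) in report.items():
    note = "  <-- investigate" if flagged else ""
    print(f"n = {size:>11d}  Crusher/Summit = {ratio:.2f}{note}")
```

Only a handful of large sizes is needed; the point is to spot a gross 
regression quickly rather than to produce a full study.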

  Barry



On Jan 24, 2022, at 3:06 PM, Justin Chang 
<jychan...@gmail.com> wrote:

Also, do you guys have an OLCF liaison? That's actually your better bet if you 
do.

Performance issues with ROCm/Kokkos are pretty common in apps besides just 
PETSc. We have several teams actively working on rectifying this. However, I 
think performance issues could be identified more quickly if we had a more 
"official" and reproducible PETSc GPU benchmark. I've already expressed this to 
some folks in this thread, and others have already commented on the difficulty 
of such a task. Hopefully I will have more time soon to illustrate what I am 
thinking.

On Mon, Jan 24, 2022 at 1:57 PM Justin Chang 
<jychan...@gmail.com> wrote:
My name has been called.

Mark, if you're having issues with Crusher, please contact Veronica Vergara 
(vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in those emails.

On Mon, Jan 24, 2022 at 1:49 PM Barry Smith 
<bsm...@petsc.dev> wrote:



On Jan 24, 2022, at 2:46 PM, Mark Adams 
<mfad...@lbl.gov> wrote:

Yea, CG/Jacobi is as close to a benchmark code as we could want. I could run 
this on one processor to get cleaner numbers.

Is there a designated ECP technical support contact?

   Mark, you've forgotten you work for DOE. There isn't a non-ECP technical 
support contact.

   But if this is an AMD machine then maybe contact Matt's student Justin Chang?






On Mon, Jan 24, 2022 at 2:18 PM Barry Smith 
<bsm...@petsc.dev> wrote:

  I think you should contact the Crusher ECP technical support team and tell 
them you are getting dismal performance and ask if you should expect better. 
Don't waste time flogging a dead horse.


On Jan 24, 2022, at 2:16 PM, Matthew Knepley 
<knep...@gmail.com> wrote:

On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang 
<junchao.zh...@gmail.com> wrote:


On Mon, Jan 24, 2022 at 12:55 PM Mark Adams 
<mfad...@lbl.gov> wrote:


On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang 
<junchao.zh...@gmail.com> wrote:
Mark, I think you can benchmark individual vector operations, and once we get 
reasonable profiling results, we can move to solvers etc.

Can you suggest a code to run or are you suggesting making a vector benchmark 
code?
Make a vector benchmark code, testing vector operations that would be used in 
your solver.
Also, we can run MatMult() to see if the profiling result is reasonable.
Only once we get some solid results on basic operations is it useful to run 
big codes.
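
As a rough illustration of the kind of vector benchmark suggested here, the 
sketch below times AXPY, dot, and 2-norm kernels and reports Mflop/s. It is a 
pure-Python stand-in that only shows the structure (kernel, flop count, 
best-of-N timing); it does not use PETSc or a GPU, the helper names are made 
up, and the flop counts are the usual approximate 2n per operation.

```python
import math
import time


def axpy(alpha, x, y):
    """y <- alpha*x + y, about 2n flops."""
    for i in range(len(x)):
        y[i] += alpha * x[i]


def dot(x, y):
    """Inner product, about 2n flops."""
    return sum(a * b for a, b in zip(x, y))


def norm2(x):
    """Euclidean norm, about 2n flops."""
    return math.sqrt(sum(a * a for a in x))


def mflops(op, flops, repeats=5):
    """Run `op` several times and return Mflop/s based on the fastest run."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        op()
        best = min(best, time.perf_counter() - start)
    return flops / best / 1.0e6


n = 100_000  # small for illustration; a real run would use much larger vectors
x = [1.0] * n
y = [2.0] * n

# alpha = 0.0 keeps y unchanged, so repeated timing does not alter the data.
rates = {
    "axpy": mflops(lambda: axpy(0.0, x, y), 2 * n),
    "dot": mflops(lambda: dot(x, y), 2 * n),
    "norm2": mflops(lambda: norm2(x), 2 * n),
}
for op_name, rate in rates.items():
    print(f"{op_name:6s} {rate:10.1f} Mflop/s")
```

The same three-part layout would carry over to a benchmark written against 
the operations the solver actually calls.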

So we have to make another throw-away code? Why not just look at the vector ops 
in Mark's actual code?

   Matt



--Junchao Zhang


On Mon, Jan 24, 2022 at 12:09 PM Mark Adams 
<mfad...@lbl.gov> wrote:


On Mon, Jan 24, 2022 at 12:44 PM Barry Smith 
<bsm...@petsc.dev> wrote:

  Here, except for VecNorm, the GPU is used effectively, in that most of the 
time is spent doing real work on the GPU.

VecNorm              402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393      0 0.00e+00    0 0.00e+00 100

Even the dots are very effective; only the VecNorm flop rate over the full time 
is much, much lower than that of VecDot. Is that somehow due to the use of the 
GPU or CPU MPI in the allreduce?
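
To make the VecNorm line easier to read, here is a small sketch that splits 
it into named fields. The column names are an assumption about the -log_view 
event layout with GPU columns (count, time, flop, messages, reductions, global 
and stage percentages, total and GPU Mflop/s, transfer counts and sizes, GPU 
%F); check them against the header printed in your own log.

```python
# Assumed column order for a -log_view event line with GPU columns.
FIELDS = [
    "count_max", "count_ratio", "time_max", "time_ratio",
    "flop_max", "flop_ratio", "mess", "avg_len", "reduct",
    "global_pT", "global_pF", "global_pM", "global_pL", "global_pR",
    "stage_pT", "stage_pF", "stage_pM", "stage_pL", "stage_pR",
    "total_mflops", "gpu_mflops",
    "cpu_to_gpu_count", "cpu_to_gpu_size",
    "gpu_to_cpu_count", "gpu_to_cpu_size", "gpu_pF",
]


def parse_event_line(line):
    """Split one event line into its name and a dict of float fields."""
    tokens = line.split()
    name, values = tokens[0], tokens[1:]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(values)}")
    return name, dict(zip(FIELDS, map(float, values)))


# The VecNorm line quoted above, written as one string.
line = (
    "VecNorm              402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 "
    "4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393      0 0.00e+00    0 "
    "0.00e+00 100"
)
name, ev = parse_event_line(line)
gpu_over_total = ev["gpu_mflops"] / ev["total_mflops"]
print(f"{name}: time ratio {ev['time_ratio']}, "
      f"GPU/total flop rate {gpu_over_total:.1f}x")
```

Read this way, the GPU rate (225393) is about 7.5 times the rate over the 
full time (30230), and the max/min time ratio across ranks is 6.1.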

The VecNorm GPU rate is relatively high on Crusher and the CPU rate is about 
the same as the other vec ops. I don't know what to make of that.

But Crusher is clearly not crushing it.

Junchao: Perhaps we should ask Kokkos if they have any experience with Crusher 
that they can share. They could very well find some low level magic.






On Jan 24, 2022, at 12:14 PM, Mark Adams 
<mfad...@lbl.gov> wrote:



Mark, can we compare with Spock?

 Looks much better. This puts two processes/GPU because there are only 4.
<jac_out_001_kokkos_Spock_6_1_notpl.txt>



--
What most experimenters take for granted before they begin their experiments is 
infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/


