Sure, this is definitely not for the public; these are just numbers one can give to OLCF, AMD, and Kokkos to make sure things are going as they should.
> On Jan 24, 2022, at 3:30 PM, Munson, Todd <tmun...@mcs.anl.gov> wrote:
>
> I want to note that crusher is early access hardware, so we should expect
> performance to not be great right now. Doing what we can to help identify
> the performance issues and keeping OLCF informed would be the best.
>
> Note that we cannot make any of the preliminary results publicly available
> without explicit permission from OLCF; all of the results have to be
> considered preliminary and the software stack will undergo a rapid churn.
>
> All the best, Todd.
>
> From: petsc-dev <petsc-dev-boun...@mcs.anl.gov> on behalf of Barry Smith
> <bsm...@petsc.dev>
> Date: Monday, January 24, 2022 at 2:24 PM
> To: Justin Chang <jychan...@gmail.com>
> Cc: "petsc-dev@mcs.anl.gov" <petsc-dev@mcs.anl.gov>
> Subject: Re: [petsc-dev] Kokkos/Crusher perforance
>
> For this, to start, someone can run
>
>   src/vec/vec/tutorials/performance.c
>
> and compare the performance to that in the technical report Evaluation of
> PETSc on a Heterogeneous Architecture, the OLCF Summit System, Part I:
> Vector Node Performance. Google to find. One does not have to and shouldn't
> do an extensive study right now that compares everything; instead one should
> run a very small number of different size problems (make them big) and
> compare those sizes with what Summit gives. Note you will need to make sure
> that performance.c uses the Kokkos backend.
>
> One hopes for better performance than Summit; if one gets tons worse we
> know something is very wrong somewhere. I'd love to see some comparisons.
>
> Barry
>
>> On Jan 24, 2022, at 3:06 PM, Justin Chang <jychan...@gmail.com> wrote:
>>
>> Also, do you guys have an OLCF liaison? That's actually your better bet if
>> you do.
>>
>> Performance issues with ROCm/Kokkos are pretty common in apps besides just
>> PETSc. We have several teams actively working on rectifying this. However, I
>> think performance issues could be identified more quickly if we had a more
>> "official" and reproducible PETSc GPU benchmark, which I've already
>> expressed to some folks in this thread, and others have already commented on
>> the difficulty of such a task. Hopefully I will have more time soon to
>> illustrate what I am thinking.
>>
>> On Mon, Jan 24, 2022 at 1:57 PM Justin Chang <jychan...@gmail.com> wrote:
>>> My name has been called.
>>>
>>> Mark, if you're having issues with Crusher, please contact Veronica Vergara
>>> (vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in those emails.
>>>
>>> On Mon, Jan 24, 2022 at 1:49 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>>> On Jan 24, 2022, at 2:46 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>
>>>>> Yea, CG/Jacobi is as close to a benchmark code as we could want. I could
>>>>> run this on one processor to get cleaner numbers.
>>>>>
>>>>> Is there a designated ECP technical support contact?
>>>>
>>>> Mark, you've forgotten you work for DOE. There isn't a non-ECP
>>>> technical support contact.
>>>>
>>>> But if this is an AMD machine then maybe contact Matt's student Justin
>>>> Chang?
>>>>
>>>>>
>>>>> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>
>>>>>> I think you should contact the crusher ECP technical support team and
>>>>>> tell them you are getting dismal performance and ask if you should
>>>>>> expect better. Don't waste time flogging a dead horse.
>>>>>>
>>>>>>> On Jan 24, 2022, at 2:16 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>>>>>
>>>>>>> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>
>>>>>>>>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>>>>>> Mark, I think you can benchmark individual vector operations, and
>>>>>>>>>> once we get reasonable profiling results, we can move to solvers etc.
>>>>>>>>>
>>>>>>>>> Can you suggest a code to run or are you suggesting making a vector
>>>>>>>>> benchmark code?
>>>>>>>> Make a vector benchmark code, testing vector operations that would be
>>>>>>>> used in your solver.
>>>>>>>> Also, we can run MatMult() to see if the profiling result is
>>>>>>>> reasonable.
>>>>>>>> Only once we get some solid results on basic operations is it useful
>>>>>>>> to run big codes.
>>>>>>>
>>>>>>> So we have to make another throw-away code? Why not just look at the
>>>>>>> vector ops in Mark's actual code?
>>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --Junchao Zhang
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Here, except for VecNorm, the GPU is used effectively in that most
>>>>>>>>>>>> of the time is spent doing real work on the GPU
>>>>>>>>>>>>
>>>>>>>>>>>> VecNorm 402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02 0 1 0 0 20 9 1 0 0 33 30230 225393 0 0.00e+00 0 0.00e+00 100
>>>>>>>>>>>>
>>>>>>>>>>>> Even the dots are very effective; only the VecNorm flop rate over
>>>>>>>>>>>> the full time is much, much lower than the vecdot. Is that somehow
>>>>>>>>>>>> due to the use of the GPU or CPU MPI in the allreduce?
>>>>>>>>>>>
>>>>>>>>>>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate
>>>>>>>>>>> is about the same as the other vec ops. I don't know what to make
>>>>>>>>>>> of that.
>>>>>>>>>>>
>>>>>>>>>>> But Crusher is clearly not crushing it.
>>>>>>>>>>>
>>>>>>>>>>> Junchao: Perhaps we should ask Kokkos if they have any experience
>>>>>>>>>>> with Crusher that they can share. They could very well find some
>>>>>>>>>>> low level magic.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jan 24, 2022, at 12:14 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mark, can we compare with Spock?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Looks much better. This puts two processes/GPU because there are
>>>>>>>>>>>>> only 4.
>>>>>>>>>>>>> <jac_out_001_kokkos_Spock_6_1_notpl.txt>
>>>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> What most experimenters take for granted before they begin their
>>>>>>> experiments is infinitely more interesting than any results to which
>>>>>>> their experiments lead.
>>>>>>> -- Norbert Wiener
>>>>>>>
>>>>>>> https://www.cse.buffalo.edu/~knepley/
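
For anyone who wants to pull the numbers out of the VecNorm line quoted in the thread, here is a rough Python sketch. It assumes the usual -log_view column order for a GPU build (count, time, flop, messages, reductions, the global and stage percentages, then total Mflop/s, GPU Mflop/s, transfer counts and sizes, and GPU %F); that layout is not stated in the thread, so check it against the header of your own log. The helper name parse_event and the dictionary keys are made up for illustration.

```python
# The VecNorm event line exactly as quoted in the thread.
LOG_LINE = (
    "VecNorm 402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02 "
    "0 1 0 0 20 9 1 0 0 33 30230 225393 0 0.00e+00 0 0.00e+00 100"
)


def parse_event(line):
    """Split one event line into named fields.

    ASSUMPTION: 27 whitespace-separated tokens in the standard -log_view
    order for a GPU build. This is a hypothetical helper, not part of PETSc.
    """
    tok = line.split()
    if len(tok) != 27:
        raise ValueError(f"expected 27 columns, got {len(tok)}")
    return {
        "event": tok[0],
        "count": int(tok[1]),
        "time_max_s": float(tok[3]),     # max time over ranks
        "time_ratio": float(tok[4]),     # max/min time over ranks
        "flop_max": float(tok[5]),       # max flop count over ranks
        "total_mflops": float(tok[20]),  # rate over the full event time
        "gpu_mflops": float(tok[21]),    # rate while running on the GPU
        "gpu_flop_pct": float(tok[26]),  # share of flops done on the GPU
    }


ev = parse_event(LOG_LINE)

# How much faster the GPU-only rate is than the rate over the full event time.
gap = ev["gpu_mflops"] / ev["total_mflops"]

print(f"{ev['event']}: time ratio {ev['time_ratio']}, "
      f"GPU rate / total rate = {gap:.2f}")
```

Running the same helper on the VecDot line from the same log would give the comparison Barry describes; that line is not quoted in the thread, so it is not reproduced here.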