Sure, this is definitely not for the public; it is just numbers one can give 
to OLCF, AMD, and Kokkos to ensure things are working as they should.


> On Jan 24, 2022, at 3:30 PM, Munson, Todd <tmun...@mcs.anl.gov> wrote:
> 
> I want to note that Crusher is early-access hardware, so we should not expect 
> great performance right now. Doing what we can to help identify the 
> performance issues and keeping OLCF informed would be best.
>  
> Note that we cannot make any of the preliminary results publicly available 
> without explicit permission from OLCF; all of the results have to be 
> considered preliminary, and the software stack will undergo rapid churn.
>  
> All the best, Todd.
>  
> From: petsc-dev <petsc-dev-boun...@mcs.anl.gov> on behalf of Barry Smith <bsm...@petsc.dev>
> Date: Monday, January 24, 2022 at 2:24 PM
> To: Justin Chang <jychan...@gmail.com>
> Cc: "petsc-dev@mcs.anl.gov" <petsc-dev@mcs.anl.gov>
> Subject: Re: [petsc-dev] Kokkos/Crusher performance
>  
>  
>   For this, to start, someone can run 
>  
> src/vec/vec/tutorials/performance.c 
> 
> 
> and compare the performance to that in the technical report "Evaluation of 
> PETSc on a Heterogeneous Architecture, the OLCF Summit System, Part I: 
> Vector Node Performance" (Google to find it). One does not have to, and 
> shouldn't, do an extensive study right now that compares everything; instead, 
> run a very small number of problem sizes (make them big) and compare those 
> sizes with what Summit gives. Note you will need to make sure that 
> performance.c uses the Kokkos backend.
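> 
>   A minimal sketch of the kind of vector micro-benchmark meant here (this is 
> not performance.c itself; it assumes a recent PETSc, 3.17 or later for 
> PetscCall(), configured with Kokkos, and the size and operations are 
> illustrative only):
> 
> #include <petscvec.h>
> #include <petsctime.h>
> 
> int main(int argc, char **argv)
> {
>   Vec            x, y;
>   PetscScalar    dot;
>   PetscReal      nrm;
>   PetscLogDouble t0, t1;
>   PetscInt       n = 100000000; /* make it big, as suggested above */
> 
>   PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
>   PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
>   PetscCall(VecSetSizes(x, PETSC_DECIDE, n));
>   PetscCall(VecSetType(x, VECKOKKOS)); /* or VecSetFromOptions() + -vec_type kokkos */
>   PetscCall(VecDuplicate(x, &y));
>   PetscCall(VecSet(x, 1.0));
>   PetscCall(VecSet(y, 2.0));
> 
>   /* warm up so first-touch and launch overheads are not timed */
>   PetscCall(VecAXPY(y, 3.0, x));
>   PetscCall(VecDot(x, y, &dot));
>   PetscCall(VecNorm(y, NORM_2, &nrm));
> 
>   PetscCall(PetscTime(&t0));
>   PetscCall(VecAXPY(y, 3.0, x));       /* streaming kernel */
>   PetscCall(VecDot(x, y, &dot));       /* reduction + allreduce */
>   PetscCall(VecNorm(y, NORM_2, &nrm)); /* reduction + allreduce */
>   PetscCall(PetscTime(&t1));
>   PetscCall(PetscPrintf(PETSC_COMM_WORLD, "axpy+dot+norm: %g s\n", (double)(t1 - t0)));
> 
>   PetscCall(VecDestroy(&x));
>   PetscCall(VecDestroy(&y));
>   PetscCall(PetscFinalize());
>   return 0;
> }
> 
> Running it with -log_view also gives per-operation times and GPU flop rates 
> that can be lined up against the tables in the Summit report.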
>  
>   One hopes for better performance than Summit; if it is far worse, we know 
> something is very wrong somewhere. I'd love to see some comparisons.
>  
>   Barry
>  
> 
> 
>> On Jan 24, 2022, at 3:06 PM, Justin Chang <jychan...@gmail.com> wrote:
>>  
>> Also, do you guys have an OLCF liaison? That's actually your better bet if 
>> you do. 
>> 
>> Performance issues with ROCm/Kokkos are pretty common in apps besides just 
>> PETSc. We have several teams actively working on rectifying this. However, I 
>> think performance issues could be identified more quickly if we had a more 
>> "official" and reproducible PETSc GPU benchmark, something I've already 
>> raised with some folks in this thread, and others have already commented on 
>> the difficulty of such a task. Hopefully I will have more time soon to 
>> illustrate what I am thinking.
>>  
>> On Mon, Jan 24, 2022 at 1:57 PM Justin Chang <jychan...@gmail.com> wrote:
>>> My name has been called.
>>>  
>>> Mark, if you're having issues with Crusher, please contact Veronica Vergara 
>>> (vergar...@ornl.gov). You can cc me (justin.ch...@amd.com) in those emails.
>>>  
>>> On Mon, Jan 24, 2022 at 1:49 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>  
>>>> 
>>>> 
>>>>> On Jan 24, 2022, at 2:46 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>  
>>>>> Yea, CG/Jacobi is as close to a benchmark code as we could want. I could 
>>>>> run this on one processor to get cleaner numbers.
>>>>>  
>>>>> Is there a designated ECP technical support contact?
>>>>  
>>>>    Mark, you've forgotten you work for DOE. There isn't a non-ECP 
>>>> technical support contact. 
>>>>  
>>>>    But if this is an AMD machine then maybe contact Matt's student Justin 
>>>> Chang?
>>>>  
>>>>  
>>>> 
>>>> 
>>>>>  
>>>>>  
>>>>> On Mon, Jan 24, 2022 at 2:18 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>  
>>>>>>   I think you should contact the Crusher ECP technical support team, 
>>>>>> tell them you are getting dismal performance, and ask whether you should 
>>>>>> expect better. Don't waste time flogging a dead horse. 
>>>>>> 
>>>>>> 
>>>>>>> On Jan 24, 2022, at 2:16 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>>>>>  
>>>>>>> On Mon, Jan 24, 2022 at 2:11 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>>>>  
>>>>>>>>  
>>>>>>>> On Mon, Jan 24, 2022 at 12:55 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>  
>>>>>>>>>  
>>>>>>>>> On Mon, Jan 24, 2022 at 1:38 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>>>>>>> Mark, I think you can benchmark individual vector operations, and 
>>>>>>>>>> once we get reasonable profiling results, we can move to solvers etc.
>>>>>>>>>  
>>>>>>>>> Can you suggest a code to run or are you suggesting making a vector 
>>>>>>>>> benchmark code?
>>>>>>>> Make a vector benchmark code, testing vector operations that would be 
>>>>>>>> used in your solver.
>>>>>>>> Also, we can run MatMult() to see if the profiling result is 
>>>>>>>> reasonable.
>>>>>>>> Only once we get some solid results on the basic operations is it 
>>>>>>>> useful to run big codes.
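>>>>>>>> 
>>>>>>>> A hypothetical sketch of what that could look like for MatMult(): a 1D 
>>>>>>>> Laplacian assembled as MATAIJKOKKOS, with the repeated products wrapped 
>>>>>>>> in their own logging stage so -log_view reports them separately 
>>>>>>>> (assumes a recent PETSc built with Kokkos; the matrix and sizes are 
>>>>>>>> illustrative only):
>>>>>>>> 
>>>>>>>> #include <petscmat.h>
>>>>>>>> 
>>>>>>>> int main(int argc, char **argv)
>>>>>>>> {
>>>>>>>>   Mat           A;
>>>>>>>>   Vec           x, y;
>>>>>>>>   PetscInt      i, rstart, rend, n = 10000000;
>>>>>>>>   PetscLogStage stage;
>>>>>>>> 
>>>>>>>>   PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
>>>>>>>>   PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
>>>>>>>>   PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n));
>>>>>>>>   PetscCall(MatSetType(A, MATAIJKOKKOS));
>>>>>>>>   PetscCall(MatSetFromOptions(A)); /* -mat_type can still override */
>>>>>>>>   PetscCall(MatSetUp(A));
>>>>>>>>   PetscCall(MatGetOwnershipRange(A, &rstart, &rend));
>>>>>>>>   for (i = rstart; i < rend; i++) { /* simple 1D Laplacian stencil */
>>>>>>>>     if (i > 0) PetscCall(MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES));
>>>>>>>>     PetscCall(MatSetValue(A, i, i, 2.0, INSERT_VALUES));
>>>>>>>>     if (i < n - 1) PetscCall(MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES));
>>>>>>>>   }
>>>>>>>>   PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
>>>>>>>>   PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
>>>>>>>>   PetscCall(MatCreateVecs(A, &x, &y));
>>>>>>>>   PetscCall(VecSet(x, 1.0));
>>>>>>>> 
>>>>>>>>   PetscCall(MatMult(A, x, y)); /* warm up: first call pays the copy/launch cost */
>>>>>>>> 
>>>>>>>>   PetscCall(PetscLogStageRegister("MatMult bench", &stage));
>>>>>>>>   PetscCall(PetscLogStagePush(stage));
>>>>>>>>   for (i = 0; i < 100; i++) PetscCall(MatMult(A, x, y));
>>>>>>>>   PetscCall(PetscLogStagePop());
>>>>>>>> 
>>>>>>>>   PetscCall(MatDestroy(&A));
>>>>>>>>   PetscCall(VecDestroy(&x));
>>>>>>>>   PetscCall(VecDestroy(&y));
>>>>>>>>   PetscCall(PetscFinalize());
>>>>>>>>   return 0;
>>>>>>>> }
>>>>>>>> 
>>>>>>>> With -log_view, the "MatMult bench" stage then gives a MatMult time and 
>>>>>>>> GPU flop rate that can be compared directly against the vector-op numbers.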
>>>>>>>  
>>>>>>> So we have to make another throw-away code? Why not just look at the 
>>>>>>> vector ops in Mark's actual code?
>>>>>>>  
>>>>>>>    Matt
>>>>>>>  
>>>>>>>>>  
>>>>>>>>>>  
>>>>>>>>>> --Junchao Zhang
>>>>>>>>>>  
>>>>>>>>>>  
>>>>>>>>>> On Mon, Jan 24, 2022 at 12:09 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>>>  
>>>>>>>>>>>  
>>>>>>>>>>> On Mon, Jan 24, 2022 at 12:44 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>>>>>>  
>>>>>>>>>>>>   Here, except for VecNorm, the GPU is used effectively in that 
>>>>>>>>>>>> most of the time is spent doing real work on the GPU
>>>>>>>>>>>>  
>>>>>>>>>>>> VecNorm              402 1.0 4.4100e-01 6.1 1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02  0  1  0  0 20   9  1  0  0 33 30230   225393      0 0.00e+00    0 0.00e+00 100
>>>>>>>>>>>>  
>>>>>>>>>>>> Even the dots are very effective; only the VecNorm flop rate over 
>>>>>>>>>>>> the full time is much, much lower than the VecDot. Is that somehow 
>>>>>>>>>>>> due to the use of GPU or CPU MPI in the allreduce?
>>>>>>>>>>>  
>>>>>>>>>>> The VecNorm GPU rate is relatively high on Crusher and the CPU rate 
>>>>>>>>>>> is about the same as the other vec ops. I don't know what to make 
>>>>>>>>>>> of that.
>>>>>>>>>>>  
>>>>>>>>>>> But Crusher is clearly not crushing it. 
>>>>>>>>>>>  
>>>>>>>>>>> Junchao: Perhaps we should ask the Kokkos developers if they have 
>>>>>>>>>>> any experience with Crusher that they can share. They could very 
>>>>>>>>>>> well find some low-level magic.
>>>>>>>>>>>  
>>>>>>>>>>>  
>>>>>>>>>>>>  
>>>>>>>>>>>>  
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jan 24, 2022, at 12:14 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>>>>>  
>>>>>>>>>>>>>  
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Mark, can we compare with Spock?
>>>>>>>>>>>>>  
>>>>>>>>>>>>>  Looks much better. This puts two processes per GPU because there 
>>>>>>>>>>>>> are only 4 GPUs.
>>>>>>>>>>>>> <jac_out_001_kokkos_Spock_6_1_notpl.txt>
>>>>>>>>>>>> 
>>>>>>>>>>>>  
>>>>>>> 
>>>>>>> 
>>>>>>>  
>>>>>>> -- 
>>>>>>> What most experimenters take for granted before they begin their 
>>>>>>> experiments is infinitely more interesting than any results to which 
>>>>>>> their experiments lead.
>>>>>>> -- Norbert Wiener
>>>>>>>  
>>>>>>> https://www.cse.buffalo.edu/~knepley/
