*Better to have an abstract so readers know your intention and conclusions up front.

*p.5  "We also launch all jobs using the --launch_distribution cyclic option so 
that MPI ranks are assigned to resource sets in a circular fashion, which we 
deem appropriate for most high performance computing (HPC) algorithms."
Cyclic distribution is fine for these simple Vec operations, since there is 
almost no communication, but it cannot be deemed appropriate for most HPC 
algorithms; I assume a packed distribution is better for locality. A minimal 
way to check the placement you actually get is sketched below.
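As an illustration only (not something from your report), printing each rank's 
host name is enough to see which placement was applied:

    #include <mpi.h>
    #include <stdio.h>

    /* Print each rank's host so one can verify whether a cyclic or
       packed --launch_distribution was actually applied. */
    int main(int argc, char **argv)
    {
      int  rank, len;
      char host[MPI_MAX_PROCESSOR_NAME];
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_processor_name(host, &len);
      printf("rank %d -> %s\n", rank, host);
      MPI_Finalize();
      return 0;
    }

With a cyclic distribution, consecutive ranks should land on different 
resource sets; with a packed one, they should fill a resource set before 
moving to the next.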

*Fig. 1 Left. I would use the diagram on p.11 of 
https://press3.mcs.anl.gov/atpesc/files/2018/08/ATPESC_2018_Track-1_6_7-30_130pm_Hill-Summit_at_ORNL.pdf,
which is more informative and contains many numbers readers can compare with 
your results, e.g., the peak bandwidths, which you mention but do not list.

*2.1 cudaMemcopy ? (presumably cudaMemcpy)
For the two bullets VecAXPY and VecDot, you should clearly list how you 
counted the flops and memory traffic that you used to calculate bandwidth and 
performance in the report; a sketch of the accounting I assume is below.
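To be concrete, this is the standard STREAM-style model I assume for 8-byte 
doubles; the vector length n and runtime t here are hypothetical placeholders:

    #include <stdio.h>

    /* Flop/byte model assumed, for n entries of 8-byte doubles:
       VecAXPY: y[i] = alpha*x[i] + y[i] -> 2n flops; reads x and y, writes y.
       VecDot:  sum += x[i]*y[i]         -> 2n flops; reads x and y only.  */
    int main(void)
    {
      double n = 1e8, t = 1e-2;   /* hypothetical vector length and runtime */
      double axpy_flops = 2.0*n, axpy_bytes = 3.0*8.0*n;
      double dot_flops  = 2.0*n, dot_bytes  = 2.0*8.0*n;
      printf("VecAXPY: %g Gflop/s, %g GB/s\n", axpy_flops/t/1e9, axpy_bytes/t/1e9);
      printf("VecDot:  %g Gflop/s, %g GB/s\n", dot_flops/t/1e9,  dot_bytes/t/1e9);
      return 0;
    }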

*p.12 VecACPY ? (presumably VecAXPY)
*p.12 I don't understand the difference between the two GPU launch times.

*When appropriate, can you draw a line for the hardware peak bandwidth or 
peak flop rate (flop/s) in the figures?

*p.13, some bullets are not important and could instead be mentioned earlier, 
in your experimental setup.
bullet 4: I think the reason is that, to get peak CPU->GPU bandwidth, the CPU 
buffer has to be pinned (i.e., non-pageable); see the sketch below.
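For reference, a pageable-vs-pinned comparison along these lines 
(cudaMallocHost allocates the page-locked buffer; the 256 MB size is just an 
example I picked):

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Time host->device copies from a pageable vs. a pinned buffer; only
       the pinned copy can be DMAed directly and reach peak bandwidth. */
    int main(void)
    {
      size_t bytes = (size_t)1 << 28;         /* 256 MB, example size     */
      void  *pageable = malloc(bytes), *pinned, *d_buf;
      cudaMallocHost(&pinned, bytes);         /* page-locked host buffer  */
      cudaMalloc(&d_buf, bytes);

      cudaEvent_t t0, t1;
      float ms;
      cudaEventCreate(&t0); cudaEventCreate(&t1);

      cudaEventRecord(t0);
      cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice);
      cudaEventRecord(t1); cudaEventSynchronize(t1);
      cudaEventElapsedTime(&ms, t0, t1);
      printf("pageable: %.1f GB/s\n", bytes/ms/1e6);

      cudaEventRecord(t0);
      cudaMemcpy(d_buf, pinned, bytes, cudaMemcpyHostToDevice);
      cudaEventRecord(t1); cudaEventSynchronize(t1);
      cudaEventElapsedTime(&ms, t0, t1);
      printf("pinned:   %.1f GB/s\n", bytes/ms/1e6);

      free(pageable); cudaFreeHost(pinned); cudaFree(d_buf);
      return 0;
    }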

--Junchao Zhang


On Wed, Oct 9, 2019 at 5:34 PM Smith, Barry F. via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:

   We've prepared a short report on the performance of vector operations on 
Summit and would appreciate any feedback, including: inconsistencies, lack of 
clarity, incorrect notation or terminology, etc.

   Thanks

    Barry, Hannah, and Richard