I recently ran a couple of test runs with petsc-dev that I do not understand.
I'm running on a test-bed machine that has 4 nodes with two Tesla M2090 GPUs
per node.  Each node is dual socket, populated with Intel Xeon E5-2670
processors (8 cores at 2.60GHz), so each node has 16 cores.  On the GPU side
I'm running with Paul's latest version of the txpetscgpu package.  I'm running
the src/ksp/ksp/examples/tutorials/ex2.c PETSc example with m=n=10000.  My
objective was to compare the performance of running on 4 nodes using all 8
GPUs against running on the same 4 nodes with all 64 cores.  This problem uses
about a third of the memory available on the GPUs.  I was using CG with Jacobi
preconditioning for both the GPU run and the CPU run.  What is puzzling to me
is that the CPU case ran 44x slower than the GPU case, and the big difference
was in the time spent in functions like VecTDot, VecNorm, and VecAXPY.
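
For reference, here is a rough sketch of the vector work a textbook CG
iteration performs, written against PETSc's public API.  This is my own
schematic, not PETSc's actual KSPCG source; the comments mark which kernels
need global communication and which are purely local:

#include <petscksp.h>

/* Sketch of textbook CG against the public PETSc API (not PETSc's actual
   KSPCG implementation).  Comments mark the communication each kernel does. */
static PetscErrorCode CGSketch(Mat A, PC pc, Vec b, Vec x, PetscInt maxit)
{
  Vec            r, z, p, w;
  PetscScalar    rz, rz_old, pw, alpha, beta;
  PetscReal      rnorm;
  PetscInt       i;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecDuplicate(x, &r); CHKERRQ(ierr);
  ierr = VecDuplicate(x, &z); CHKERRQ(ierr);
  ierr = VecDuplicate(x, &p); CHKERRQ(ierr);
  ierr = VecDuplicate(x, &w); CHKERRQ(ierr);

  ierr = MatMult(A, x, r); CHKERRQ(ierr);            /* halo exchange with neighbors */
  ierr = VecAYPX(r, -1.0, b); CHKERRQ(ierr);         /* r = b - A*x: purely local    */
  ierr = PCApply(pc, r, z); CHKERRQ(ierr);           /* Jacobi: pointwise, local     */
  ierr = VecCopy(z, p); CHKERRQ(ierr);
  ierr = VecTDot(r, z, &rz); CHKERRQ(ierr);          /* global MPI_Allreduce         */

  for (i = 0; i < maxit; i++) {
    ierr = MatMult(A, p, w); CHKERRQ(ierr);          /* halo exchange with neighbors */
    ierr = VecTDot(p, w, &pw); CHKERRQ(ierr);        /* global MPI_Allreduce         */
    alpha = rz / pw;
    ierr = VecAXPY(x, alpha, p); CHKERRQ(ierr);      /* purely local                 */
    ierr = VecAXPY(r, -alpha, w); CHKERRQ(ierr);     /* purely local                 */
    ierr = VecNorm(r, NORM_2, &rnorm); CHKERRQ(ierr);/* global MPI_Allreduce         */
    /* convergence test on rnorm omitted */
    ierr = PCApply(pc, r, z); CHKERRQ(ierr);         /* local                        */
    rz_old = rz;
    ierr = VecTDot(r, z, &rz); CHKERRQ(ierr);        /* global MPI_Allreduce         */
    beta = rz / rz_old;
    ierr = VecAYPX(p, beta, z); CHKERRQ(ierr);       /* p = z + beta*p: purely local */
  }

  ierr = VecDestroy(&r); CHKERRQ(ierr);
  ierr = VecDestroy(&z); CHKERRQ(ierr);
  ierr = VecDestroy(&p); CHKERRQ(ierr);
  ierr = VecDestroy(&w); CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

The point of the sketch: VecTDot and VecNorm each hide an MPI_Allreduce,
while VecAXPY and VecAYPX do no communication at all, which is why the
VecAXPY ratio below surprises me as much as the reductions do.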

Below is a table summarizing the time spent in the main functions in the two
runs.  Times are in seconds.

Function  |    GPU    |    CPU    | Ratio (CPU/GPU)
----------+-----------+-----------+----------------
MatMult   |   450.64  |   5484.7  |    12.17
VecTDot   |   285.35  |  16688.0  |    58.48
VecNorm   |    19.03  |   9058.8  |   476.03
VecAXPY   |   106.88  |   5636.3  |    52.73
VecAYPX   |    53.69  |     85.1  |     1.58
KSPSolve  |   811.95  |  35930.0  |    44.25

The ratio of MatMult for CPU versus GPU is what I typically see when comparing
a run on a single core to a run on a single GPU.  Since both runs communicate
across nodes via MPI, I'm puzzled about why the CPU case is so much slower
than the GPU case, especially since the MatMult involves communication as
well.  Both runs compute the same final error norm using the exact same number
of iterations.  Do these results make sense to people who understand the
performance issues of parallel sparse linear solvers much better than I do, or
do these results look abnormal?  I had wondered if part of the performance
issue was related to my running 8 times as many MPI processes in the CPU case.
However, I ran a smaller problem with m=n=1000, using 8 MPI processes at
2 cores per node, and I see the same extreme differences in the times spent in
VecTDot, VecNorm, and VecAXPY.
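
If it would help, the three suspect kernels could also be timed in isolation,
with no matrix and no solver in the picture.  Below is a minimal sketch of
such a microbenchmark (my own code, not one of the PETSc examples; the vector
length matches the 10000x10000 grid and the repeat count is arbitrary):

#include <petscvec.h>

/* Minimal microbenchmark sketch for the three suspect kernels.  Run under
   -log_summary with and without -vec_type cusp to compare per-call times. */
int main(int argc, char **argv)
{
  Vec            x, y;
  PetscScalar    dot;
  PetscReal      nrm;
  PetscInt       i, N = 100000000;   /* same 10^8 unknowns as m=n=10000 */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
  ierr = VecSetSizes(x, PETSC_DECIDE, N); CHKERRQ(ierr);
  ierr = VecSetFromOptions(x); CHKERRQ(ierr);        /* honors -vec_type cusp */
  ierr = VecDuplicate(x, &y); CHKERRQ(ierr);
  ierr = VecSet(x, 1.0); CHKERRQ(ierr);
  ierr = VecSet(y, 2.0); CHKERRQ(ierr);

  for (i = 0; i < 1000; i++) {
    ierr = VecAXPY(y, 1.0, x); CHKERRQ(ierr);        /* no communication  */
    ierr = VecTDot(x, y, &dot); CHKERRQ(ierr);       /* one MPI_Allreduce */
    /* norm y, not x: y changes every pass, so PETSc cannot return a
       cached norm and must actually recompute (and re-reduce) it */
    ierr = VecNorm(y, NORM_2, &nrm); CHKERRQ(ierr);  /* one MPI_Allreduce */
  }

  ierr = VecDestroy(&x); CHKERRQ(ierr);
  ierr = VecDestroy(&y); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}

Running that with the same two mpirun lines as below (64 ranks plain versus
8 ranks with the cusp options) would show whether the anomaly survives with
the matrix out of the picture.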

Here are the command lines I used for the two runs:

CPU:

mpirun -np 64 -mca btl self,sm,openib ex2 -m 10000 -n 10000 \
    -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left

GPU:

mpirun -np 8 -npernode 2 -mca btl self,sm,openib ex2 -m 10000 -n 10000 \
    -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left \
    -mat_type aijcusp -vec_type cusp -cusp_storage_format dia
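
For what it's worth, the same GPU selection can also be made in code rather
than on the command line.  This is a hypothetical sketch: the type strings
"aijcusp" and "cusp" are just the -mat_type/-vec_type names above passed to
MatSetType/VecSetType, but I have not verified it against current petsc-dev:

#include <petscmat.h>

/* Hypothetical in-code equivalent of the GPU command-line flags above,
   assuming the petsc-dev CUSP type names of this vintage. */
int main(int argc, char **argv)
{
  Mat            A;
  Vec            x;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = PetscOptionsSetValue("-cusp_storage_format", "dia"); CHKERRQ(ierr);

  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 1000, 1000); CHKERRQ(ierr);
  ierr = MatSetType(A, "aijcusp"); CHKERRQ(ierr);   /* instead of -mat_type aijcusp */

  ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
  ierr = VecSetSizes(x, PETSC_DECIDE, 1000); CHKERRQ(ierr);
  ierr = VecSetType(x, "cusp"); CHKERRQ(ierr);      /* instead of -vec_type cusp */

  ierr = MatDestroy(&A); CHKERRQ(ierr);
  ierr = VecDestroy(&x); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}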

Thanks,

Dave

--
Dave Nystrom
LANL HPC-5
Phone: 505-667-7913
Email: wdn at lanl.gov
Smail: Mail Stop B272
       Group HPC-5
       Los Alamos National Laboratory
       Los Alamos, NM 87545

