Dear Dave,

Did you run the code with double precision?
Thanks,
Yujie

On Thu, Feb 23, 2012 at 11:06 AM, Nystrom, William D <wdn at lanl.gov> wrote:

> I recently ran a couple of test runs with petsc-dev that I do not
> understand. I'm running on a test bed machine that has 4 nodes with two
> Tesla 2090 GPUs per node. Each node is dual socket and populated with
> Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz processors. These are 8-core
> processors, so each node has 16 cores. On the GPU, I'm running with
> Paul's latest version of the txpetscgpu package. I'm running the
> src/ksp/ksp/examples/tutorials/ex2.c PETSc example with m=n=10000. My
> objective was to compare the performance of running on 4 nodes using all
> 8 GPUs to that of running on the same 4 nodes with all 64 cores. This
> problem uses about a third of the memory available on the GPUs. I was
> using cg with jacobi preconditioning for both the GPU run and the CPU
> run. What is puzzling to me is that the CPU case ran 44x slower than the
> GPU case, and the big difference was in the time spent in functions like
> VecTDot, VecNorm and VecAXPY.
>
> Below is a table that summarizes the performance of the main functions
> that were using time in the two runs. Times are in seconds.
>
>          |   GPU   |   CPU   | Ratio (CPU/GPU)
> ---------+---------+---------+----------------
> MatMult  |  450.64 |  5484.7 |  12.17
> VecTDot  |  285.35 | 16688.0 |  58.48
> VecNorm  |   19.03 |  9058.8 | 476.03
> VecAXPY  |  106.88 |  5636.3 |  52.73
> VecAYPX  |   53.69 |    85.1 |   1.58
> KSPSolve |  811.95 | 35930.0 |  44.25
>
> The CPU-versus-GPU ratio for MatMult is what I typically see when
> comparing a run on a single CPU core to a run on a single GPU. Since
> both runs communicate across nodes via MPI, I'm puzzled about why the
> CPU case is so much slower than the GPU case, especially since MatMult
> involves communication as well. Both runs compute the same final error
> norm using the exact same number of iterations. Do these results make
> sense to people who understand the performance issues of parallel sparse
> linear solvers much better than I do, or do they look abnormal? I had
> wondered whether part of the performance issue was related to my running
> 8 times as many MPI processes in the CPU case. However, I ran a smaller
> problem with m=n=1000 using 8 MPI processes (2 cores per node), and I
> see the same extreme differences in the times spent in VecTDot, VecNorm
> and VecAXPY.
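For reference, the rows in the table above are the vector kernels of a PETSc
CG iteration. Below is a minimal, self-contained sketch of those calls,
written against the 2012-era PETSc C API; it is not taken from ex2.c, and the
vector sizes and values are arbitrary placeholders. VecTDot and VecNorm each
finish with a reduction over the whole communicator, while VecAXPY and VecAYPX
are purely local updates.

/* Sketch of the vector kernels from the table (not from ex2.c).
 * Vector length and values are arbitrary.  Build and run like any
 * PETSc program; -vec_type cusp would move the vectors to the GPU. */
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            p, q, r;
  PetscScalar    alpha, dot;
  PetscReal      rnorm;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);

  /* Parallel vectors; the type is taken from -vec_type. */
  ierr = VecCreate(PETSC_COMM_WORLD, &p);CHKERRQ(ierr);
  ierr = VecSetSizes(p, PETSC_DECIDE, 1000000);CHKERRQ(ierr);
  ierr = VecSetFromOptions(p);CHKERRQ(ierr);
  ierr = VecDuplicate(p, &q);CHKERRQ(ierr);
  ierr = VecDuplicate(p, &r);CHKERRQ(ierr);
  ierr = VecSet(p, 1.0);CHKERRQ(ierr);
  ierr = VecSet(q, 2.0);CHKERRQ(ierr);
  ierr = VecSet(r, 3.0);CHKERRQ(ierr);

  /* VecTDot: local dot products followed by a global reduction. */
  ierr = VecTDot(p, q, &dot);CHKERRQ(ierr);
  alpha = 1.0 / dot;

  /* VecAXPY / VecAYPX: purely local updates, no communication. */
  ierr = VecAXPY(r, -alpha, q);CHKERRQ(ierr);   /* r <- r - alpha*q */
  ierr = VecAYPX(p, alpha, r);CHKERRQ(ierr);    /* p <- r + alpha*p */

  /* VecNorm: local sum of squares followed by a global reduction. */
  ierr = VecNorm(r, NORM_2, &rnorm);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "||r|| = %g\n", (double)rnorm);CHKERRQ(ierr);

  ierr = VecDestroy(&p);CHKERRQ(ierr);
  ierr = VecDestroy(&q);CHKERRQ(ierr);
  ierr = VecDestroy(&r);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}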
> Here are the command lines I used for the two runs:
>
> CPU:
>
> mpirun -np 64 -mca btl self,sm,openib ex2 -m 10000 -n 10000 -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left
>
> GPU:
>
> mpirun -np 8 -npernode 2 -mca btl self,sm,openib ex2 -m 10000 -n 10000 -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left -mat_type aijcusp -vec_type cusp -cusp_storage_format dia
>
> Thanks,
>
> Dave
>
> --
> Dave Nystrom
> LANL HPC-5
> Phone: 505-667-7913
> Email: wdn at lanl.gov
> Smail: Mail Stop B272
> Group HPC-5
> Los Alamos National Laboratory
> Los Alamos, NM 87545
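For anyone who wants to reproduce the setup without pulling up the full
ex2.c, here is a rough, self-contained sketch of an equivalent driver: a
5-point Laplacian on an m x n grid, solved with CG and Jacobi, with the
matrix and vector types left to -mat_type/-vec_type so the same
aijcusp/cusp options above apply. It is written against the 2012-era PETSc
API (4-argument KSPSetOperators and PetscOptionsGetInt, MatGetVecs); newer
releases changed those signatures. The default grid size and right-hand
side are placeholders, not the values used in the runs above.

/* minimal_cg.c: stripped-down sketch in the spirit of
 * src/ksp/ksp/examples/tutorials/ex2.c (not the actual source).
 * 5-point Laplacian on an m x n grid, CG + Jacobi. */
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PC             pc;
  PetscInt       m = 100, n = 100, Istart, Iend, Ii, J, i, j;
  PetscScalar    v;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
  ierr = PetscOptionsGetInt(NULL, "-m", &m, NULL);CHKERRQ(ierr);
  ierr = PetscOptionsGetInt(NULL, "-n", &n, NULL);CHKERRQ(ierr);

  /* Distributed m*n by m*n matrix; type (aij, aijcusp, ...) from options. */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, m*n, m*n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(A, 5, NULL, 5, NULL);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(A, 5, NULL);CHKERRQ(ierr);

  /* Assemble the 5-point stencil on the locally owned rows. */
  ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
  for (Ii = Istart; Ii < Iend; Ii++) {
    i = Ii / n; j = Ii - i*n; v = -1.0;
    if (i > 0)   { J = Ii - n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr); }
    if (i < m-1) { J = Ii + n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr); }
    if (j > 0)   { J = Ii - 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr); }
    if (j < n-1) { J = Ii + 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr); }
    v = 4.0;     ierr = MatSetValues(A,1,&Ii,1,&Ii,&v,INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* Vectors of matching layout; type (standard, cusp, ...) from options.
   * MatGetVecs was renamed MatCreateVecs in later releases. */
  ierr = MatGetVecs(A, &x, &b);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);   /* placeholder right-hand side */

  /* CG with Jacobi preconditioning, overridable on the command line. */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCJACOBI);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}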
