Dear Dave,

Did you run the code with double precision?
Thanks,
Yujie

On Thu, Feb 23, 2012 at 11:06 AM, Nystrom, William D <wdn at lanl.gov> wrote:

> I recently ran a couple of test runs with petsc-dev that I do not
> understand. I'm running on a test bed machine that has 4 nodes with two
> Tesla 2090 GPUs per node. Each node is dual socket and populated with
> Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz processors. These are 8-core
> processors, so each node has 16 cores. On the GPU, I'm running with
> Paul's latest version of the txpetscgpu package. I'm running the
> src/ksp/ksp/examples/tutorials/ex2.c PETSc example with m=n=10000. My
> objective was to compare the performance of running on 4 nodes using all
> 8 GPUs to that of running on the same 4 nodes with all 64 cores. This
> problem uses about a third of the memory available on the GPUs. I was
> using cg with jacobi preconditioning for both the GPU run and the CPU
> run. What is puzzling to me is that the CPU case ran 44x slower than the
> GPU case, and the big difference was in the time spent in functions like
> VecTDot, VecNorm and VecAXPY.
>
> Below is a table that summarizes the performance of the main functions
> that were using time in the two runs. Times are in seconds.
>
>          |   GPU   |   CPU   | Ratio (CPU/GPU)
> ---------+---------+---------+----------------
> MatMult  |  450.64 |  5484.7 |  12.17
> VecTDot  |  285.35 | 16688.0 |  58.48
> VecNorm  |   19.03 |  9058.8 | 476.03
> VecAXPY  |  106.88 |  5636.3 |  52.73
> VecAYPX  |   53.69 |    85.1 |   1.58
> KSPSolve |  811.95 | 35930.0 |  44.25
>
> The CPU-versus-GPU ratio for MatMult is what I typically see when
> comparing a run on a single CPU core to a run on a single GPU. Since
> both runs communicate across nodes via MPI, I'm puzzled about why the
> CPU case is so much slower than the GPU case, especially since MatMult
> involves communication as well. Both runs compute the same final error
> norm using the exact same number of iterations. Do these results make
> sense to people who understand the performance issues of parallel sparse
> linear solvers much better than I do, or do they look abnormal? I had
> wondered whether part of the performance issue was related to my running
> 8 times as many MPI processes in the CPU case. However, I ran a smaller
> problem with m=n=1000 using 8 MPI processes (2 cores per node), and I
> see the same extreme differences in the times spent in VecTDot, VecNorm
> and VecAXPY.
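For reference, the rows in the table above are the vector kernels of a PETSc
CG iteration. Below is a minimal, self-contained sketch of those calls,
written against the 2012-era PETSc C API; it is not taken from ex2.c, and the
vector sizes and values are arbitrary placeholders. VecTDot and VecNorm each
finish with a reduction over the whole communicator, while VecAXPY and VecAYPX
are purely local updates.

/* Sketch of the vector kernels from the table (not from ex2.c).
 * Vector length and values are arbitrary.  Build and run like any
 * PETSc program; -vec_type cusp would move the vectors to the GPU. */
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            p, q, r;
  PetscScalar    alpha, dot;
  PetscReal      rnorm;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);

  /* Parallel vectors; the type is taken from -vec_type. */
  ierr = VecCreate(PETSC_COMM_WORLD, &p);CHKERRQ(ierr);
  ierr = VecSetSizes(p, PETSC_DECIDE, 1000000);CHKERRQ(ierr);
  ierr = VecSetFromOptions(p);CHKERRQ(ierr);
  ierr = VecDuplicate(p, &q);CHKERRQ(ierr);
  ierr = VecDuplicate(p, &r);CHKERRQ(ierr);
  ierr = VecSet(p, 1.0);CHKERRQ(ierr);
  ierr = VecSet(q, 2.0);CHKERRQ(ierr);
  ierr = VecSet(r, 3.0);CHKERRQ(ierr);

  /* VecTDot: local dot products followed by a global reduction. */
  ierr = VecTDot(p, q, &dot);CHKERRQ(ierr);
  alpha = 1.0 / dot;

  /* VecAXPY / VecAYPX: purely local updates, no communication. */
  ierr = VecAXPY(r, -alpha, q);CHKERRQ(ierr);   /* r <- r - alpha*q */
  ierr = VecAYPX(p, alpha, r);CHKERRQ(ierr);    /* p <- r + alpha*p */

  /* VecNorm: local sum of squares followed by a global reduction. */
  ierr = VecNorm(r, NORM_2, &rnorm);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "||r|| = %g\n", (double)rnorm);CHKERRQ(ierr);

  ierr = VecDestroy(&p);CHKERRQ(ierr);
  ierr = VecDestroy(&q);CHKERRQ(ierr);
  ierr = VecDestroy(&r);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}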
> Here are the command lines I used for the two runs:
>
> CPU:
>
> mpirun -np 64 -mca btl self,sm,openib ex2 -m 10000 -n 10000 -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left
>
> GPU:
>
> mpirun -np 8 -npernode 2 -mca btl self,sm,openib ex2 -m 10000 -n 10000 -ksp_type cg -ksp_max_it 100000 -pc_type jacobi -log_summary -options_left -mat_type aijcusp -vec_type cusp -cusp_storage_format dia
>
> Thanks,
>
> Dave
>
> --
> Dave Nystrom
> LANL HPC-5
> Phone: 505-667-7913
> Email: wdn at lanl.gov
> Smail: Mail Stop B272
> Group HPC-5
> Los Alamos National Laboratory
> Los Alamos, NM 87545
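For anyone who wants to reproduce the setup without pulling up the full
ex2.c, here is a rough, self-contained sketch of an equivalent driver: a
5-point Laplacian on an m x n grid, solved with CG and Jacobi, with the
matrix and vector types left to -mat_type/-vec_type so the same
aijcusp/cusp options above apply. It is written against the 2012-era PETSc
API (4-argument KSPSetOperators and PetscOptionsGetInt, MatGetVecs); newer
releases changed those signatures. The default grid size and right-hand
side are placeholders, not the values used in the runs above.

/* minimal_cg.c: stripped-down sketch in the spirit of
 * src/ksp/ksp/examples/tutorials/ex2.c (not the actual source).
 * 5-point Laplacian on an m x n grid, CG + Jacobi. */
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PC             pc;
  PetscInt       m = 100, n = 100, Istart, Iend, Ii, J, i, j;
  PetscScalar    v;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
  ierr = PetscOptionsGetInt(NULL, "-m", &m, NULL);CHKERRQ(ierr);
  ierr = PetscOptionsGetInt(NULL, "-n", &n, NULL);CHKERRQ(ierr);

  /* Distributed m*n by m*n matrix; type (aij, aijcusp, ...) from options. */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, m*n, m*n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(A, 5, NULL, 5, NULL);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(A, 5, NULL);CHKERRQ(ierr);

  /* Assemble the 5-point stencil on the locally owned rows. */
  ierr = MatGetOwnershipRange(A, &Istart, &Iend);CHKERRQ(ierr);
  for (Ii = Istart; Ii < Iend; Ii++) {
    i = Ii / n; j = Ii - i*n; v = -1.0;
    if (i > 0)   { J = Ii - n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr); }
    if (i < m-1) { J = Ii + n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr); }
    if (j > 0)   { J = Ii - 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr); }
    if (j < n-1) { J = Ii + 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr); }
    v = 4.0;     ierr = MatSetValues(A,1,&Ii,1,&Ii,&v,INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* Vectors of matching layout; type (standard, cusp, ...) from options.
   * MatGetVecs was renamed MatCreateVecs in later releases. */
  ierr = MatGetVecs(A, &x, &b);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);   /* placeholder right-hand side */

  /* CG with Jacobi preconditioning, overridable on the command line. */
  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A, SAME_NONZERO_PATTERN);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCJACOBI);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}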
