Hi, In other words, for my CFD code, is it simply not possible to parallelize it effectively because the problem is too small?
Is this true for all parallel solvers, or just PETSc? I was hoping to reduce the runtime, since mine is an unsteady problem which requires many steps to reach a periodic state, and it takes many hours to get there. Lastly, if I run on 2 processors, is an improvement likely? Thank you. On 2/11/07, Barry Smith <bsmith at mcs.anl.gov> wrote: > > > > On Sat, 10 Feb 2007, Ben Tay wrote: > > > Hi, > > > > I've repeated the test with n,m = 800. Now serial takes around 11mins > while > > parallel with 4 processors took 6mins. Does it mean that the problem > must be > > pretty large before it is more superior to use parallel? Moreover > 800x800 > > means there's 640000 unknowns. My problem is a 2D CFD code which > typically > > has 200x80=16000 unknowns. Does it mean that I won't be able to benefit > from > ^^^^^^^^^^^ > You'll never get much performance past 2 processors; its not even worth > all the work of having a parallel code in this case. I'd just optimize the > heck out of the serial code. > > Barry > > > > > running in parallel? > > > > Btw, this is the parallel's log_summary: > > > > > > Event Count Time (sec) > > Flops/sec --- Global --- --- Stage --- Total > > Max Ratio Max Ratio Max Ratio Mess Avg len > > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > > ------------------------------------------------------------------------------------------------------------------------ > > > > --- Event Stage 0: Main Stage > > > > MatMult 1265 1.0 7.0615e+01 1.2 3.22e+07 1.2 7.6e+03 6.4e+03 > > 0.0e+00 16 11100100 0 16 11100100 0 103 > > MatSolve 1265 1.0 4.7820e+01 1.2 4.60e+07 1.2 0.0e+00 0.0e+00 > > 0.0e+00 11 11 0 0 0 11 11 0 0 0 152 > > MatLUFactorNum 1 1.0 2.5703e-01 2.3 1.27e+07 2.3 0.0e+00 0.0e+00 > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 22 > > MatILUFactorSym 1 1.0 1.8933e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatAssemblyBegin 1 1.0 4.2153e-01 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > MatAssemblyEnd 1 1.0 3.6475e-01 1.5 0.00e+00 0.0 6.0e+00 3.2e+03 > > 1.3e+01 0 0 0 0 0 0 0 0 0 0 0 > > MatGetOrdering 1 1.0 1.2088e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > VecMDot 1224 1.0 1.5314e+02 1.2 4.63e+07 1.2 0.0e+00 0.0e+00 > > 1.2e+03 36 36 0 0 31 36 36 0 0 31 158 > > VecNorm 1266 1.0 1.0215e+02 1.1 4.31e+06 1.1 0.0e+00 0.0e+00 > > 1.3e+03 24 2 0 0 33 24 2 0 0 33 16 > > VecScale 1265 1.0 3.7467e+00 1.5 8.34e+07 1.5 0.0e+00 0.0e+00 > > 0.0e+00 1 1 0 0 0 1 1 0 0 0 216 > > VecCopy 41 1.0 2.5530e-01 2.8 0.00e+00 0.0 0.0e+00 0.0e+00 > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > VecSet 1308 1.0 3.2717e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > VecAXPY 82 1.0 5.3338e-01 2.8 1.40e+08 2.8 0.0e+00 0.0e+00 > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 197 > > VecMAXPY 1265 1.0 4.6234e+01 1.2 1.74e+08 1.2 0.0e+00 0.0e+00 > > 0.0e+00 10 38 0 0 0 10 38 0 0 0 557 > > VecScatterBegin 1265 1.0 1.5684e-01 1.6 0.00e+00 0.0 7.6e+03 6.4e+03 > > 0.0e+00 0 0100100 0 0 0100100 0 0 > > VecScatterEnd 1265 1.0 4.3167e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > > 0.0e+00 9 0 0 0 0 9 0 0 0 0 0 > > VecNormalize 1265 1.0 1.0459e+02 1.1 6.21e+06 1.1 0.0e+00 0.0e+00 > > 1.3e+03 25 4 0 0 32 25 4 0 0 32 23 > > KSPGMRESOrthog 1224 1.0 1.9035e+02 1.1 7.00e+07 1.1 0.0e+00 0.0e+00 > > 1.2e+03 45 72 0 0 31 45 72 0 0 31 254 > > KSPSetup 2 1.0 5.1674e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > > 1.0e+01 0 0 0 0 0 0 0 0 0 0 0 > > KSPSolve 1 1.0 4.0269e+02 1.0 4.16e+07 1.0 7.6e+03 6.4e+03 > > 3.9e+03 99100100100 99 99100100100 99 166 > >
PCSetUp 2 1.0 4.5924e-01 2.6 8.23e+06 2.6 0.0e+00 0.0e+00 > > 6.0e+00 0 0 0 0 0 0 0 0 0 0 12 > > PCSetUpOnBlocks 1 1.0 4.5847e-01 2.6 8.26e+06 2.6 0.0e+00 0.0e+00 > > 4.0e+00 0 0 0 0 0 0 0 0 0 0 13 > > PCApply 1265 1.0 5.0990e+01 1.2 4.33e+07 1.2 0.0e+00 0.0e+00 > > 1.3e+03 12 11 0 0 32 12 11 0 0 32 143 > > > ------------------------------------------------------------------------------------------------------------------------ > > > > Memory usage is given in bytes: > > > > Object Type Creations Destructions Memory Descendants' > Mem. > > > > --- Event Stage 0: Main Stage > > > > Matrix 4 4 643208 0 > > Index Set 5 5 1924296 0 > > Vec 41 41 47379984 0 > > Vec Scatter 1 1 0 0 > > Krylov Solver 2 2 16880 0 > > Preconditioner 2 2 196 0 > > > ======================================================================================================================== > > Average time to get PetscTime(): 1.00136e-06 > > Average time for MPI_Barrier(): 4.00066e-05 > > Average time for zero size MPI_Send(): 1.70469e-05 > > OptionTable: -log_summary > > Compiled without FORTRAN kernels > > Compiled with full precision matrices (default) > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 > > sizeof(PetscScalar) 8 > > Configure run at: Thu Jan 18 12:23:31 2007 > > Configure options: --with-vendor-compilers=intel --with-x=0 > --with-shared > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 > > --with-mpi-dir=/opt/mpich/myrinet/intel/ > > ----------------------------------------- > > > > > > > > > > > > > > > > On 2/10/07, Ben Tay <zonexo at gmail.com> wrote: > > > > > > Hi, > > > > > > I tried to use ex2f.F as a test code. I've changed the number n,m from > 3 > > > to 500 each. I ran the code using 1 processor and then with 4 > processor. I > > > then repeat the same with the following modification: > > > > > > > > > do i=1,10 > > > > > > call KSPSolve(ksp,b,x,ierr) > > > > > > end do > > > I've added to do loop to make the solving repeat 10 times. > > > > > > In both cases, the serial code is faster, e.g. 1 taking 2.4 min while > the > > > other 3.3 min. 
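For anyone reproducing this kind of timing test: below is a minimal sketch of how the repeated-solve loop quoted above might be combined with a user-defined logging stage, so that -log_summary reports the repeated KSPSolve calls separately from setup and assembly. It is only a fragment, not a complete program: it assumes the usual PETSc Fortran include files, and that ksp, b and x have already been created and configured as in ex2f.F. The stage name and the count of 10 repeats are arbitrary, and the argument order of PetscLogStageRegister has changed between PETSc releases (the call below follows the newer Fortran interface, not necessarily 2.3.2).

      PetscLogStage stage
      PetscInt i
      PetscErrorCode ierr

!     Register a separate logging stage so that -log_summary reports
!     the repeated solves apart from matrix assembly and setup.
      call PetscLogStageRegister('Repeated solves', stage, ierr)

      call PetscLogStagePush(stage, ierr)
      do i = 1, 10
!        Solving the same system repeatedly amortizes one-time costs
!        (e.g. the ILU factorization done on the first solve) in the
!        reported timings.
         call KSPSolve(ksp, b, x, ierr)
      end do
      call PetscLogStagePop(ierr)

With the solves isolated in their own stage it is easier to see whether the time goes into MatMult/MatSolve (actual work) or into VecScatterEnd and the reductions behind VecNorm/VecMDot (communication and synchronization), which is exactly the comparison being made in this thread.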
> > > > > > Here's the log_summary: > > > > > > > > > ---------------------------------------------- PETSc Performance > Summary: > > > ---------------------------------------------- > > > > > > ./ex2f on a linux-mpi named atlas12.nus.edu.sg with 4 processors, by > > > g0306332 Sat Feb 10 16:21:36 2007 > > > Using Petsc Release Version 2.3.2, Patch 8, Tue Jan 2 14:33:59 PST > 2007 > > > HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80 > > > > > > Max Max/Min Avg Total > > > Time (sec): 2.213e+02 1.00051 2.212e+02 > > > Objects: 5.500e+01 1.00000 5.500e+01 > > > Flops: 4.718e+09 1.00019 4.718e+09 1.887e+10 > > > Flops/sec: 2.134e+07 1.00070 2.133e+07 8.531e+07 > > > > > > Memory: 3.186e+07 1.00069 1.274e+08 > > > MPI Messages: 1.832e+03 2.00000 1.374e+03 5.496e+03 > > > MPI Message Lengths: 7.324e+06 2.00000 3.998e+03 2.197e+07 > > > MPI Reductions: 7.112e+02 1.00000 > > > > > > Flop counting convention: 1 flop = 1 real number operation of type > > > (multiply/divide/add/subtract) > > > e.g., VecAXPY() for real vectors of length > N > > > --> 2N flops > > > and VecAXPY() for complex vectors of > length N > > > --> 8N flops > > > > > > Summary of Stages: ----- Time ------ ----- Flops ----- --- > Messages > > > --- -- Message Lengths -- -- Reductions -- > > > Avg %Total Avg %Total counts > > > %Total Avg %Total counts %Total > > > 0: Main Stage: 2.2120e+02 100.0% 1.8871e+10 100.0% 5.496e+03 > > > 100.0% 3.998e+03 100.0% 2.845e+03 100.0% > > > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > See the 'Profiling' chapter of the users' manual for details on > > > interpreting output. > > > Phase summary info: > > > Count: number of times phase was executed > > > Time and Flops/sec: Max - maximum over all processors > > > Ratio - ratio of maximum to minimum over all > > > processors > > > Mess: number of messages sent > > > Avg. len: average message length > > > Reduct: number of global reductions > > > Global: entire computation > > > Stage: stages of a computation. Set stages with PetscLogStagePush() > and > > > PetscLogStagePop(). > > > %T - percent time in this phase %F - percent flops in > this > > > phase > > > %M - percent messages in this phase %L - percent message > lengths > > > in this phase > > > %R - percent reductions in this phase > > > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time > > > over all processors) > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > ########################################################## > > > # # > > > # WARNING!!! # > > > # # > > > # This code was compiled with a debugging option, # > > > # To get timing results run config/configure.py # > > > # using --with-debugging=no, the performance will # > > > # be generally two or three times faster. # > > > # # > > > ########################################################## > > > > > > > > > > > > > > > ########################################################## > > > # # > > > # WARNING!!! # > > > # # > > > # This code was run without the PreLoadBegin() # > > > # macros. To get timing results we always recommend # > > > # preloading. otherwise timing numbers may be # > > > # meaningless. 
# > > > ########################################################## > > > > > > > > > Event Count Time (sec) > > > Flops/sec --- Global --- --- Stage --- > Total > > > Max Ratio Max Ratio Max Ratio Mess Avg > len > > > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > --- Event Stage 0: Main Stage > > > > > > MatMult 915 1.0 4.4291e+01 1.3 1.50e+07 1.3 5.5e+03 > 4.0e+03 > > > 0.0e+00 18 11100100 0 18 11100100 0 46 > > > MatSolve 915 1.0 1.5684e+01 1.1 3.56e+07 1.1 0.0e+00 > 0.0e+00 > > > 0.0e+00 7 11 0 0 0 7 11 0 0 0 131 > > > MatLUFactorNum 1 1.0 5.1654e-02 1.4 1.48e+07 1.4 0.0e+00 > 0.0e+00 > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 43 > > > MatILUFactorSym 1 1.0 1.6838e-02 1.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatAssemblyBegin 1 1.0 3.2428e-01 1.6 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatAssemblyEnd 1 1.0 1.3120e+00 1.1 0.00e+00 0.0 6.0e+00 > 2.0e+03 > > > 1.3e+01 1 0 0 0 0 1 0 0 0 0 0 > > > MatGetOrdering 1 1.0 4.1590e-03 1.2 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecMDot 885 1.0 8.5091e+01 1.1 2.27e+07 1.1 0.0e+00 > 0.0e+00 > > > 8.8e+02 36 36 0 0 31 36 36 0 0 31 80 > > > VecNorm 916 1.0 6.6747e+01 1.1 1.81e+06 1.1 0.0e+00 > 0.0e+00 > > > 9.2e+02 29 2 0 0 32 29 2 0 0 32 7 > > > VecScale 915 1.0 1.1430e+00 2.2 1.12e+08 2.2 0.0e+00 > 0.0e+00 > > > 0.0e+00 0 1 0 0 0 0 1 0 0 0 200 > > > VecCopy 30 1.0 1.2816e-01 5.7 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecSet 947 1.0 7.8979e-01 1.3 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecAXPY 60 1.0 5.5332e-02 1.1 1.51e+08 1.1 0.0e+00 > 0.0e+00 > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 542 > > > VecMAXPY 915 1.0 1.5004e+01 1.3 1.54e+08 1.3 0.0e+00 > 0.0e+00 > > > 0.0e+00 6 38 0 0 0 6 38 0 0 0 483 > > > VecScatterBegin 915 1.0 9.0358e-02 1.4 0.00e+00 0.0 5.5e+03 > 4.0e+03 > > > 0.0e+00 0 0100100 0 0 0100100 0 0 > > > VecScatterEnd 915 1.0 3.5136e+01 1.4 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 0.0e+00 14 0 0 0 0 14 0 0 0 0 0 > > > VecNormalize 915 1.0 6.7272e+01 1.0 2.68e+06 1.0 0.0e+00 > 0.0e+00 > > > 9.2e+02 30 4 0 0 32 30 4 0 0 32 10 > > > KSPGMRESOrthog 885 1.0 9.8478e+01 1.1 3.87e+07 1.1 0.0e+00 > 0.0e+00 > > > 8.8e+02 42 72 0 0 31 42 72 0 0 31 138 > > > KSPSetup 2 1.0 6.1918e-01 1.2 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > 1.0e+01 0 0 0 0 0 0 0 0 0 0 0 > > > KSPSolve 1 1.0 2.1892e+02 1.0 2.15e+07 1.0 5.5e+03 > 4.0e+03 > > > 2.8e+03 99100100100 99 99100100100 99 86 > > > PCSetUp 2 1.0 7.3292e-02 1.3 9.84e+06 1.3 0.0e+00 > 0.0e+00 > > > 6.0e+00 0 0 0 0 0 0 0 0 0 0 30 > > > PCSetUpOnBlocks 1 1.0 7.2706e-02 1.3 9.97e+06 1.3 0.0e+00 > 0.0e+00 > > > 4.0e+00 0 0 0 0 0 0 0 0 0 0 31 > > > PCApply 915 1.0 1.6508e+01 1.1 3.27e+07 1.1 0.0e+00 > 0.0e+00 > > > 9.2e+02 7 11 0 0 32 7 11 0 0 32 124 > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > > > > Memory usage is given in bytes: > > > > > > Object Type Creations Destructions Memory Descendants' > Mem. 
> > > > > > --- Event Stage 0: Main Stage > > > > > > Matrix 4 4 252008 0 > > > Index Set 5 5 753096 0 > > > Vec 41 41 18519984 0 > > > Vec Scatter 1 1 0 0 > > > Krylov Solver 2 2 16880 0 > > > Preconditioner 2 2 196 0 > > > > ======================================================================================================================== > > > > > > Average time to get PetscTime(): 1.09673e-06 > > > Average time for MPI_Barrier(): 4.18186e-05 > > > Average time for zero size MPI_Send(): 2.62856e-05 > > > OptionTable: -log_summary > > > Compiled without FORTRAN kernels > > > Compiled with full precision matrices (default) > > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4 > > > sizeof(PetscScalar) 8 > > > Configure run at: Thu Jan 18 12:23:31 2007 > > > Configure options: --with-vendor-compilers=intel --with-x=0 > --with-shared > > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 > > > --with-mpi-dir=/opt/mpich/myrinet/intel/ > > > ----------------------------------------- > > > Libraries compiled on Thu Jan 18 12:24:41 SGT 2007 on > atlas1.nus.edu.sg > > > Machine characteristics: Linux atlas1.nus.edu.sg 2.4.21-20.ELsmp #1 > SMP > > > Wed Sep 8 17:29:34 GMT 2004 i686 i686 i386 GNU/Linux > > > Using PETSc directory: /nas/lsftmp/g0306332/petsc-2.3.2-p8 > > > Using PETSc arch: linux-mpif90 > > > ----------------------------------------- > > > Using C compiler: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g > > > Using Fortran compiler: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC > -g > > > -w90 -w > > > ----------------------------------------- > > > Using include paths: -I/nas/lsftmp/g0306332/petsc- > > > 2.3.2-p8-I/nas/lsftmp/g0306332/petsc- > > > 2.3.2-p8/bmake/linux-mpif90 -I/nas/lsftmp/g0306332/petsc-2.3.2-p8 > /include > > > -I/opt/mpich/myrinet/intel/include > > > ------------------------------------------ > > > Using C linker: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g > > > Using Fortran linker: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g > > > -w90 -w > > > Using libraries: > > > -Wl,-rpath,/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 > > > -L/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -lpetscts > > > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > > > -Wl,-rpath,/lsftmp/g0306332/inter/mkl/lib/32 > > > -L/lsftmp/g0306332/inter/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide > > > -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib > > > -Wl,-rpath,/opt/mpich/myrinet/intel/lib -L/opt/mpich/myrinet/intel/lib > > > -Wl,-rpath,-rpath -Wl,-rpath,-ldl -L-ldl -lmpich -Wl,-rpath,-L -lgm > > > -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib > -L/opt/intel/compiler70/ia32/lib > > > -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts > -lcxa > > > -lunwind -ldl -lmpichf90 -Wl,-rpath,/opt/gm/lib -L/opt/gm/lib > -lPEPCF90 > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib > -L/opt/intel/compiler70/ia32/lib > > > -Wl,-rpath,/usr/lib -L/usr/lib -lintrins -lIEPCF90 -lF90 > -lm -Wl,-rpath,\ > > > -Wl,-rpath,\ -L\ -ldl -lmpich -Wl,-rpath,\ -L\ -lgm -lpthread > > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib > -L/opt/intel/compiler70/ia32/lib > > > -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl > > > ------------------------------------------ > > > > > > So is there something wrong with the server's mpi implementation? > > > > > > Thank you. 
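As an aside on the timings themselves: the warning banner in the output above notes that this PETSc build has debugging enabled, and PETSc itself recommends reconfiguring with --with-debugging=no before trusting any timing numbers. A plausible reconfigure command, simply reusing the configure options reported in this log (the paths are specific to this machine) with the extra flag appended, usually under a different PETSC_ARCH so the debug build is kept, would look something like:

  config/configure.py --with-vendor-compilers=intel --with-x=0 --with-shared --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32 --with-mpi-dir=/opt/mpich/myrinet/intel/ --with-debugging=no

The banner states that an optimized build is generally two to three times faster, so the serial-versus-parallel comparison is worth repeating after rebuilding.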
> > > > > > > > > > > > On 2/10/07, Satish Balay <balay at mcs.anl.gov> wrote: > > > > > > > > Looks like MatMult = 24sec Out of this the scatter time is: 22sec. > > > > Either something is wrong with your run - or MPI is really broken.. > > > > > > > > Satish > > > > > > > > > > > MatMult 3927 1.0 2.4071e+01 1.3 6.14e+06 1.4 > 2.4e+04 > > > > 1.3e+03 > > > > > > > VecScatterBegin 3927 1.0 2.8672e-01 3.9 0.00e+00 0.0 > 2.4e+04 > > > > 1.3e+03 > > > > > > > VecScatterEnd 3927 1.0 2.2135e+01 1.5 0.00e+00 0.0 > 0.0e+00 > > > > 0.0e+00 > > > > > > > > > > > > > > >
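Satish's numbers above show that almost all of the MatMult time is spent in VecScatterEnd, i.e. waiting on communication. One way to check whether the MPI/interconnect installation itself is at fault, independently of PETSc, is a two-rank ping-pong test. The sketch below is not from this thread and makes a few arbitrary choices: it exchanges 800 doubles (6,400 bytes, roughly the average message length in the 4-process log above) 1,000 times and prints the mean round-trip time; it could be compiled with the same mpif90 wrapper and run on 2 processes placed on two different nodes.

      program pingpong
!     Minimal two-rank MPI ping-pong to measure point-to-point
!     round-trip time for messages of about the size PETSc is sending.
      implicit none
      include 'mpif.h'
      integer nreps, nwords
      parameter (nreps = 1000, nwords = 800)
      integer ierr, rank, nprocs, i, status(MPI_STATUS_SIZE)
      double precision buf(nwords), t0, t1

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      if (nprocs .lt. 2) then
         write(*,*) 'needs at least 2 MPI processes'
         call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
      end if

      do i = 1, nwords
         buf(i) = 0.0d0
      end do

      call MPI_Barrier(MPI_COMM_WORLD, ierr)
      t0 = MPI_Wtime()
      do i = 1, nreps
         if (rank .eq. 0) then
            call MPI_Send(buf, nwords, MPI_DOUBLE_PRECISION, 1, 0,
     &                    MPI_COMM_WORLD, ierr)
            call MPI_Recv(buf, nwords, MPI_DOUBLE_PRECISION, 1, 0,
     &                    MPI_COMM_WORLD, status, ierr)
         else if (rank .eq. 1) then
            call MPI_Recv(buf, nwords, MPI_DOUBLE_PRECISION, 0, 0,
     &                    MPI_COMM_WORLD, status, ierr)
            call MPI_Send(buf, nwords, MPI_DOUBLE_PRECISION, 0, 0,
     &                    MPI_COMM_WORLD, ierr)
         end if
      end do
      t1 = MPI_Wtime()

      if (rank .eq. 0) then
         write(*,*) 'mean round-trip time (s):', (t1 - t0) / nreps
      end if

      call MPI_Finalize(ierr)
      end

If the reported round-trip time is orders of magnitude larger than the ~4e-5 s MPI_Barrier and ~2e-5 s zero-size MPI_Send times shown in the log_summary output, or varies wildly from run to run, then the Myrinet/MPICH installation or the job placement is the likely culprit rather than PETSc.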