Hi,

In other words, is it not possible to parallelize my CFD code effectively
because the problem is too small?

Is this true for all parallel solvers, or just PETSc? I was hoping to reduce
the runtime, since mine is an unsteady problem that requires many time steps
to reach a periodic state, and it currently takes many hours to get there.

Lastly, if I run on 2 processors, is an improvement likely?
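
To check this concretely, I am thinking of timing just the repeated solves
for my real 200x80 case (similar to the ex2f.F modification quoted below),
so that setup and assembly don't muddy the 1 vs. 2 processor comparison.
A minimal sketch of what I have in mind (untested; "rank", and the t0/t1
declarations, are my own additions and assume MPI_Comm_rank was called
earlier):

      double precision t0, t1

!     minimal sketch, untested: time only the repeated solves, so setup
!     and assembly are excluded from the 1 vs. 2 processor comparison.
!     Assumes "rank" was set earlier with MPI_Comm_rank and that the
!     PETSc includes already pull in mpif.h for MPI_Wtime.
      t0 = MPI_Wtime()
      do i = 1, 10
         call KSPSolve(ksp,b,x,ierr)
      end do
      t1 = MPI_Wtime()
      if (rank .eq. 0) write(*,*) 'time for 10 solves (s):', t1 - t0

If the 2-processor time is not clearly below the serial time for this case,
I suppose that would confirm the problem is simply too small to benefit.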

Thank you.


On 2/11/07, Barry Smith <bsmith at mcs.anl.gov> wrote:
>
>
>
> On Sat, 10 Feb 2007, Ben Tay wrote:
>
> > Hi,
> >
> > I've repeated the test with n,m = 800. Now serial takes around 11mins
> while
> > parallel with 4 processors took 6mins. Does it mean that the problem
> must be
> > pretty large before it is more superior to use parallel?  Moreover
> 800x800
> > means there's 640000 unknowns. My problem is a 2D CFD code which
> typically
> > has 200x80=16000 unknowns. Does it mean that I won't be able to benefit
> from
>      ^^^^^^^^^^^
> You'll never get much performance past 2 processors; it's not even worth
> all the work of having a parallel code in this case. I'd just optimize the
> heck out of the serial code.
>
>   Barry
>
>
>
> > running in parallel?
> >
> > Btw, this is the parallel's log_summary:
> >
> >
> > Event                Count      Time (sec)
> > Flops/sec                         --- Global ---  --- Stage ---   Total
> >                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len
> > Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> >
> ------------------------------------------------------------------------------------------------------------------------
> >
> > --- Event Stage 0: Main Stage
> >
> > MatMult             1265 1.0 7.0615e+01 1.2 3.22e+07 1.2 7.6e+03 6.4e+03
> > 0.0e+00 16 11100100  0  16 11100100  0   103
> > MatSolve            1265 1.0 4.7820e+01 1.2 4.60e+07 1.2 0.0e+00 0.0e+00
> > 0.0e+00 11 11  0  0  0  11 11  0  0  0   152
> > MatLUFactorNum         1 1.0 2.5703e-01 2.3 1.27e+07 2.3 0.0e+00 0.0e+00
> > 0.0e+00  0  0  0  0  0   0  0  0  0  0    22
> > MatILUFactorSym        1 1.0 1.8933e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00
> > 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatAssemblyBegin       1 1.0 4.2153e-01 3.5 0.00e+00 0.0 0.0e+00 0.0e+00
> > 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > MatAssemblyEnd         1 1.0 3.6475e-01 1.5 0.00e+00 0.0 6.0e+00 3.2e+03
> > 1.3e+01  0  0  0  0  0   0  0  0  0  0     0
> > MatGetOrdering         1 1.0 1.2088e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> > 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecMDot             1224 1.0 1.5314e+02 1.2 4.63e+07 1.2 0.0e+00 0.0e+00
> > 1.2e+03 36 36  0  0 31  36 36  0  0 31   158
> > VecNorm             1266 1.0 1.0215e+02 1.1 4.31e+06 1.1 0.0e+00 0.0e+00
> > 1.3e+03 24  2  0  0 33  24  2  0  0 33    16
> > VecScale            1265 1.0 3.7467e+00 1.5 8.34e+07 1.5 0.0e+00 0.0e+00
> > 0.0e+00  1  1  0  0  0   1  1  0  0  0   216
> > VecCopy               41 1.0 2.5530e-01 2.8 0.00e+00 0.0 0.0e+00 0.0e+00
> > 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > VecSet              1308 1.0 3.2717e+00 1.4 0.00e+00 0.0 0.0e+00 0.0e+00
> > 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
> > VecAXPY               82 1.0 5.3338e-01 2.8 1.40e+08 2.8 0.0e+00 0.0e+00
> > 0.0e+00  0  0  0  0  0   0  0  0  0  0   197
> > VecMAXPY            1265 1.0 4.6234e+01 1.2 1.74e+08 1.2 0.0e+00 0.0e+00
> > 0.0e+00 10 38  0  0  0  10 38  0  0  0   557
> > VecScatterBegin     1265 1.0 1.5684e-01 1.6 0.00e+00 0.0 7.6e+03 6.4e+03
> > 0.0e+00  0  0100100  0   0  0100100  0     0
> > VecScatterEnd       1265 1.0 4.3167e+01 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
> > 0.0e+00  9  0  0  0  0   9  0  0  0  0     0
> > VecNormalize        1265 1.0 1.0459e+02 1.1 6.21e+06 1.1 0.0e+00 0.0e+00
> > 1.3e+03 25  4  0  0 32  25  4  0  0 32    23
> > KSPGMRESOrthog      1224 1.0 1.9035e+02 1.1 7.00e+07 1.1 0.0e+00 0.0e+00
> > 1.2e+03 45 72  0  0 31  45 72  0  0 31   254
> > KSPSetup               2 1.0 5.1674e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00
> > 1.0e+01  0  0  0  0  0   0  0  0  0  0     0
> > KSPSolve               1 1.0 4.0269e+02 1.0 4.16e+07 1.0 7.6e+03 6.4e+03
> > 3.9e+03 99100100100 99  99100100100 99   166
> > PCSetUp                2 1.0 4.5924e-01 2.6 8.23e+06 2.6 0.0e+00 0.0e+00
> > 6.0e+00  0  0  0  0  0   0  0  0  0  0    12
> > PCSetUpOnBlocks        1 1.0 4.5847e-01 2.6 8.26e+06 2.6 0.0e+00 0.0e+00
> > 4.0e+00  0  0  0  0  0   0  0  0  0  0    13
> > PCApply             1265 1.0 5.0990e+01 1.2 4.33e+07 1.2 0.0e+00 0.0e+00
> > 1.3e+03 12 11  0  0 32  12 11  0  0 32   143
> >
> ------------------------------------------------------------------------------------------------------------------------
> >
> > Memory usage is given in bytes:
> >
> > Object Type          Creations   Destructions   Memory  Descendants'
> Mem.
> >
> > --- Event Stage 0: Main Stage
> >
> >              Matrix     4              4     643208     0
> >           Index Set     5              5    1924296     0
> >                 Vec    41             41   47379984     0
> >         Vec Scatter     1              1          0     0
> >       Krylov Solver     2              2      16880     0
> >      Preconditioner     2              2        196     0
> >
> ========================================================================================================================
> > Average time to get PetscTime(): 1.00136e-06
> > Average time for MPI_Barrier(): 4.00066e-05
> > Average time for zero size MPI_Send(): 1.70469e-05
> > OptionTable: -log_summary
> > Compiled without FORTRAN kernels
> > Compiled with full precision matrices (default)
> > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4
> > sizeof(PetscScalar) 8
> > Configure run at: Thu Jan 18 12:23:31 2007
> > Configure options: --with-vendor-compilers=intel --with-x=0
> --with-shared
> > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32
> > --with-mpi-dir=/opt/mpich/myrinet/intel/
> > -----------------------------------------
> >
> >
> >
> >
> >
> >
> >
> > On 2/10/07, Ben Tay <zonexo at gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > I tried to use ex2f.F as a test code. I've changed the number n,m from
> 3
> > > to 500 each. I ran the code using 1 processor and then with 4
> processor. I
> > > then repeat the same with the following modification:
> > >
> > >
> > > do i=1,10
> > >
> > >       call KSPSolve(ksp,b,x,ierr)
> > >
> > > end do
> > > I've added the do loop to make the solving repeat 10 times.
> > >
> > > In both cases, the serial code is faster, e.g. 1 taking 2.4 min while
> the
> > > other 3.3 min.
> > >
> > > Here's the log_summary:
> > >
> > >
> > > ---------------------------------------------- PETSc Performance
> Summary:
> > > ----------------------------------------------
> > >
> > > ./ex2f on a linux-mpi named atlas12.nus.edu.sg with 4 processors, by
> > > g0306332 Sat Feb 10 16:21:36 2007
> > > Using Petsc Release Version 2.3.2, Patch 8, Tue Jan  2 14:33:59 PST
> 2007
> > > HG revision: ebeddcedcc065e32fc252af32cf1d01ed4fc7a80
> > >
> > >                          Max       Max/Min        Avg      Total
> > > Time (sec):           2.213e+02      1.00051   2.212e+02
> > > Objects:              5.500e+01      1.00000   5.500e+01
> > > Flops:                4.718e+09      1.00019   4.718e+09  1.887e+10
> > > Flops/sec:            2.134e+07       1.00070   2.133e+07  8.531e+07
> > >
> > > Memory:               3.186e+07      1.00069              1.274e+08
> > > MPI Messages:         1.832e+03      2.00000   1.374e+03  5.496e+03
> > > MPI Message Lengths:  7.324e+06       2.00000   3.998e+03  2.197e+07
> > > MPI Reductions:       7.112e+02      1.00000
> > >
> > > Flop counting convention: 1 flop = 1 real number operation of type
> > > (multiply/divide/add/subtract)
> > >                             e.g., VecAXPY() for real vectors of length
> N
> > > --> 2N flops
> > >                             and VecAXPY() for complex vectors of
> length N
> > > --> 8N flops
> > >
> > > Summary of Stages:   ----- Time ------  ----- Flops -----  ---
> Messages
> > > ---  -- Message Lengths --  -- Reductions --
> > >                         Avg     %Total     Avg     %Total   counts
> > > %Total     Avg         %Total   counts   %Total
> > >  0:      Main Stage: 2.2120e+02 100.0%  1.8871e+10 100.0%  5.496e+03
> > > 100.0%  3.998e+03      100.0%  2.845e+03 100.0%
> > >
> > >
> > >
> > >
> ------------------------------------------------------------------------------------------------------------------------
> > > See the 'Profiling' chapter of the users' manual for details on
> > > interpreting output.
> > > Phase summary info:
> > >    Count: number of times phase was executed
> > >    Time and Flops/sec: Max - maximum over all processors
> > >                        Ratio - ratio of maximum to minimum over all
> > > processors
> > >    Mess: number of messages sent
> > >    Avg. len: average message length
> > >    Reduct: number of global reductions
> > >    Global: entire computation
> > >    Stage: stages of a computation. Set stages with PetscLogStagePush()
> and
> > > PetscLogStagePop().
> > >       %T - percent time in this phase         %F - percent flops in
> this
> > > phase
> > >       %M - percent messages in this phase     %L - percent message
> lengths
> > > in this phase
> > >       %R - percent reductions in this phase
> > >    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time
> > > over all processors)
> > >
> > >
> > >
> ------------------------------------------------------------------------------------------------------------------------
> > >
> > >       ##########################################################
> > >       #                                                        #
> > >       #                          WARNING!!!                    #
> > >       #                                                        #
> > >       #   This code was compiled with a debugging option,      #
> > >       #   To get timing results run config/configure.py        #
> > >       #   using --with-debugging=no, the performance will      #
> > >       #   be generally two or three times faster.              #
> > >       #                                                        #
> > >       ##########################################################
> > >
> > >
> > >
> > >
> > >       ##########################################################
> > >       #                                                        #
> > >       #                          WARNING!!!                    #
> > >       #                                                        #
> > >       #   This code was run without the PreLoadBegin()         #
> > >       #   macros. To get timing results we always recommend    #
> > >       #   preloading. otherwise timing numbers may be          #
> > >       #   meaningless.                                         #
> > >       ##########################################################
> > >
> > >
> > > Event                Count      Time (sec)
> > > Flops/sec                         --- Global ---  --- Stage ---
> Total
> > >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg
> len
> > > Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> > >
> > >
> > >
> ------------------------------------------------------------------------------------------------------------------------
> > >
> > > --- Event Stage 0: Main Stage
> > >
> > > MatMult              915 1.0 4.4291e+01 1.3 1.50e+07 1.3 5.5e+03
> 4.0e+03
> > > 0.0e+00 18 11100100  0  18 11100100  0    46
> > > MatSolve             915 1.0 1.5684e+01 1.1 3.56e+07 1.1 0.0e+00
> 0.0e+00
> > > 0.0e+00  7 11  0  0  0   7 11  0  0  0   131
> > > MatLUFactorNum         1 1.0 5.1654e-02 1.4 1.48e+07 1.4 0.0e+00
> 0.0e+00
> > > 0.0e+00  0  0  0  0  0   0  0  0  0  0    43
> > > MatILUFactorSym        1 1.0 1.6838e-02 1.1 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > MatAssemblyBegin       1 1.0 3.2428e-01 1.6 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > MatAssemblyEnd         1 1.0 1.3120e+00 1.1 0.00e+00 0.0 6.0e+00
> 2.0e+03
> > > 1.3e+01  1  0  0  0  0   1  0  0  0  0     0
> > > MatGetOrdering         1 1.0 4.1590e-03 1.2 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > VecMDot              885 1.0 8.5091e+01 1.1 2.27e+07 1.1 0.0e+00
> 0.0e+00
> > > 8.8e+02 36 36  0  0 31  36 36  0  0 31    80
> > > VecNorm              916 1.0 6.6747e+01 1.1 1.81e+06 1.1 0.0e+00
> 0.0e+00
> > > 9.2e+02 29  2  0  0 32  29  2  0  0 32     7
> > > VecScale             915 1.0 1.1430e+00 2.2 1.12e+08 2.2 0.0e+00
> 0.0e+00
> > > 0.0e+00  0  1  0  0  0   0  1  0  0  0   200
> > > VecCopy               30 1.0 1.2816e-01 5.7 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > VecSet               947 1.0 7.8979e-01 1.3 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> > > VecAXPY               60 1.0 5.5332e-02 1.1 1.51e+08 1.1 0.0e+00
> 0.0e+00
> > > 0.0e+00  0  0  0  0  0   0  0  0  0  0   542
> > > VecMAXPY             915 1.0 1.5004e+01 1.3 1.54e+08 1.3 0.0e+00
> 0.0e+00
> > > 0.0e+00  6 38  0  0  0   6 38  0  0  0   483
> > > VecScatterBegin      915 1.0 9.0358e-02 1.4 0.00e+00 0.0 5.5e+03
> 4.0e+03
> > > 0.0e+00  0  0100100  0   0  0100100  0     0
> > > VecScatterEnd        915 1.0 3.5136e+01 1.4 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > 0.0e+00 14  0  0  0  0  14  0  0  0  0     0
> > > VecNormalize         915 1.0 6.7272e+01 1.0 2.68e+06 1.0 0.0e+00
> 0.0e+00
> > > 9.2e+02 30  4  0  0 32  30  4  0  0 32    10
> > > KSPGMRESOrthog       885 1.0 9.8478e+01 1.1 3.87e+07 1.1 0.0e+00
> 0.0e+00
> > > 8.8e+02 42 72  0  0 31  42 72  0  0 31   138
> > > KSPSetup               2 1.0 6.1918e-01 1.2 0.00e+00 0.0 0.0e+00
> 0.0e+00
> > > 1.0e+01  0  0  0  0  0   0  0  0  0  0     0
> > > KSPSolve               1 1.0 2.1892e+02 1.0 2.15e+07 1.0 5.5e+03
> 4.0e+03
> > > 2.8e+03 99100100100 99  99100100100 99    86
> > > PCSetUp                2 1.0 7.3292e-02 1.3 9.84e+06 1.3 0.0e+00
> 0.0e+00
> > > 6.0e+00  0  0  0  0  0   0  0  0  0  0    30
> > > PCSetUpOnBlocks        1 1.0 7.2706e-02 1.3 9.97e+06 1.3 0.0e+00
> 0.0e+00
> > > 4.0e+00  0  0  0  0  0   0  0  0  0  0    31
> > > PCApply              915 1.0 1.6508e+01 1.1 3.27e+07 1.1 0.0e+00
> 0.0e+00
> > > 9.2e+02  7 11  0  0 32   7 11  0  0 32   124
> > >
> > >
> ------------------------------------------------------------------------------------------------------------------------
> > >
> > >
> > > Memory usage is given in bytes:
> > >
> > > Object Type          Creations   Destructions   Memory  Descendants'
> Mem.
> > >
> > > --- Event Stage 0: Main Stage
> > >
> > >               Matrix     4              4     252008     0
> > >            Index Set     5              5     753096     0
> > >                  Vec    41             41   18519984     0
> > >          Vec Scatter     1              1          0     0
> > >        Krylov Solver     2              2      16880     0
> > >       Preconditioner     2              2        196     0
> > >
> ========================================================================================================================
> > >
> > > Average time to get PetscTime(): 1.09673e-06
> > > Average time for MPI_Barrier(): 4.18186e-05
> > > Average time for zero size MPI_Send(): 2.62856e-05
> > > OptionTable: -log_summary
> > > Compiled without FORTRAN kernels
> > > Compiled with full precision matrices (default)
> > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 4 sizeof(void*) 4
> > > sizeof(PetscScalar) 8
> > > Configure run at: Thu Jan 18 12:23:31 2007
> > > Configure options: --with-vendor-compilers=intel --with-x=0
> --with-shared
> > > --with-blas-lapack-dir=/lsftmp/g0306332/inter/mkl/lib/32
> > > --with-mpi-dir=/opt/mpich/myrinet/intel/
> > > -----------------------------------------
> > > Libraries compiled on Thu Jan 18 12:24:41 SGT 2007 on
> atlas1.nus.edu.sg
> > > Machine characteristics: Linux atlas1.nus.edu.sg 2.4.21-20.ELsmp #1
> SMP
> > > Wed Sep 8 17:29:34 GMT 2004 i686 i686 i386 GNU/Linux
> > > Using PETSc directory: /nas/lsftmp/g0306332/petsc-2.3.2-p8
> > > Using PETSc arch: linux-mpif90
> > > -----------------------------------------
> > > Using C compiler: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g
> > > Using Fortran compiler: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC
> -g
> > > -w90 -w
> > > -----------------------------------------
> > > Using include paths: -I/nas/lsftmp/g0306332/petsc-
> > > 2.3.2-p8-I/nas/lsftmp/g0306332/petsc-
> > > 2.3.2-p8/bmake/linux-mpif90 -I/nas/lsftmp/g0306332/petsc-2.3.2-p8
> /include
> > > -I/opt/mpich/myrinet/intel/include
> > > ------------------------------------------
> > > Using C linker: /opt/mpich/myrinet/intel/bin/mpicc -fPIC -g
> > > Using Fortran linker: /opt/mpich/myrinet/intel/bin/mpif90 -I. -fPIC -g
> > > -w90 -w
> > > Using libraries:
> > > -Wl,-rpath,/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90
> > > -L/nas/lsftmp/g0306332/petsc-2.3.2-p8/lib/linux-mpif90 -lpetscts
> > > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc
> > > -Wl,-rpath,/lsftmp/g0306332/inter/mkl/lib/32
> > > -L/lsftmp/g0306332/inter/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide
> > > -lPEPCF90 -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > > -Wl,-rpath,/opt/mpich/myrinet/intel/lib -L/opt/mpich/myrinet/intel/lib
> > > -Wl,-rpath,-rpath -Wl,-rpath,-ldl -L-ldl -lmpich -Wl,-rpath,-L -lgm
> > > -lpthread -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> -L/opt/intel/compiler70/ia32/lib
> > > -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts
> -lcxa
> > > -lunwind -ldl -lmpichf90 -Wl,-rpath,/opt/gm/lib -L/opt/gm/lib
> -lPEPCF90
> > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> -L/opt/intel/compiler70/ia32/lib
> > > -Wl,-rpath,/usr/lib -L/usr/lib -lintrins -lIEPCF90 -lF90
> -lm  -Wl,-rpath,\
> > > -Wl,-rpath,\ -L\ -ldl -lmpich -Wl,-rpath,\ -L\ -lgm -lpthread
> > > -Wl,-rpath,/opt/intel/compiler70/ia32/lib
> -L/opt/intel/compiler70/ia32/lib
> > > -Wl,-rpath,/usr/lib -L/usr/lib -limf -lirc -lcprts -lcxa -lunwind -ldl
> > > ------------------------------------------
> > >
> > >  So is there something wrong with the server's mpi implementation?
> > >
> > > Thank you.
> > >
> > >
> > >
> > > On 2/10/07, Satish Balay <balay at mcs.anl.gov> wrote:
> > > >
> > > > Looks like MatMult = 24sec Out of this the scatter time is: 22sec.
> > > > Either something is wrong with your run - or MPI is really broken..
> > > >
> > > > Satish
> > > >
> > > > > > > MatMult             3927 1.0 2.4071e+01 1.3 6.14e+06 1.4
> 2.4e+04
> > > > 1.3e+03
> > > > > > > VecScatterBegin     3927 1.0 2.8672e-01 3.9 0.00e+00 0.0
> 2.4e+04
> > > > 1.3e+03
> > > > > > > VecScatterEnd       3927 1.0 2.2135e+01 1.5 0.00e+00 0.0
> 0.0e+00
> > > > 0.0e+00
> > > >
> > > >
> > >
> >
>
>