The convergence here is just horrendous. Have you tried using LU to check your implementation? All the time is in the solve right now. I would first try a direct method (at least on a small problem) and then try to understand the convergence behavior. MUMPS can actually scale very well for large problems.
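For example, something along these lines (an untested sketch; option names are from the 2.3.3-era manual, and the MUMPS line assumes a PETSc build configured with MUMPS, which your configure options do not show):

   # serial: replace GMRES/ILU with a direct LU solve to check the assembled system
   ./a.out -ksp_type preonly -pc_type lu -log_summary

   # parallel: direct solve through MUMPS, if it is available in your build
   mpirun -np 2 ./a.out -mat_type aijmumps -ksp_type preonly -pc_type lu -log_summary

If the direct solve gives the answer you expect in a few seconds, the matrix assembly is fine and the problem is the GMRES/ILU convergence, not the code.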
Matt On Tue, Apr 15, 2008 at 11:44 AM, Ben Tay <zonexo at gmail.com> wrote: > Hi, > > Here's the summary for 1 processor. Seems like it's also using a long > time... Can someone tell me when my mistakes possibly lie? Thank you very > much! > > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed > Apr 16 00:39:22 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.088e+03 1.00000 1.088e+03 > Objects: 4.300e+01 1.00000 4.300e+01 > Flops: 2.658e+11 1.00000 2.658e+11 2.658e+11 > Flops/sec: 2.444e+08 1.00000 2.444e+08 2.444e+08 > MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Reductions: 1.460e+04 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N --> > 2N flops > and VecAXPY() for complex vectors of length N --> > 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 1.0877e+03 100.0% 2.6584e+11 100.0% 0.000e+00 0.0% > 0.000e+00 0.0% 1.460e+04 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths in > this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > # preloading. otherwise timing numbers may be # > # meaningless. 
# > ########################################################## > > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 7412 1.0 1.3344e+02 1.0 2.16e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 12 11 0 0 0 12 11 0 0 0 216 > MatSolve 7413 1.0 2.6851e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 107 > MatLUFactorNum 1 1.0 4.3947e-02 1.0 8.83e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 88 > MatILUFactorSym 1 1.0 3.7798e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 2.5835e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 6.0391e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 1 1.0 1.7377e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPGMRESOrthog 7173 1.0 5.6323e+02 1.0 3.41e+08 1.0 0.0e+00 0.0e+00 > 7.2e+03 52 72 0 0 49 52 72 0 0 49 341 > KSPSetup 1 1.0 1.2676e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.0144e+03 1.0 2.62e+08 1.0 0.0e+00 0.0e+00 > 1.5e+04 93100 0 0100 93100 0 0100 262 > PCSetUp 1 1.0 8.7809e-02 1.0 4.42e+07 1.0 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 44 > PCApply 7413 1.0 2.6853e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 107 > VecMDot 7173 1.0 2.6720e+02 1.0 3.59e+08 1.0 0.0e+00 0.0e+00 > 7.2e+03 25 36 0 0 49 25 36 0 0 49 359 > VecNorm 7413 1.0 1.7125e+01 1.0 3.74e+08 1.0 0.0e+00 0.0e+00 > 7.4e+03 2 2 0 0 51 2 2 0 0 51 374 > VecScale 7413 1.0 9.2787e+00 1.0 3.45e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 345 > VecCopy 240 1.0 5.1628e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 241 1.0 6.4428e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 479 1.0 2.0082e+00 1.0 2.06e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 206 > VecMAXPY 7413 1.0 3.1536e+02 1.0 3.24e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 29 38 0 0 0 29 38 0 0 0 324 > VecAssemblyBegin 2 1.0 2.3127e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAssemblyEnd 2 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecNormalize 7413 1.0 2.6424e+01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00 > 7.4e+03 2 4 0 0 51 2 4 0 0 51 364 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 2 2 65632332 0 > Krylov Solver 1 1 17216 0 > Preconditioner 1 1 168 0 > Index Set 3 3 5185032 0 > Vec 36 36 120987640 0 > > ======================================================================================================================== > Average time to get PetscTime(): 3.09944e-07 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 Configure > options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 > --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 > --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 > --with-vendor-compilers=intel --with-x=0 > --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 > --with-batch=1 --with-mpi-shared=0 > --with-mpi-include=/usr/local/topspin/mpi/mpich/include > --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a > --with-mpirun=/usr/local/topspin/mpi/mpich/bi > n/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t > --with-shared=0 ----------------------------------------- > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 > 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > Using PETSc arch: atlas3-mpi > ----------------------------------------- > Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. -fPIC > -O ----------------------------------------- > Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - > I/home/enduser/g0306332/lib/hypre/include > -I/usr/local/topspin/mpi/mpich/include > ------------------------------------------ > Using C linker: mpicc -fPIC -O > Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: > -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi > -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib > -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib > -L/usr/local/topspin/mpi/mpich/lib -lmpich > -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t > -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 > -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt > -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport > -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl > -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs > -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc > ------------------------------------------ > 639.52user 4.80system 18:08.23elapsed 59%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (20major+172979minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > TID HOST_NAME COMMAND_LINE STATUS > TERMINATION_TIME > ===== ========== ================ ======================= > =================== > 00000 atlas3-c45 time ./a.out -lo Done 04/16/2008 > 00:39:23 > > > Barry Smith 
wrote: > > > > > It is taking 8776 iterations of GMRES! How many does it take on one > process? This is a huge > > amount. > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 > 0.0e+00 10 11100100 0 10 11100100 0 217 > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > > > One process is spending 2.9 times as long in the embarresingly parallel > MatSolve then the other process; > > this indicates a huge imbalance in the number of nonzeros on each process. > As Matt noticed, the partitioning > > between the two processes is terrible. > > > > Barry > > > > On Apr 15, 2008, at 10:56 AM, Ben Tay wrote: > > > > > Oh sorry here's the whole information. I'm using 2 processors currently: > > > > > > > ************************************************************************************************************************ > > > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r > -fCourier9' to print this document *** > > > > ************************************************************************************************************************ > > > > > > ---------------------------------------------- PETSc Performance > Summary: ---------------------------------------------- > > > > > > ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by g0306332 > Tue Apr 15 23:03:09 2008 > > > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 > HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > > > > > > Max Max/Min Avg Total > > > Time (sec): 1.114e+03 1.00054 1.114e+03 > > > Objects: 5.400e+01 1.00000 5.400e+01 > > > Flops: 1.574e+11 1.00000 1.574e+11 3.147e+11 > > > Flops/sec: 1.414e+08 1.00054 1.413e+08 2.826e+08 > > > MPI Messages: 8.777e+03 1.00000 8.777e+03 1.755e+04 > > > MPI Message Lengths: 4.213e+07 1.00000 4.800e+03 8.425e+07 > > > MPI Reductions: 8.644e+03 1.00000 > > > > > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > > > e.g., VecAXPY() for real vectors of length N > --> 2N flops > > > and VecAXPY() for complex vectors of length N > --> 8N flops > > > > > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages > --- -- Message Lengths -- -- Reductions -- > > > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > > > 0: Main Stage: 1.1136e+03 100.0% 3.1475e+11 100.0% 1.755e+04 > 100.0% 4.800e+03 100.0% 1.729e+04 100.0% > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > > > Phase summary info: > > > Count: number of times phase was executed > > > Time and Flops/sec: Max - maximum over all processors > > > Ratio - ratio of maximum to minimum over all > processors > > > Mess: number of messages sent > > > Avg. len: average message length > > > Reduct: number of global reductions > > > Global: entire computation > > > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). 
> > > %T - percent time in this phase %F - percent flops in this > phase > > > %M - percent messages in this phase %L - percent message lengths > in this phase > > > %R - percent reductions in this phase > > > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time > over all processors) > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > > > > ########################################################## > > > # # > > > # WARNING!!! # > > > # # > > > # This code was run without the PreLoadBegin() # > > > # macros. To get timing results we always recommend # > > > # preloading. otherwise timing numbers may be # > > > # meaningless. # > > > ########################################################## > > > > > > > > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > > > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > --- Event Stage 0: Main Stage > > > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 > 0.0e+00 10 11100100 0 10 11100100 0 217 > > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 > > > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0 > > > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 > 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 > > > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 > 1.7e+04 89100100100100 89100100100100 317 > > > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 > > > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 > 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 > > > VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 > 8.8e+03 9 2 0 0 51 9 2 0 0 51 42 > > > VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 636 > > > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 > > > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 > 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 > > > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > 
> > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > > > VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 > 8.8e+03 9 4 0 0 51 9 4 0 0 51 62 > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > Memory usage is given in bytes: > > > > > > Object Type Creations Destructions Memory Descendants' > Mem. > > > > > > --- Event Stage 0: Main Stage > > > > > > Matrix 4 4 49227380 0 > > > Krylov Solver 2 2 17216 0 > > > Preconditioner 2 2 256 0 > > > Index Set 5 5 2596120 0 > > > Vec 40 40 62243224 0 > > > Vec Scatter 1 1 0 0 > > > > ======================================================================================================================== > > > Average time to get PetscTime(): 4.05312e-07 > > > Average time for MPI_Barrier(): 7.62939e-07 > > > Average time for zero size MPI_Send(): 2.02656e-06 > > > OptionTable: -log_summary > > > Compiled without FORTRAN kernels > > > Compiled with full precision matrices (default) > > > Compiled without FORTRAN kernels Compiled > with full precision matrices (default) > > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > > > Configure run at: Tue Jan 8 22:22:08 2008 > > > Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 > --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 > --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 > --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 > --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 > --with-batch=1 --with-mpi-shared=0 > --with-mpi-include=/usr/local/topspin/mpi/mpich/include > --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a > --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun > --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 > > > ----------------------------------------- > > > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > > > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul > 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > > > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > > > Using PETSc arch: atlas3-mpi > > > ----------------------------------------- > > > Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. > -fPIC -O ----------------------------------------- > > > Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - > > > I/home/enduser/g0306332/lib/hypre/include > -I/usr/local/topspin/mpi/mpich/include > ------------------------------------------ > > > Using C linker: mpicc -fPIC -O > > > Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: > -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi > -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib > -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib > -L/usr/local/topspin/mpi/mpich/lib -lmpich > -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t > -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 > -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt > -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport > -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl > -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs > -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc > > > ------------------------------------------ > > > 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata > 0maxresident)k > > > 0inputs+0outputs (28major+153248minor)pagefaults 0swaps > > > 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata > 0maxresident)k > > > 0inputs+0outputs (18major+158175minor)pagefaults 0swaps > > > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > > TID HOST_NAME COMMAND_LINE STATUS > TERMINATION_TIME 
> > > ===== ========== ================ ======================= > =================== > > > 00000 atlas3-c05 time ./a.out -lo Done 04/15/2008 > 23:03:10 > > > 00001 atlas3-c05 time ./a.out -lo Done 04/15/2008 > 23:03:10 > > > > > > > > > I have a cartesian grid 600x720. Since there's 2 processors, it is > partitioned to 600x360. I just use: > > > > > > call > MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr) > > > > > > call MatSetFromOptions(A_mat,ierr) > > > > > > call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr) > > > > > > call KSPCreate(MPI_COMM_WORLD,ksp,ierr) > > > > > > call > VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr) > > > > > > total_k is actually size_x*size_y. Since it's 2d, the maximum values per > row is 5. When you says setting off-process values, do you mean I insert > values from 1 processor into another? I thought I insert the values into the > correct processor... > > > > > > Thank you very much! > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > 1) Please never cut out parts of the summary. All the information is > valuable, > > > > and most times, necessary > > > > > > > > 2) You seem to have huge load imbalance (look at VecNorm). Do you > partition > > > > the system yourself. How many processes is this? > > > > > > > > 3) You seem to be setting a huge number of off-process values in the > matrix > > > > (see MatAssemblyBegin). Is this true? I would reorganize this part. > > > > > > > > Matt > > > > > > > > On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay <zonexo at gmail.com> wrote: > > > > > > > > > > > > > Hi, > > > > > > > > > > I have converted the poisson eqn part of the CFD code to parallel. > The grid > > > > > size tested is 600x720. For the momentum eqn, I used another serial > linear > > > > > solver (nspcg) to prevent mixing of results. 
Here's the output > summary: > > > > > > > > > > --- Event Stage 0: Main Stage > > > > > > > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 > 4.8e+03 > > > > > 0.0e+00 10 11100100 0 10 11100100 0 217 > > > > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > > > > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 > > > > > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 > 0.0e+00 > > > > > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* > > > > > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 > 2.4e+03 > > > > > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 > 0.0e+00 > > > > > 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 > > > > > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 > 4.8e+03 > > > > > 1.7e+04 89100100100100 89100100100100 317 > > > > > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 > 0.0e+00 > > > > > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > > > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 > 0.0e+00 > > > > > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > > > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 > > > > > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 > 0.0e+00 > > > > > 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 > > > > > *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 > 0.0e+00 > > > > > 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* > > > > > *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* > > > > > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > > > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 > > > > > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 > > > > > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 > 4.8e+03 > > > > > 0.0e+00 0 0100100 0 0 0100100 0 0* > > > > > *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* > > > > > *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 > 0.0e+00 > > > > > 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > Memory usage is given in bytes: > > > > > Object Type Creations Destructions Memory > Descendants' Mem. 
> > > > > --- Event Stage 0: Main Stage > > > > > Matrix 4 4 49227380 0 > > > > > Krylov Solver 2 2 17216 0 > > > > > Preconditioner 2 2 256 0 > > > > > Index Set 5 5 2596120 0 > > > > > Vec 40 40 62243224 0 > > > > > Vec Scatter 1 1 0 0 > > > > > > ======================================================================================================================== > > > > > Average time to get PetscTime(): 4.05312e-07 > Average time > > > > > for MPI_Barrier(): 7.62939e-07 > > > > > Average time for zero size MPI_Send(): 2.02656e-06 > > > > > OptionTable: -log_summary > > > > > > > > > > > > > > > The PETSc manual states that ratio should be close to 1. There's > quite a > > > > > few *(in bold)* which are >1 and MatAssemblyBegin seems to be very > big. So > > > > > what could be the cause? > > > > > > > > > > I wonder if it has to do the way I insert the matrix. My steps are: > > > > > (cartesian grids, i loop faster than j, fortran) > > > > > > > > > > For matrix A and rhs > > > > > > > > > > Insert left extreme cells values belonging to myid > > > > > > > > > > if (myid==0) then > > > > > > > > > > insert corner cells values > > > > > > > > > > insert south cells values > > > > > > > > > > insert internal cells values > > > > > > > > > > else if (myid==num_procs-1) then > > > > > > > > > > insert corner cells values > > > > > > > > > > insert north cells values > > > > > > > > > > insert internal cells values > > > > > > > > > > else > > > > > > > > > > insert internal cells values > > > > > > > > > > end if > > > > > > > > > > Insert right extreme cells values belonging to myid > > > > > > > > > > All these values are entered into a big_A(size_x*size_y,5) matrix. > int_A > > > > > stores the position of the values. I then do > > > > > > > > > > call MatZeroEntries(A_mat,ierr) > > > > > > > > > > do k=ksta_p+1,kend_p !for cells belonging to myid > > > > > > > > > > do kk=1,5 > > > > > > > > > > II=k-1 > > > > > > > > > > JJ=int_A(k,kk)-1 > > > > > > > > > > call > MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr) > > > > > end do > > > > > > > > > > end do > > > > > > > > > > call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) > > > > > > > > > > call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) > > > > > > > > > > > > > > > I wonder if the problem lies here.I used the big_A matrix because I > was > > > > > migrating from an old linear solver. Lastly, I was told to widen my > window > > > > > to 120 characters. May I know how do I do it? > > > > > > > > > > > > > > > > > > > > Thank you very much. > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > > > > > > > > > On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay <zonexo at gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Matthew, > > > > > > > > > > > > > > I think you've misunderstood what I meant. What I'm trying to > say is > > > > > > > initially I've got a serial code. I tried to convert to a > parallel one. > > > > > > > > > > > > > > > > > > > > > > > > > Then > > > > > > > > > > > > > > > > > > > > > > > I tested it and it was pretty slow. Due to some work > requirement, I need > > > > > > > > > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > > > > > > go back to make some changes to my code. Since the parallel is > not > > > > > > > > > > > > > > > > > > > > > > > > > working > > > > > > > > > > > > > > > > > > > > > > > well, I updated and changed the serial one. 
> > > > > > > > > > > > > > Well, that was a while ago and now, due to the updates and > changes, the > > > > > > > serial code is different from the old converted parallel code. > Some > > > > > > > > > > > > > > > > > > > > > > > > > files > > > > > > > > > > > > > > > > > > > > > > > were also deleted and I can't seem to get it working now. So I > thought I > > > > > > > might as well convert the new serial code to parallel. But I'm > not very > > > > > > > > > > > > > > > > > > > > > > > > > sure > > > > > > > > > > > > > > > > > > > > > > > what I should do 1st. > > > > > > > > > > > > > > Maybe I should rephrase my question in that if I just convert my > > > > > > > > > > > > > > > > > > > > > > > > > poisson > > > > > > > > > > > > > > > > > > > > > > > equation subroutine from a serial PETSc to a parallel PETSc > version, > > > > > > > > > > > > > > > > > > > > > > > > > will it > > > > > > > > > > > > > > > > > > > > > > > work? Should I expect a speedup? The rest of my code is still > serial. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > You should, of course, only expect speedup in the parallel parts > > > > > > > > > > > > Matt > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you very much. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I am not sure why you would ever have two codes. I never do > this. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > PETSc > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > is designed to write one code to run in serial and parallel. > The PETSc > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > part > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > should look identical. To test, run the code yo uhave verified > in > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > serial > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > output PETSc data structures (like Mat and Vec) using a binary > viewer. > > > > > > > > Then run in parallel with the same code, which will output the > same > > > > > > > > structures. Take the two files and write a small verification > code > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > that > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > loads both versions and calls MatEqual and VecEqual. > > > > > > > > > > > > > > > > Matt > > > > > > > > > > > > > > > > On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay <zonexo at gmail.com> > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you Matthew. Sorry to trouble you again. > > > > > > > > > > > > > > > > > > I tried to run it with -log_summary output and I found that > there's > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > some > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > errors in the execution. 
Well, I was busy with other things > and I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > just > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > came > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > back to this problem. Some of my files on the server has > also been > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > deleted. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It has been a while and I remember that it worked before, > only > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > much > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > slower. > > > > > > > > > > > > > > > > > > Anyway, most of the serial code has been updated and maybe > it's > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > easier > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > convert the new serial code instead of debugging on the old > parallel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > code > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > now. I believe I can still reuse part of the old parallel > code. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > However, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > hope I can approach it better this time. > > > > > > > > > > > > > > > > > > So supposed I need to start converting my new serial code to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > parallel. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > There's 2 eqns to be solved using PETSc, the momentum and > poisson. I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > also > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > need to parallelize other parts of my code. I wonder which > route is > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > best: > > > > > > > > > > > > > > > > > > 1. Don't change the PETSc part ie continue using > PETSC_COMM_SELF, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > modify > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > other parts of my code to parallel e.g. looping, updating of > values > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > etc. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Once the execution is fine and speedup is reasonable, then > modify > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > PETSc > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > part - poisson eqn 1st followed by the momentum eqn. > > > > > > > > > > > > > > > > > > 2. Reverse the above order ie modify the PETSc part - > poisson eqn > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1st > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > followed by the momentum eqn. Then do other parts of my > code. > > > > > > > > > > > > > > > > > > I'm not sure if the above 2 mtds can work or if there will > be > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > conflicts. Of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > course, an alternative will be: > > > > > > > > > > > > > > > > > > 3. Do the poisson, momentum eqns and other parts of the code > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > separately. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > That is, code a standalone parallel poisson eqn and use > samples > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > values > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > test it. Same for the momentum and other parts of the code. > When > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > each of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > them is working, combine them to form the full parallel > code. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > However, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > this > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > will be much more troublesome. > > > > > > > > > > > > > > > > > > I hope someone can give me some recommendations. > > > > > > > > > > > > > > > > > > Thank you once again. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1) There is no way to have any idea what is going on in > your code > > > > > > > > > > without -log_summary output > > > > > > > > > > > > > > > > > > > > 2) Looking at that output, look at the percentage taken by > the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > solver > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > KSPSolve event. 
I suspect it is not the biggest component, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > because > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > it is very scalable. > > > > > > > > > > > > > > > > > > > > Matt > > > > > > > > > > > > > > > > > > > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay > <zonexo at gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > I've a serial 2D CFD code. As my grid size requirement > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > increases, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > simulation takes longer. Also, memory requirement > becomes a > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > problem. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Grid > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > size 've reached 1200x1200. Going higher is not possible > due to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > memory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > problem. > > > > > > > > > > > > > > > > > > > > > > I tried to convert my code to a parallel one, following > the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > examples > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > given. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I also need to restructure parts of my code to enable > parallel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > looping. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1st changed the PETSc solver to be parallel enabled and > then I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > restructured > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > parts of my code. I proceed on as longer as the answer > for a > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > simple > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > test > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > case is correct. I thought it's not really possible to > do any > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > speed > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > testing > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > since the code is not fully parallelized yet. When I > finished > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > during > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > most of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the conversion, I found that in the actual run that it > is much > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > slower, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > although the answer is correct. > > > > > > > > > > > > > > > > > > > > > > So what is the remedy now? I wonder what I should do to > check > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > what's > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrong. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Must I restart everything again? 
Btw, my grid size is > 1200x1200. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > believed > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > it should be suitable for parallel run of 4 processors? > Is that > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > so? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
