The convergence here is just horrendous. Have you tried using LU to check your implementation? All the time is in the solve right now. I would first try a direct method (at least on a small problem) and then try to understand the convergence behavior. MUMPS can actually scale very well for large problems.
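For example, something along these lines (an untested sketch; option names are from the 2.3.3-era manual, and the MUMPS line assumes a PETSc build configured with MUMPS, which your configure options do not show):

   # serial: replace GMRES/ILU with a direct LU solve to check the assembled system
   ./a.out -ksp_type preonly -pc_type lu -log_summary

   # parallel: direct solve through MUMPS, if it is available in your build
   mpirun -np 2 ./a.out -mat_type aijmumps -ksp_type preonly -pc_type lu -log_summary

If the direct solve gives the answer you expect in a few seconds, the matrix assembly is fine and the problem is the GMRES/ILU convergence, not the code.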
Matt On Tue, Apr 15, 2008 at 11:44 AM, Ben Tay <zonexo at gmail.com> wrote: > Hi, > > Here's the summary for 1 processor. Seems like it's also using a long > time... Can someone tell me when my mistakes possibly lie? Thank you very > much! > > > ************************************************************************************************************************ > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r > -fCourier9' to print this document *** > > ************************************************************************************************************************ > > ---------------------------------------------- PETSc Performance Summary: > ---------------------------------------------- > > ./a.out on a atlas3-mp named atlas3-c45 with 1 processor, by g0306332 Wed > Apr 16 00:39:22 2008 > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 HG > revision: 414581156e67e55c761739b0deb119f7590d0f4b > > Max Max/Min Avg Total > Time (sec): 1.088e+03 1.00000 1.088e+03 > Objects: 4.300e+01 1.00000 4.300e+01 > Flops: 2.658e+11 1.00000 2.658e+11 2.658e+11 > Flops/sec: 2.444e+08 1.00000 2.444e+08 2.444e+08 > MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00 > MPI Reductions: 1.460e+04 1.00000 > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > e.g., VecAXPY() for real vectors of length N --> > 2N flops > and VecAXPY() for complex vectors of length N --> > 8N flops > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- > -- Message Lengths -- -- Reductions -- > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > 0: Main Stage: 1.0877e+03 100.0% 2.6584e+11 100.0% 0.000e+00 0.0% > 0.000e+00 0.0% 1.460e+04 100.0% > > > ------------------------------------------------------------------------------------------------------------------------ > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > Phase summary info: > Count: number of times phase was executed > Time and Flops/sec: Max - maximum over all processors > Ratio - ratio of maximum to minimum over all > processors > Mess: number of messages sent > Avg. len: average message length > Reduct: number of global reductions > Global: entire computation > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). > %T - percent time in this phase %F - percent flops in this > phase > %M - percent messages in this phase %L - percent message lengths in > this phase > %R - percent reductions in this phase > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over > all processors) > > ------------------------------------------------------------------------------------------------------------------------ > > > ########################################################## > # # > # WARNING!!! # > # # > # This code was run without the PreLoadBegin() # > # macros. To get timing results we always recommend # > # preloading. otherwise timing numbers may be # > # meaningless. # > # preloading. otherwise timing numbers may be # > # meaningless. 
# > ########################################################## > > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > ------------------------------------------------------------------------------------------------------------------------ > > --- Event Stage 0: Main Stage > > MatMult 7412 1.0 1.3344e+02 1.0 2.16e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 12 11 0 0 0 12 11 0 0 0 216 > MatSolve 7413 1.0 2.6851e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 107 > MatLUFactorNum 1 1.0 4.3947e-02 1.0 8.83e+07 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 88 > MatILUFactorSym 1 1.0 3.7798e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyBegin 1 1.0 9.5367e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatAssemblyEnd 1 1.0 2.5835e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetRowIJ 1 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatGetOrdering 1 1.0 6.0391e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > MatZeroEntries 1 1.0 1.7377e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPGMRESOrthog 7173 1.0 5.6323e+02 1.0 3.41e+08 1.0 0.0e+00 0.0e+00 > 7.2e+03 52 72 0 0 49 52 72 0 0 49 341 > KSPSetup 1 1.0 1.2676e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > KSPSolve 1 1.0 1.0144e+03 1.0 2.62e+08 1.0 0.0e+00 0.0e+00 > 1.5e+04 93100 0 0100 93100 0 0100 262 > PCSetUp 1 1.0 8.7809e-02 1.0 4.42e+07 1.0 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 44 > PCApply 7413 1.0 2.6853e+02 1.0 1.07e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 25 11 0 0 0 25 11 0 0 0 107 > VecMDot 7173 1.0 2.6720e+02 1.0 3.59e+08 1.0 0.0e+00 0.0e+00 > 7.2e+03 25 36 0 0 49 25 36 0 0 49 359 > VecNorm 7413 1.0 1.7125e+01 1.0 3.74e+08 1.0 0.0e+00 0.0e+00 > 7.4e+03 2 2 0 0 51 2 2 0 0 51 374 > VecScale 7413 1.0 9.2787e+00 1.0 3.45e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 1 1 0 0 0 1 1 0 0 0 345 > VecCopy 240 1.0 5.1628e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecSet 241 1.0 6.4428e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAXPY 479 1.0 2.0082e+00 1.0 2.06e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 206 > VecMAXPY 7413 1.0 3.1536e+02 1.0 3.24e+08 1.0 0.0e+00 0.0e+00 > 0.0e+00 29 38 0 0 0 29 38 0 0 0 324 > VecAssemblyBegin 2 1.0 2.3127e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecAssemblyEnd 2 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > VecNormalize 7413 1.0 2.6424e+01 1.0 3.64e+08 1.0 0.0e+00 0.0e+00 > 7.4e+03 2 4 0 0 51 2 4 0 0 51 364 > > ------------------------------------------------------------------------------------------------------------------------ > > Memory usage is given in bytes: > > Object Type Creations Destructions Memory Descendants' Mem. 
> > --- Event Stage 0: Main Stage > > Matrix 2 2 65632332 0 > Krylov Solver 1 1 17216 0 > Preconditioner 1 1 168 0 > Index Set 3 3 5185032 0 > Vec 36 36 120987640 0 > > ======================================================================================================================== > Average time to get PetscTime(): 3.09944e-07 > OptionTable: -log_summary > Compiled without FORTRAN kernels > Compiled with full precision matrices (default) > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > Configure run at: Tue Jan 8 22:22:08 2008 Configure > options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 --sizeof_short=2 > --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 --sizeof_float=4 > --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 --sizeof_MPI_Fint=4 > --with-vendor-compilers=intel --with-x=0 > --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 > --with-batch=1 --with-mpi-shared=0 > --with-mpi-include=/usr/local/topspin/mpi/mpich/include > --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a > --with-mpirun=/usr/local/topspin/mpi/mpich/bi > n/mpirun --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t > --with-shared=0 ----------------------------------------- > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul 12 > 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > Using PETSc arch: atlas3-mpi > ----------------------------------------- > Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. -fPIC > -O ----------------------------------------- > Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - > I/home/enduser/g0306332/lib/hypre/include > -I/usr/local/topspin/mpi/mpich/include > ------------------------------------------ > Using C linker: mpicc -fPIC -O > Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: > -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi > -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib > -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib > -L/usr/local/topspin/mpi/mpich/lib -lmpich > -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t > -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 > -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt > -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport > -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl > -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs > -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc > ------------------------------------------ > 639.52user 4.80system 18:08.23elapsed 59%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (20major+172979minor)pagefaults 0swaps > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > TID HOST_NAME COMMAND_LINE STATUS > TERMINATION_TIME > ===== ========== ================ ======================= > =================== > 00000 atlas3-c45 time ./a.out -lo Done 04/16/2008 > 00:39:23 > > > Barry Smith 
wrote: > > > > > It is taking 8776 iterations of GMRES! How many does it take on one > process? This is a huge > > amount. > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 > 0.0e+00 10 11100100 0 10 11100100 0 217 > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > > > One process is spending 2.9 times as long in the embarresingly parallel > MatSolve then the other process; > > this indicates a huge imbalance in the number of nonzeros on each process. > As Matt noticed, the partitioning > > between the two processes is terrible. > > > > Barry > > > > On Apr 15, 2008, at 10:56 AM, Ben Tay wrote: > > > > > Oh sorry here's the whole information. I'm using 2 processors currently: > > > > > > > ************************************************************************************************************************ > > > *** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r > -fCourier9' to print this document *** > > > > ************************************************************************************************************************ > > > > > > ---------------------------------------------- PETSc Performance > Summary: ---------------------------------------------- > > > > > > ./a.out on a atlas3-mp named atlas3-c05 with 2 processors, by g0306332 > Tue Apr 15 23:03:09 2008 > > > Using Petsc Release Version 2.3.3, Patch 8, Fri Nov 16 17:03:40 CST 2007 > HG revision: 414581156e67e55c761739b0deb119f7590d0f4b > > > > > > Max Max/Min Avg Total > > > Time (sec): 1.114e+03 1.00054 1.114e+03 > > > Objects: 5.400e+01 1.00000 5.400e+01 > > > Flops: 1.574e+11 1.00000 1.574e+11 3.147e+11 > > > Flops/sec: 1.414e+08 1.00054 1.413e+08 2.826e+08 > > > MPI Messages: 8.777e+03 1.00000 8.777e+03 1.755e+04 > > > MPI Message Lengths: 4.213e+07 1.00000 4.800e+03 8.425e+07 > > > MPI Reductions: 8.644e+03 1.00000 > > > > > > Flop counting convention: 1 flop = 1 real number operation of type > (multiply/divide/add/subtract) > > > e.g., VecAXPY() for real vectors of length N > --> 2N flops > > > and VecAXPY() for complex vectors of length N > --> 8N flops > > > > > > Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages > --- -- Message Lengths -- -- Reductions -- > > > Avg %Total Avg %Total counts %Total > Avg %Total counts %Total > > > 0: Main Stage: 1.1136e+03 100.0% 3.1475e+11 100.0% 1.755e+04 > 100.0% 4.800e+03 100.0% 1.729e+04 100.0% > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > See the 'Profiling' chapter of the users' manual for details on > interpreting output. > > > Phase summary info: > > > Count: number of times phase was executed > > > Time and Flops/sec: Max - maximum over all processors > > > Ratio - ratio of maximum to minimum over all > processors > > > Mess: number of messages sent > > > Avg. len: average message length > > > Reduct: number of global reductions > > > Global: entire computation > > > Stage: stages of a computation. Set stages with PetscLogStagePush() and > PetscLogStagePop(). 
> > > %T - percent time in this phase %F - percent flops in this > phase > > > %M - percent messages in this phase %L - percent message lengths > in this phase > > > %R - percent reductions in this phase > > > Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time > over all processors) > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > > > > ########################################################## > > > # # > > > # WARNING!!! # > > > # # > > > # This code was run without the PreLoadBegin() # > > > # macros. To get timing results we always recommend # > > > # preloading. otherwise timing numbers may be # > > > # meaningless. # > > > ########################################################## > > > > > > > > > Event Count Time (sec) Flops/sec > --- Global --- --- Stage --- Total > > > Max Ratio Max Ratio Max Ratio Mess Avg len > Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > --- Event Stage 0: Main Stage > > > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 4.8e+03 > 0.0e+00 10 11100100 0 10 11100100 0 217 > > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 > > > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 0.0e+00 > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0 > > > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 2.4e+03 > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 0.0e+00 > 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 > > > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 4.8e+03 > 1.7e+04 89100100100100 89100100100100 317 > > > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 0.0e+00 > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 0.0e+00 > 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 > > > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 0.0e+00 > 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 > > > VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 0.0e+00 > 8.8e+03 9 2 0 0 51 9 2 0 0 51 42 > > > VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 0.0e+00 > 0.0e+00 0 1 0 0 0 0 1 0 0 0 636 > > > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 > > > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 0.0e+00 > 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 > > > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > 
> > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 4.8e+03 > 0.0e+00 0 0100100 0 0 0100100 0 0 > > > VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 0.0e+00 > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 0.0e+00 > 8.8e+03 9 4 0 0 51 9 4 0 0 51 62 > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > > Memory usage is given in bytes: > > > > > > Object Type Creations Destructions Memory Descendants' > Mem. > > > > > > --- Event Stage 0: Main Stage > > > > > > Matrix 4 4 49227380 0 > > > Krylov Solver 2 2 17216 0 > > > Preconditioner 2 2 256 0 > > > Index Set 5 5 2596120 0 > > > Vec 40 40 62243224 0 > > > Vec Scatter 1 1 0 0 > > > > ======================================================================================================================== > > > Average time to get PetscTime(): 4.05312e-07 > > > Average time for MPI_Barrier(): 7.62939e-07 > > > Average time for zero size MPI_Send(): 2.02656e-06 > > > OptionTable: -log_summary > > > Compiled without FORTRAN kernels > > > Compiled with full precision matrices (default) > > > Compiled without FORTRAN kernels Compiled > with full precision matrices (default) > > > sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 > sizeof(PetscScalar) 8 > > > Configure run at: Tue Jan 8 22:22:08 2008 > > > Configure options: --with-memcmp-ok --sizeof_char=1 --sizeof_void_p=8 > --sizeof_short=2 --sizeof_int=4 --sizeof_long=8 --sizeof_long_long=8 > --sizeof_float=4 --sizeof_double=8 --bits_per_byte=8 --sizeof_MPI_Comm=4 > --sizeof_MPI_Fint=4 --with-vendor-compilers=intel --with-x=0 > --with-hypre-dir=/home/enduser/g0306332/lib/hypre --with-debugging=0 > --with-batch=1 --with-mpi-shared=0 > --with-mpi-include=/usr/local/topspin/mpi/mpich/include > --with-mpi-lib=/usr/local/topspin/mpi/mpich/lib/libmpich.a > --with-mpirun=/usr/local/topspin/mpi/mpich/bin/mpirun > --with-blas-lapack-dir=/opt/intel/cmkl/8.1.1/lib/em64t --with-shared=0 > > > ----------------------------------------- > > > Libraries compiled on Tue Jan 8 22:34:13 SGT 2008 on atlas3-c01 > > > Machine characteristics: Linux atlas3-c01 2.6.9-42.ELsmp #1 SMP Wed Jul > 12 23:32:02 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux > > > Using PETSc directory: /nfs/home/enduser/g0306332/petsc-2.3.3-p8 > > > Using PETSc arch: atlas3-mpi > > > ----------------------------------------- > > > Using C compiler: mpicc -fPIC -O Using Fortran compiler: mpif90 -I. > -fPIC -O ----------------------------------------- > > > Using include paths: -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8 > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/bmake/atlas3-mpi > -I/nfs/home/enduser/g0306332/petsc-2.3.3-p8/include - > > > I/home/enduser/g0306332/lib/hypre/include > -I/usr/local/topspin/mpi/mpich/include > ------------------------------------------ > > > Using C linker: mpicc -fPIC -O > > > Using Fortran linker: mpif90 -I. 
-fPIC -O Using libraries: > -Wl,-rpath,/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi > -L/nfs/home/enduser/g0306332/petsc-2.3.3-p8/lib/atlas3-mpi -lpetscts > -lpetscsnes -lpetscksp -lpetscdm -lpetscmat -lpetscvec -lpetsc > -Wl,-rpath,/home/enduser/g0306332/lib/hypre/lib > -L/home/enduser/g0306332/lib/hypre/lib -lHYPRE > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/usr/local/topspin/mpi/mpich/lib > -L/usr/local/topspin/mpi/mpich/lib -lmpich > -Wl,-rpath,/opt/intel/cmkl/8.1.1/lib/em64t -L/opt/intel/cmkl/8.1.1/lib/em64t > -lmkl_lapack -lmkl_em64t -lguide -lpthread -Wl,-rpath,/usr/local/ofed/lib64 > -L/usr/local/ofed/lib64 -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -L/opt/mvapich/0.9.9/gen2/lib -ldl -lmpich -libverbs -libumad -lpthread -lrt > -Wl,-rpath,/opt/intel/cce/9.1.049/lib -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -lmpichf90nc > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/intel/fce/9.1.045/lib -L/opt/intel/fce/9.1.045/lib -lifport > -lifcore -lm -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -lm > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -Wl,-rpath,/usr/local/ofed/lib64 > -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -lstdc++ -lcxaguard -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib > -Wl,-rpath,/usr/local/ofed/lib64 -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 > -Wl,-rpath,/opt/mvapich/0.9.9/gen2/lib -L/opt/mvapich/0.9.9/gen2/lib -ldl > -lmpich -Wl,-rpath,/usr/local/ofed/lib64 -L/usr/local/ofed/lib64 -libverbs > -libumad -lpthread -lrt -Wl,-rpath,/opt/intel/cce/9.1.049/lib > -L/opt/intel/cce/9.1.049/lib > -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ > -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6/ -Wl,-rpath,/usr/lib64 -L/usr/lib64 > -lsvml -limf -lipgo -lirc -lgcc_s -lirc_s -ldl -lc > > > ------------------------------------------ > > > 1079.77user 0.79system 18:34.82elapsed 96%CPU (0avgtext+0avgdata > 0maxresident)k > > > 0inputs+0outputs (28major+153248minor)pagefaults 0swaps > > > 387.76user 3.95system 18:34.77elapsed 35%CPU (0avgtext+0avgdata > 0maxresident)k > > > 0inputs+0outputs (18major+158175minor)pagefaults 0swaps > > > Job /usr/lsf62/bin/mvapich_wrapper time ./a.out -log_summary > > > TID HOST_NAME COMMAND_LINE STATUS > TERMINATION_TIME 
> > > ===== ========== ================ ======================= > =================== > > > 00000 atlas3-c05 time ./a.out -lo Done 04/15/2008 > 23:03:10 > > > 00001 atlas3-c05 time ./a.out -lo Done 04/15/2008 > 23:03:10 > > > > > > > > > I have a cartesian grid 600x720. Since there's 2 processors, it is > partitioned to 600x360. I just use: > > > > > > call > MatCreateMPIAIJ(MPI_COMM_WORLD,PETSC_DECIDE,PETSC_DECIDE,total_k,total_k,5,PETSC_NULL_INTEGER,5,PETSC_NULL_INTEGER,A_mat,ierr) > > > > > > call MatSetFromOptions(A_mat,ierr) > > > > > > call MatGetOwnershipRange(A_mat,ksta_p,kend_p,ierr) > > > > > > call KSPCreate(MPI_COMM_WORLD,ksp,ierr) > > > > > > call > VecCreateMPI(MPI_COMM_WORLD,PETSC_DECIDE,size_x*size_y,b_rhs,ierr) > > > > > > total_k is actually size_x*size_y. Since it's 2d, the maximum values per > row is 5. When you says setting off-process values, do you mean I insert > values from 1 processor into another? I thought I insert the values into the > correct processor... > > > > > > Thank you very much! > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > 1) Please never cut out parts of the summary. All the information is > valuable, > > > > and most times, necessary > > > > > > > > 2) You seem to have huge load imbalance (look at VecNorm). Do you > partition > > > > the system yourself. How many processes is this? > > > > > > > > 3) You seem to be setting a huge number of off-process values in the > matrix > > > > (see MatAssemblyBegin). Is this true? I would reorganize this part. > > > > > > > > Matt > > > > > > > > On Tue, Apr 15, 2008 at 10:33 AM, Ben Tay <zonexo at gmail.com> wrote: > > > > > > > > > > > > > Hi, > > > > > > > > > > I have converted the poisson eqn part of the CFD code to parallel. > The grid > > > > > size tested is 600x720. For the momentum eqn, I used another serial > linear > > > > > solver (nspcg) to prevent mixing of results. 
Here's the output > summary: > > > > > > > > > > --- Event Stage 0: Main Stage > > > > > > > > > > MatMult 8776 1.0 1.5701e+02 2.2 2.43e+08 2.2 1.8e+04 > 4.8e+03 > > > > > 0.0e+00 10 11100100 0 10 11100100 0 217 > > > > > MatSolve 8777 1.0 2.8379e+02 2.9 1.73e+08 2.9 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 17 11 0 0 0 17 11 0 0 0 120 > > > > > MatLUFactorNum 1 1.0 2.7618e-02 1.2 8.68e+07 1.2 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 140 > > > > > MatILUFactorSym 1 1.0 2.4259e-02 1.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 1.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > *MatAssemblyBegin 1 1.0 5.6334e+01853005.4 0.00e+00 0.0 > 0.0e+00 > > > > > 0.0e+00 2.0e+00 3 0 0 0 0 3 0 0 0 0 0* > > > > > MatAssemblyEnd 1 1.0 4.7958e-02 1.0 0.00e+00 0.0 2.0e+00 > 2.4e+03 > > > > > 7.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > MatGetRowIJ 1 1.0 3.0994e-06 1.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > MatGetOrdering 1 1.0 3.8640e-03 1.3 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 2.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > MatZeroEntries 1 1.0 1.8353e-02 1.2 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > KSPGMRESOrthog 8493 1.0 6.2636e+02 1.3 2.32e+08 1.3 0.0e+00 > 0.0e+00 > > > > > 8.5e+03 50 72 0 0 49 50 72 0 0 49 363 > > > > > KSPSetup 2 1.0 1.0490e-02 1.3 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > KSPSolve 1 1.0 9.9177e+02 1.0 1.59e+08 1.0 1.8e+04 > 4.8e+03 > > > > > 1.7e+04 89100100100100 89100100100100 317 > > > > > PCSetUp 2 1.0 5.5893e-02 1.2 4.02e+07 1.2 0.0e+00 > 0.0e+00 > > > > > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > > > PCSetUpOnBlocks 1 1.0 5.5777e-02 1.2 4.03e+07 1.2 0.0e+00 > 0.0e+00 > > > > > 3.0e+00 0 0 0 0 0 0 0 0 0 0 69 > > > > > PCApply 8777 1.0 2.9987e+02 2.9 1.63e+08 2.9 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 18 11 0 0 0 18 11 0 0 0 114 > > > > > VecMDot 8493 1.0 5.3381e+02 2.2 2.36e+08 2.2 0.0e+00 > 0.0e+00 > > > > > 8.5e+03 35 36 0 0 49 35 36 0 0 49 213 > > > > > *VecNorm 8777 1.0 1.8237e+0210.2 2.13e+0810.2 0.0e+00 > 0.0e+00 > > > > > 8.8e+03 9 2 0 0 51 9 2 0 0 51 42* > > > > > *VecScale 8777 1.0 5.9594e+00 4.7 1.49e+09 4.7 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 1 0 0 0 0 1 0 0 0 636* > > > > > VecCopy 284 1.0 4.2563e-01 1.2 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > VecSet 9062 1.0 1.5833e+01 2.6 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0 > > > > > VecAXPY 567 1.0 1.4142e+00 2.8 4.90e+08 2.8 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 346 > > > > > VecMAXPY 8777 1.0 2.6692e+02 2.7 6.15e+08 2.7 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 16 38 0 0 0 16 38 0 0 0 453 > > > > > VecAssemblyBegin 2 1.0 1.6093e-04 2.5 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 6.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > VecAssemblyEnd 2 1.0 4.7684e-06 1.7 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 > > > > > *VecScatterBegin 8776 1.0 6.6898e-01 6.7 0.00e+00 0.0 1.8e+04 > 4.8e+03 > > > > > 0.0e+00 0 0100100 0 0 0100100 0 0* > > > > > *VecScatterEnd 8776 1.0 1.7747e+0130.1 0.00e+00 0.0 0.0e+00 > 0.0e+00 > > > > > 0.0e+00 1 0 0 0 0 1 0 0 0 0 0* > > > > > *VecNormalize 8777 1.0 1.8366e+02 7.7 2.39e+08 7.7 0.0e+00 > 0.0e+00 > > > > > 8.8e+03 9 4 0 0 51 9 4 0 0 51 62* > > > > > > > > > > > ------------------------------------------------------------------------------------------------------------------------ > > > > > Memory usage is given in bytes: > > > > > Object Type Creations Destructions Memory > Descendants' Mem. 
> > > > > --- Event Stage 0: Main Stage > > > > > Matrix 4 4 49227380 0 > > > > > Krylov Solver 2 2 17216 0 > > > > > Preconditioner 2 2 256 0 > > > > > Index Set 5 5 2596120 0 > > > > > Vec 40 40 62243224 0 > > > > > Vec Scatter 1 1 0 0 > > > > > > ======================================================================================================================== > > > > > Average time to get PetscTime(): 4.05312e-07 > Average time > > > > > for MPI_Barrier(): 7.62939e-07 > > > > > Average time for zero size MPI_Send(): 2.02656e-06 > > > > > OptionTable: -log_summary > > > > > > > > > > > > > > > The PETSc manual states that ratio should be close to 1. There's > quite a > > > > > few *(in bold)* which are >1 and MatAssemblyBegin seems to be very > big. So > > > > > what could be the cause? > > > > > > > > > > I wonder if it has to do the way I insert the matrix. My steps are: > > > > > (cartesian grids, i loop faster than j, fortran) > > > > > > > > > > For matrix A and rhs > > > > > > > > > > Insert left extreme cells values belonging to myid > > > > > > > > > > if (myid==0) then > > > > > > > > > > insert corner cells values > > > > > > > > > > insert south cells values > > > > > > > > > > insert internal cells values > > > > > > > > > > else if (myid==num_procs-1) then > > > > > > > > > > insert corner cells values > > > > > > > > > > insert north cells values > > > > > > > > > > insert internal cells values > > > > > > > > > > else > > > > > > > > > > insert internal cells values > > > > > > > > > > end if > > > > > > > > > > Insert right extreme cells values belonging to myid > > > > > > > > > > All these values are entered into a big_A(size_x*size_y,5) matrix. > int_A > > > > > stores the position of the values. I then do > > > > > > > > > > call MatZeroEntries(A_mat,ierr) > > > > > > > > > > do k=ksta_p+1,kend_p !for cells belonging to myid > > > > > > > > > > do kk=1,5 > > > > > > > > > > II=k-1 > > > > > > > > > > JJ=int_A(k,kk)-1 > > > > > > > > > > call > MatSetValues(A_mat,1,II,1,JJ,big_A(k,kk),ADD_VALUES,ierr) > > > > > end do > > > > > > > > > > end do > > > > > > > > > > call MatAssemblyBegin(A_mat,MAT_FINAL_ASSEMBLY,ierr) > > > > > > > > > > call MatAssemblyEnd(A_mat,MAT_FINAL_ASSEMBLY,ierr) > > > > > > > > > > > > > > > I wonder if the problem lies here.I used the big_A matrix because I > was > > > > > migrating from an old linear solver. Lastly, I was told to widen my > window > > > > > to 120 characters. May I know how do I do it? > > > > > > > > > > > > > > > > > > > > Thank you very much. > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > > > > > > > > > On Mon, Apr 14, 2008 at 8:43 AM, Ben Tay <zonexo at gmail.com> > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Matthew, > > > > > > > > > > > > > > I think you've misunderstood what I meant. What I'm trying to > say is > > > > > > > initially I've got a serial code. I tried to convert to a > parallel one. > > > > > > > > > > > > > > > > > > > > > > > > > Then > > > > > > > > > > > > > > > > > > > > > > > I tested it and it was pretty slow. Due to some work > requirement, I need > > > > > > > > > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > > > > > > go back to make some changes to my code. Since the parallel is > not > > > > > > > > > > > > > > > > > > > > > > > > > working > > > > > > > > > > > > > > > > > > > > > > > well, I updated and changed the serial one. 
> > > > > > > > > > > > > > Well, that was a while ago and now, due to the updates and > changes, the > > > > > > > serial code is different from the old converted parallel code. > Some > > > > > > > > > > > > > > > > > > > > > > > > > files > > > > > > > > > > > > > > > > > > > > > > > were also deleted and I can't seem to get it working now. So I > thought I > > > > > > > might as well convert the new serial code to parallel. But I'm > not very > > > > > > > > > > > > > > > > > > > > > > > > > sure > > > > > > > > > > > > > > > > > > > > > > > what I should do 1st. > > > > > > > > > > > > > > Maybe I should rephrase my question in that if I just convert my > > > > > > > > > > > > > > > > > > > > > > > > > poisson > > > > > > > > > > > > > > > > > > > > > > > equation subroutine from a serial PETSc to a parallel PETSc > version, > > > > > > > > > > > > > > > > > > > > > > > > > will it > > > > > > > > > > > > > > > > > > > > > > > work? Should I expect a speedup? The rest of my code is still > serial. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > You should, of course, only expect speedup in the parallel parts > > > > > > > > > > > > Matt > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you very much. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I am not sure why you would ever have two codes. I never do > this. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > PETSc > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > is designed to write one code to run in serial and parallel. > The PETSc > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > part > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > should look identical. To test, run the code yo uhave verified > in > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > serial > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > output PETSc data structures (like Mat and Vec) using a binary > viewer. > > > > > > > > Then run in parallel with the same code, which will output the > same > > > > > > > > structures. Take the two files and write a small verification > code > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > that > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > loads both versions and calls MatEqual and VecEqual. > > > > > > > > > > > > > > > > Matt > > > > > > > > > > > > > > > > On Mon, Apr 14, 2008 at 5:49 AM, Ben Tay <zonexo at gmail.com> > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you Matthew. Sorry to trouble you again. > > > > > > > > > > > > > > > > > > I tried to run it with -log_summary output and I found that > there's > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > some > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > errors in the execution. 
Well, I was busy with other things > and I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > just > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > came > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > back to this problem. Some of my files on the server has > also been > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > deleted. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It has been a while and I remember that it worked before, > only > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > much > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > slower. > > > > > > > > > > > > > > > > > > Anyway, most of the serial code has been updated and maybe > it's > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > easier > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > convert the new serial code instead of debugging on the old > parallel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > code > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > now. I believe I can still reuse part of the old parallel > code. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > However, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > hope I can approach it better this time. > > > > > > > > > > > > > > > > > > So supposed I need to start converting my new serial code to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > parallel. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > There's 2 eqns to be solved using PETSc, the momentum and > poisson. I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > also > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > need to parallelize other parts of my code. I wonder which > route is > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > best: > > > > > > > > > > > > > > > > > > 1. Don't change the PETSc part ie continue using > PETSC_COMM_SELF, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > modify > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > other parts of my code to parallel e.g. looping, updating of > values > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > etc. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Once the execution is fine and speedup is reasonable, then > modify > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > PETSc > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > part - poisson eqn 1st followed by the momentum eqn. > > > > > > > > > > > > > > > > > > 2. Reverse the above order ie modify the PETSc part - > poisson eqn > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1st > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > followed by the momentum eqn. Then do other parts of my > code. > > > > > > > > > > > > > > > > > > I'm not sure if the above 2 mtds can work or if there will > be > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > conflicts. Of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > course, an alternative will be: > > > > > > > > > > > > > > > > > > 3. Do the poisson, momentum eqns and other parts of the code > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > separately. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > That is, code a standalone parallel poisson eqn and use > samples > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > values > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > test it. Same for the momentum and other parts of the code. > When > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > each of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > them is working, combine them to form the full parallel > code. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > However, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > this > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > will be much more troublesome. > > > > > > > > > > > > > > > > > > I hope someone can give me some recommendations. > > > > > > > > > > > > > > > > > > Thank you once again. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Matthew Knepley wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1) There is no way to have any idea what is going on in > your code > > > > > > > > > > without -log_summary output > > > > > > > > > > > > > > > > > > > > 2) Looking at that output, look at the percentage taken by > the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > solver > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > KSPSolve event. 
I suspect it is not the biggest component, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > because > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > it is very scalable. > > > > > > > > > > > > > > > > > > > > Matt > > > > > > > > > > > > > > > > > > > > On Sun, Apr 13, 2008 at 4:12 AM, Ben Tay > <zonexo at gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, > > > > > > > > > > > > > > > > > > > > > > I've a serial 2D CFD code. As my grid size requirement > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > increases, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > simulation takes longer. Also, memory requirement > becomes a > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > problem. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Grid > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > size 've reached 1200x1200. Going higher is not possible > due to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > memory > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > problem. > > > > > > > > > > > > > > > > > > > > > > I tried to convert my code to a parallel one, following > the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > examples > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > given. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I also need to restructure parts of my code to enable > parallel > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > looping. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1st changed the PETSc solver to be parallel enabled and > then I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > restructured > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > parts of my code. I proceed on as longer as the answer > for a > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > simple > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > test > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > case is correct. I thought it's not really possible to > do any > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > speed > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > testing > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > since the code is not fully parallelized yet. When I > finished > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > during > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > most of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > the conversion, I found that in the actual run that it > is much > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > slower, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > although the answer is correct. > > > > > > > > > > > > > > > > > > > > > > So what is the remedy now? I wonder what I should do to > check > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > what's > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > wrong. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Must I restart everything again? 
Btw, my grid size is > 1200x1200. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > believed > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > it should be suitable for parallel run of 4 processors? > Is that > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > so? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead. -- Norbert Wiener
