> On Jul 10, 2015, at 12:34 PM, Ganesh Vijayakumar <[email protected]>
> wrote:
>
> Hello,
>
> On Thu, Jul 9, 2015 at 7:32 PM Barry Smith <[email protected]> wrote:
>
> Ok, it is block Jacobi with ICC on each block (one per process), so
> -ksp_type cg -pc_type bjacobi -sub_pc_type icc with PETSc should give
> similar results to what they get.
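For reference, here is a minimal sketch of setting the same thing up in code instead of on the
command line (it assumes ksp is an already-created KSP whose operators have been set; block
Jacobi creates one sub-KSP per block, which is one per process here):

   #include <petscksp.h>

   /* Sketch only: the equivalent of -ksp_type cg -pc_type bjacobi -sub_pc_type icc */
   PetscErrorCode ConfigureCGBlockJacobiICC(KSP ksp)
   {
     PC             pc,subpc;
     KSP           *subksp;
     PetscInt       i,nlocal,first;
     PetscErrorCode ierr;

     PetscFunctionBeginUser;
     ierr = KSPSetType(ksp,KSPCG);CHKERRQ(ierr);
     ierr = KSPGetPC(ksp,&pc);CHKERRQ(ierr);
     ierr = PCSetType(pc,PCBJACOBI);CHKERRQ(ierr);
     ierr = KSPSetUp(ksp);CHKERRQ(ierr);                           /* creates the sub-KSPs */
     ierr = PCBJacobiGetSubKSP(pc,&nlocal,&first,&subksp);CHKERRQ(ierr);
     for (i=0; i<nlocal; i++) {                                    /* set ICC on each local block */
       ierr = KSPGetPC(subksp[i],&subpc);CHKERRQ(ierr);
       ierr = PCSetType(subpc,PCICC);CHKERRQ(ierr);
     }
     PetscFunctionReturn(0);
   }

The command-line options are usually the easier route since they can be changed without
recompiling.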
>
> >
> > Where is all the data? It should list all the events and the time spent
> > in each. Did you use PetscOptionsSetValue() to provide -log_summary? That
> > won't work; you need to pass it on the command line, in the PETSC_OPTIONS
> > environment variable, or in a file called petscrc.
>
> Using Petsc Release Version 3.5.3, Jan, 31, 2015
>
> Max Max/Min Avg Total
> Time (sec): 9.756e+01 1.00369 9.726e+01
> Objects: 4.500e+01 1.00000 4.500e+01
> Flops: 1.256e+08 1.17291 1.184e+08 3.031e+10
> Flops/sec: 1.292e+06 1.17364 1.217e+06 3.116e+08
> MPI Messages: 3.956e+03 21.50000 1.167e+03 2.986e+05
> MPI Message Lengths: 6.769e+06 4.61934 3.787e+03 1.131e+09
> MPI Reductions: 3.120e+02 1.00000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                           e.g., VecAXPY() for real vectors of length N --> 2N flops
>                           and VecAXPY() for complex vectors of length N --> 8N flops
>
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
>  0:      Main Stage: 9.7259e+01 100.0%  3.0311e+10 100.0%  2.986e+05 100.0%  3.787e+03      100.0%  3.110e+02  99.7%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting
> output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flops: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> Avg. len: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> %T - percent time in this phase %F - percent flops in this phase
> %M - percent messages in this phase %L - percent message lengths in
> this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10^-6 * (sum of flops over all processors)/(max time over all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)      Flops                              --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio    Max  Ratio  Mess   Avg len Reduct   %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> VecMDot               75 1.0 1.1153e-01  2.0 2.28e+07 1.0 0.0e+00 0.0e+00 7.5e+01  0 19  0  0 24   0 19  0  0 24  51865
> VecNorm              105 1.0 2.6864e-01  1.1 1.02e+07 1.0 0.0e+00 0.0e+00 1.0e+02  0  8  0  0 34   0  8  0  0 34   9580
> VecScale              90 1.0 2.2329e-02  6.7 4.35e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  4  0  0  0   0  4  0  0  0  49394
> VecSet               121 1.0 1.1327e-02  1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> VecAXPY               15 1.0 1.6739e-03  1.2 1.45e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0 219629
> VecWAXPY              15 1.0 2.1994e-03  1.9 7.25e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  83578
> VecMAXPY              90 1.0 2.7625e-02  1.8 3.01e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0 25  0  0  0   0 25  0  0  0 275924
> VecAssemblyBegin      30 1.0 1.2747e-02  1.7 0.00e+00 0.0 0.0e+00 0.0e+00 9.0e+01  0  0  0  0 29   0  0  0  0 29      0
> VecAssemblyEnd        30 1.0 5.1475e-04 26.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> VecScatterBegin       90 1.0 1.4369e-02  5.4 0.00e+00 0.0 2.9e+05 3.9e+03 0.0e+00  0  0 98 99  0   0  0 98 99  0      0
> VecScatterEnd         90 1.0 4.2581e-02 11.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> MatMult               90 1.0 1.6290e-01  1.6 5.63e+07 1.5 2.9e+05 3.9e+03 0.0e+00  0 42 98 99  0   0 42 98 99  0  77813
> MatConvert             5 1.0 4.0061e-02  1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> MatAssemblyBegin      10 1.0 1.2128e-01  3.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+01  0  0  0  0  6   0  0  0  0  6      0
> MatAssemblyEnd        10 1.0 5.6291e-02  1.0 0.00e+00 0.0 6.5e+03 9.6e+02 8.0e+00  0  0  2  1  3   0  0  2  1  3      0
> MatGetRowIJ           10 1.0 9.0599e-06  0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> MatZeroEntries         5 1.0 5.0242e-03  1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> MatView                5 1.0 3.0882e-03  3.5 0.00e+00 0.0 0.0e+00 0.0e+00 5.0e+00  0  0  0  0  2   0  0  0  0  2      0
> KSPGMRESOrthog        75 1.0 1.2176e-01  1.9 4.56e+07 1.0 0.0e+00 0.0e+00 7.5e+01  0 38  0  0 24   0 38  0  0 24  95014
> KSPSetUp               1 1.0 2.6391e-03  2.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> KSPSolve              15 1.0 9.3209e+01  1.0 1.26e+08 1.2 2.9e+05 3.9e+03 1.8e+02 96 100 98 99 59  96 100 98 99 59    325
The next two lines are the important ones: it is spending 80% of the time setting up the
hypre BoomerAMG preconditioner and 16% of the time applying it; everything else is trivial.
> PCSetUp                5 1.0 7.7425e+01  1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 80  0  0  0  1  80  0  0  0  1      0
> PCApply               75 1.0 1.5272e+01  1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 16  0  0  0  0  16  0  0  0  0      0
> ------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Vector 36 34 12361096 0
> Vector Scatter 1 1 1060 0
> Matrix 3 3 10175808 0
> Krylov Solver 1 1 18960 0
> Preconditioner 1 1 1096 0
> Viewer 1 0 0 0
> Index Set 2 2 38884 0
> ========================================================================================================================
> Average time to get PetscTime():
> Average time for MPI_Barrier(): 1.7786e-05
> Average time for zero size MPI_Send(): 0.000176195
The times for MPI_Barrier() and MPI_Send() are HUGE on your machine. This will limit how
fast anything can run. I am surprised they are so large; isn't Stampede supposed to be a
high-end parallel machine?
> #PETSc Option Table entries:
> -info blah
> -log_summary
> -mat_view ::ascii_info
> -parallel
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --with-x=0 -with-pic
> --with-external-packages-dir=/opt/apps/intel13/mvapich2_1_9/petsc/3.5/externalpackages
> --with-mpi-compilers=1 --with-mpi-dir=/opt/apps/intel13/mvapich2/1.9
> --with-scalar-type=real --with-shared-libraries=1 --with-precision=double
> --with-hypre=1 --download-hypre --with-ml=1 --download-ml --with-ml=1
> --download-ml --with-superlu_dist=1 --download-superlu_dist --with-superlu=1
> --download-superlu --with-parmetis=1 --download-parmetis --with-metis=1
> --download-metis --with-spai=1 --download-spai --with-mumps=1
> --download-mumps --with-parmetis=1 --download-parmetis --with-metis=1
> --download-metis --with-scalapack=1 --download-scalapack --with-blacs=1
> --download-blacs --with-spooles=1 --download-spooles --with-hdf5=1
> --with-hdf5-dir=/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9
> --with-debugging=no
> --with-blas-lapack-dir=/opt/apps/intel/13/composer_xe_2013.2.146/mkl
> --with-mpiexec=mpirun_rsh --COPTFLAGS= --FOPTFLAGS= --CXXOPTFLAGS=
You should set --COPTFLAGS, --FOPTFLAGS, and --CXXOPTFLAGS to at least -O1, maybe -O3.
Currently you are compiling without optimization, which is BAD.
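For example, reconfiguring with something like --COPTFLAGS="-O3" --FOPTFLAGS="-O3"
--CXXOPTFLAGS="-O3" (leaving the rest of the configure line unchanged) would take care of it.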
> -----------------------------------------
> Libraries compiled on Thu Apr 2 10:06:57 2015 on
> staff.stampede.tacc.utexas.edu
> Machine characteristics:
> Linux-2.6.32-431.17.1.el6.x86_64-x86_64-with-centos-6.6-Final
> Using PETSc directory: /opt/apps/intel13/mvapich2_1_9/petsc/3.5
> Using PETSc arch: sandybridge
> -----------------------------------------
>
> Using C compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpicc -fPIC -wd1572
> ${COPTFLAGS} ${CFLAGS}
> Using Fortran compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpif90 -fPIC
> ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
>
> Using include paths:
> -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include
> -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include
> -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include
> -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include
> -I/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/include
> -I/opt/apps/intel13/mvapich2/1.9/include
> -----------------------------------------
> Using C linker: /opt/apps/intel13/mvapich2/1.9/bin/mpicc
> Using Fortran linker: /opt/apps/intel13/mvapich2/1.9/bin/mpif90
> Using libraries:
> -Wl,-rpath,/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/lib
> -L/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/lib -lpetsc
> -Wl,-rpath,/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/lib
> -L/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/lib -lsuperlu_4.3
> -lHYPRE -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64
> -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib
> -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib
> -L/opt/apps/intel13/mvapich2/1.9/lib
> -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64
> -L/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7
> -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -lmpichcxx -lml -lmpichcxx -lspai
> -lsuperlu_dist_3.3 -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord
> -lscalapack
> -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/mkl/lib/intel64
> -L/opt/apps/intel/13/composer_xe_2013.2.146/mkl/lib/intel64 -lmkl_intel_lp64
> -lmkl_sequential -lmkl_core -lpthread -lm -lparmetis -lmetis -lpthread -lssl
> -lcrypto -Wl,-rpath,/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/lib
> -L/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/lib -lhdf5hl_fortran
> -lhdf5_fortran -lhdf5_hl -lhdf5 -lmpichf90 -lifport -lifcore -lm -lmpichcxx
> -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64
> -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib
> -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64
> -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib -ldl
> -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib
> -L/opt/apps/intel13/mvapich2/1.9/lib -lmpich -lopa -lmpl -libmad -lrdmacm
> -libumad -libverbs -lrt -llimic2 -lpthread -Wl,-rpath,/opt/ofed/lib64
> -L/opt/ofed/lib64 -Wl,-rpath,/opt/apps/limic2/0.5.5/lib
> -L/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64
> -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib
> -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib
> -L/opt/apps/intel13/mvapich2/1.9/lib
> -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64
> -L/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7
> -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7
> -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64
> -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/ofed/lib64
> -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib -limf -lsvml -lirng -lipgo
> -ldecimal -lcilkrts -lstdc++ -lgcc_s -lirc -lirc_s -Wl,-rpath,/opt/ofed/lib64
> -L/opt/ofed/lib64 -Wl,-rpath,/opt/apps/limic2/0.5.5/lib
> -L/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64
> -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib
> -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib
> -L/opt/apps/intel13/mvapich2/1.9/lib
> -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64
> -L/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7
> -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -ldl
> -----------------------------------------
>
> Finalising parallel run
>
>
> > 1. How do I tell PETSc that my matrix is symmetric? I tried setting my
> > matrix as follows... but am apprehensive of it.
> >
> > MatCreateSBAIJ(PETSC_COMM_WORLD, 1, nCellsCurProc, nCellsCurProc,
> > nTotalCells, nTotalCells, 10, NULL, 5, NULL, &A);
> >
> > Could I still use MatSetValue() on both the upper and lower triangular
> > parts of the matrix? Will PETSc understand that it's redundant?
>
> Yes, run with -mat_ignore_lower_triangular or call
> MatSetOption(mat,MAT_IGNORE_LOWER_TRIANGULAR,PETSC_TRUE)
>
> This is very useful, thanks.
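To spell out the option above, a minimal sketch (it assumes A is the SBAIJ matrix from your
MatCreateSBAIJ() call and that row, col, value come from your assembly loop):

   /* With this option set, MatSetValue() calls that land in the lower triangle of the
      symmetric (SBAIJ) matrix are silently ignored instead of generating an error,
      so assembling both halves is harmless. */
   ierr = MatSetOption(A,MAT_IGNORE_LOWER_TRIANGULAR,PETSC_TRUE);CHKERRQ(ierr);
   ierr = MatSetValue(A,row,col,value,ADD_VALUES);CHKERRQ(ierr);   /* kept when col >= row */
   ierr = MatSetValue(A,col,row,value,ADD_VALUES);CHKERRQ(ierr);   /* lower-triangular duplicate, ignored */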
>
> I have a question on setting block sizes. Should I create 1 block per
> processor?
No, the block size has nothing to do with parallelism; it is 1 in your case because you are
solving a scalar PDE (pressure).
> If so, what do I set d_nz and o_nz to? Right now I allocate memory for 10
> nonzero elements per row that are local to the processor and 5 nonzero
> elements that are non-local. So my understanding was that
>
> MatCreateSBAIJ(PETSC_COMM_WORLD, 1, nCellsCurProc, nCellsCurProc,
> nTotalCells, nTotalCells, 10, NULL, 5, NULL, &A);
>
> should become
>
> MatCreateSBAIJ(PETSC_COMM_WORLD, nCellsCurProc, nCellsCurProc, nCellsCurProc,
> nTotalCells, nTotalCells, 10*nCellsCurProc, NULL, 5*nCellsCurProc, NULL, &A);
>
>
> But PETSc doesn't seem to like this: it complains that it is out of memory
> and throws a whole lot of error messages. Clearly something is wrong. Could
> you please tell me what it is?
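Since the block size stays 1 (see above), d_nz and o_nz are per row, not per process, so they
should not be scaled by nCellsCurProc; asking for 10*nCellsCurProc nonzeros in every row is
presumably what runs you out of memory. In other words, the call keeps the shape of your
original one:

   ierr = MatCreateSBAIJ(PETSC_COMM_WORLD,1,            /* block size stays 1 */
                         nCellsCurProc,nCellsCurProc,   /* local rows and columns */
                         nTotalCells,nTotalCells,       /* global rows and columns */
                         10,NULL,                       /* nonzeros per row (upper triangle) in the diagonal block */
                         5,NULL,                        /* nonzeros per row (upper triangle) in the off-diagonal blocks */
                         &A);CHKERRQ(ierr);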
>
> > 2. Do I need PCFactorSetShiftType(pc,MAT_SHIFT_POSITIVE_DEFINITE); ?
>
> I hope not. But you might.
>
> Ok. I tried with and without it; it doesn't seem to make a difference, so it
> is off for now. I will turn it on if necessary.
>
> > 3. What does KSPSetReusePreconditioner(ksp, PETSC_TRUE) do? Should I use it?
>
> Not at first. What it does is not build a new preconditioner for each
> solve. If the matrix is changing "slowly" you can often get away with
> setting this for some number of linear solves, then set it back to false for
> the next solve, then set it to true again for some number of linear solves.
> You could try it with hypre, say keeping the preconditioner the same for 10,
> 50, or 100 solves, and see what happens time-wise.
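A concrete sketch of that pattern (a hypothetical helper; it assumes ksp is already configured
and the application updates the entries of A and b between solves):

   #include <petscksp.h>

   /* Rebuild the preconditioner only every rebuild_every solves; reuse it in between. */
   PetscErrorCode SolveSequence(KSP ksp,Mat A,Vec b,Vec x,PetscInt nsolves,PetscInt rebuild_every)
   {
     PetscInt       i;
     PetscErrorCode ierr;

     PetscFunctionBeginUser;
     for (i=0; i<nsolves; i++) {
       /* ... application code updates the entries of A and b here ... */
       ierr = KSPSetOperators(ksp,A,A);CHKERRQ(ierr);
       ierr = KSPSetReusePreconditioner(ksp,(PetscBool)(i % rebuild_every != 0));CHKERRQ(ierr);
       ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
     }
     PetscFunctionReturn(0);
   }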
>
> This was most useful. I did two things. First, I shifted the creation of the
> KSP object to the initialization stage, so there is no more creation and
> deletion of KSP objects. Second, I set ReusePreconditioner to true when the
> matrix changes and false when it doesn't. All of this got my execution time
> down from 250s to about 103s! I think that's great. Thanks again.
>
> ganesh