> On Jul 10, 2015, at 12:34 PM, Ganesh Vijayakumar <[email protected]> 
> wrote:
> 
> Hello,
> 
> On Thu, Jul 9, 2015 at 7:32 PM Barry Smith <[email protected]> wrote:
> 
>   Ok, it is block Jacobi with ICC on each block (one per process) so 
> -ksp_type cg -pc_type bjacobi -sub_pc_type icc with PETSc should give similar 
> results to what they get.
> 
> >
> > Where is all the data?  It should list all the events and time it spends in 
> > each. Did  you use PetscOptionsSetValue() to provide -log_summary? That 
> > won't work you need to pass it on the command line or in the PETSC_OPTIONS 
> > environmental variable or in a file called petscrc
> 
> Using Petsc Release Version 3.5.3, Jan, 31, 2015
> 
>                          Max       Max/Min        Avg      Total
> Time (sec):           9.756e+01      1.00369   9.726e+01
> Objects:              4.500e+01      1.00000   4.500e+01
> Flops:                1.256e+08      1.17291   1.184e+08  3.031e+10
> Flops/sec:            1.292e+06      1.17364   1.217e+06  3.116e+08
> MPI Messages:         3.956e+03     21.50000   1.167e+03  2.986e+05
> MPI Message Lengths:  6.769e+06      4.61934   3.787e+03  1.131e+09
> MPI Reductions:       3.120e+02      1.00000
> 
> Flop counting convention: 1 flop = 1 real number operation of type 
> (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 
> 2N flops
>                             and VecAXPY() for complex vectors of length N --> 
> 8N flops
> 
> Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  
> -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total   counts   %Total   
>   Avg         %Total   counts   %Total
>  0:      Main Stage: 9.7259e+01 100.0%  3.0311e+10 100.0%  2.986e+05 100.0%  
> 3.787e+03      100.0%  3.110e+02  99.7%
> 
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting 
> output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flops: Max - maximum over all processors
>                    Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    Avg. len: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and 
> PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flops in this phase
>       %M - percent messages in this phase     %L - percent message lengths in 
> this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over 
> all processors)
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flops                          
>    --- Global ---  --- Stage ---   Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len 
> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
> VecMDot               75 1.0 1.1153e-01 2.0 2.28e+07 1.0 0.0e+00 0.0e+00 
> 7.5e+01  0 19  0  0 24   0 19  0  0 24 51865
> VecNorm              105 1.0 2.6864e-01 1.1 1.02e+07 1.0 0.0e+00 0.0e+00 
> 1.0e+02  0  8  0  0 34   0  8  0  0 34  9580
> VecScale              90 1.0 2.2329e-02 6.7 4.35e+06 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  4  0  0  0   0  4  0  0  0 49394
> VecSet               121 1.0 1.1327e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAXPY               15 1.0 1.6739e-03 1.2 1.45e+06 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   0  1  0  0  0 219629
> VecWAXPY              15 1.0 2.1994e-03 1.9 7.25e+05 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0  1  0  0  0   0  1  0  0  0 83578
> VecMAXPY              90 1.0 2.7625e-02 1.8 3.01e+07 1.0 0.0e+00 0.0e+00 
> 0.0e+00  0 25  0  0  0   0 25  0  0  0 275924
> VecAssemblyBegin      30 1.0 1.2747e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 
> 9.0e+01  0  0  0  0 29   0  0  0  0 29     0
> VecAssemblyEnd        30 1.0 5.1475e-0426.7 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecScatterBegin       90 1.0 1.4369e-02 5.4 0.00e+00 0.0 2.9e+05 3.9e+03 
> 0.0e+00  0  0 98 99  0   0  0 98 99  0     0
> VecScatterEnd         90 1.0 4.2581e-0211.1 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMult               90 1.0 1.6290e-01 1.6 5.63e+07 1.5 2.9e+05 3.9e+03 
> 0.0e+00  0 42 98 99  0   0 42 98 99  0 77813
> MatConvert             5 1.0 4.0061e-02 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAssemblyBegin      10 1.0 1.2128e-01 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 2.0e+01  0  0  0  0  6   0  0  0  0  6     0
> MatAssemblyEnd        10 1.0 5.6291e-02 1.0 0.00e+00 0.0 6.5e+03 9.6e+02 
> 8.0e+00  0  0  2  1  3   0  0  2  1  3     0
> MatGetRowIJ           10 1.0 9.0599e-06 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatZeroEntries         5 1.0 5.0242e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatView                5 1.0 3.0882e-03 3.5 0.00e+00 0.0 0.0e+00 0.0e+00 
> 5.0e+00  0  0  0  0  2   0  0  0  0  2     0
> KSPGMRESOrthog        75 1.0 1.2176e-01 1.9 4.56e+07 1.0 0.0e+00 0.0e+00 
> 7.5e+01  0 38  0  0 24   0 38  0  0 24 95014
> KSPSetUp               1 1.0 2.6391e-03 2.1 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve              15 1.0 9.3209e+01 1.0 1.26e+08 1.2 2.9e+05 3.9e+03 
> 1.8e+02 96100 98 99 59  96100 98 99 59   325

  The next two lines are the important ones. It is spending 80% of the time in 
setting up the hypre BoomerAMG preconditioner and 16% of the time applying it. 
(everything else is trivial).

> PCSetUp                5 1.0 7.7425e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 4.0e+00 80  0  0  0  1  80  0  0  0  1     0
> PCApply               75 1.0 1.5272e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00 16  0  0  0  0  16  0  0  0  0     0
> ------------------------------------------------------------------------------------------------------------------------
> 
> Memory usage is given in bytes:
> 
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
> 
> --- Event Stage 0: Main Stage
> 
>               Vector    36             34     12361096     0
>       Vector Scatter     1              1         1060     0
>               Matrix     3              3     10175808     0
>        Krylov Solver     1              1        18960     0
>       Preconditioner     1              1         1096     0
>               Viewer     1              0            0     0
>            Index Set     2              2        38884     0
> ========================================================================================================================
> Average time to get PetscTime(): 
> Average time for MPI_Barrier(): 1.7786e-05
> Average time for zero size MPI_Send(): 0.000176195

    The times for MPI_Barrier and MPI_Send() are HUGE on your machine. This 
will limit how fast anything can run. I am surprised they are so large; isn't 
stampede suppose to be a high end parallel machine? 

> #PETSc Option Table entries:
> -info blah
> -log_summary
> -mat_view ::ascii_info
> -parallel
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --with-x=0 -with-pic 
> --with-external-packages-dir=/opt/apps/intel13/mvapich2_1_9/petsc/3.5/externalpackages
>  --with-mpi-compilers=1 --with-mpi-dir=/opt/apps/intel13/mvapich2/1.9 
> --with-scalar-type=real --with-shared-libraries=1 --with-precision=double 
> --with-hypre=1 --download-hypre --with-ml=1 --download-ml --with-ml=1 
> --download-ml --with-superlu_dist=1 --download-superlu_dist --with-superlu=1 
> --download-superlu --with-parmetis=1 --download-parmetis --with-metis=1 
> --download-metis --with-spai=1 --download-spai --with-mumps=1 
> --download-mumps --with-parmetis=1 --download-parmetis --with-metis=1 
> --download-metis --with-scalapack=1 --download-scalapack --with-blacs=1 
> --download-blacs --with-spooles=1 --download-spooles --with-hdf5=1 
> --with-hdf5-dir=/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9 
> --with-debugging=no 
> --with-blas-lapack-dir=/opt/apps/intel/13/composer_xe_2013.2.146/mkl 
> --with-mpiexec=mpirun_rsh --COPTFLAGS= --FOPTFLAGS= --CXXOPTFLAGS=

  You should set -COPTFLAGS= --FOPTFLAGS= --CXXOPTFLAGS= to at least -O1 maybe 
-O3.  Currently you are compiling without optimization which is BAD.


> -----------------------------------------
> Libraries compiled on Thu Apr  2 10:06:57 2015 on 
> staff.stampede.tacc.utexas.edu
> Machine characteristics: 
> Linux-2.6.32-431.17.1.el6.x86_64-x86_64-with-centos-6.6-Final
> Using PETSc directory: /opt/apps/intel13/mvapich2_1_9/petsc/3.5
> Using PETSc arch: sandybridge
> -----------------------------------------
> 
> Using C compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpicc  -fPIC -wd1572   
> ${COPTFLAGS} ${CFLAGS}
> Using Fortran compiler: /opt/apps/intel13/mvapich2/1.9/bin/mpif90  -fPIC    
> ${FOPTFLAGS} ${FFLAGS}
> -----------------------------------------
> 
> Using include paths: 
> -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include 
> -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include 
> -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/include 
> -I/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/include 
> -I/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/include 
> -I/opt/apps/intel13/mvapich2/1.9/include
> -----------------------------------------
> Using C linker: /opt/apps/intel13/mvapich2/1.9/bin/mpicc
> Using Fortran linker: /opt/apps/intel13/mvapich2/1.9/bin/mpif90
> Using libraries: 
> -Wl,-rpath,/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/lib 
> -L/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/lib -lpetsc 
> -Wl,-rpath,/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/lib 
> -L/opt/apps/intel13/mvapich2_1_9/petsc/3.5/sandybridge/lib -lsuperlu_4.3 
> -lHYPRE -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64 
> -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib 
> -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib 
> -L/opt/apps/intel13/mvapich2/1.9/lib 
> -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 
> -L/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 
> -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -lmpichcxx -lml -lmpichcxx -lspai 
> -lsuperlu_dist_3.3 -lcmumps -ldmumps -lsmumps -lzmumps -lmumps_common -lpord 
> -lscalapack 
> -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/mkl/lib/intel64 
> -L/opt/apps/intel/13/composer_xe_2013.2.146/mkl/lib/intel64 -lmkl_intel_lp64 
> -lmkl_sequential -lmkl_core -lpthread -lm -lparmetis -lmetis -lpthread -lssl 
> -lcrypto -Wl,-rpath,/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/lib 
> -L/opt/apps/intel13/mvapich2_1_9/phdf5/1.8.9/lib -lhdf5hl_fortran 
> -lhdf5_fortran -lhdf5_hl -lhdf5 -lmpichf90 -lifport -lifcore -lm -lmpichcxx 
> -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64 
> -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib 
> -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64 
> -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib -ldl 
> -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib 
> -L/opt/apps/intel13/mvapich2/1.9/lib -lmpich -lopa -lmpl -libmad -lrdmacm 
> -libumad -libverbs -lrt -llimic2 -lpthread -Wl,-rpath,/opt/ofed/lib64 
> -L/opt/ofed/lib64 -Wl,-rpath,/opt/apps/limic2/0.5.5/lib 
> -L/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64 
> -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib 
> -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib 
> -L/opt/apps/intel13/mvapich2/1.9/lib 
> -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 
> -L/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 
> -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 
> -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 
> -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/ofed/lib64 
> -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib -limf -lsvml -lirng -lipgo 
> -ldecimal -lcilkrts -lstdc++ -lgcc_s -lirc -lirc_s -Wl,-rpath,/opt/ofed/lib64 
> -L/opt/ofed/lib64 -Wl,-rpath,/opt/apps/limic2/0.5.5/lib 
> -L/opt/apps/limic2/0.5.5/lib -Wl,-rpath,/opt/ofed/lib64 -L/opt/ofed/lib64 
> -Wl,-rpath,/opt/apps/limic2/0.5.5/lib -L/opt/apps/limic2/0.5.5/lib 
> -Wl,-rpath,/opt/apps/intel13/mvapich2/1.9/lib 
> -L/opt/apps/intel13/mvapich2/1.9/lib 
> -Wl,-rpath,/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 
> -L/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/intel64 
> -Wl,-rpath,/usr/lib/gcc/x86_64-redhat-linux/4.4.7 
> -L/usr/lib/gcc/x86_64-redhat-linux/4.4.7 -ldl
> -----------------------------------------
> 
> Finalising parallel run
> 
>  
> > 1. How I do I tell PETSC that my matrix is symmetric. I tried setting my 
> > matrix as follows... but am apprehensive of it.
> >
> > MatCreateSBAIJ(PETSC_COMM_WORLD, 1, nCellsCurProc, nCellsCurProc, 
> > nTotalCells, nTotalCells, 10, NULL, 5, NULL, &A);
> >
> >  Could I still use MatSetValue on both upper and lower diagonal part of the 
> > matrix. Will PETSC understand that it's redundant?
> 
>   Yes, run with -mat_ignore_lower_triangular or call 
> MatSetOption(mat,MAT_IGNORE_LOWER_TRIANGULAR,PETSC_TRUE)
> 
> This is very useful.. thanks.
> 
> I have a question on setting block sizes. Should I create 1 block per 
> processor?

  No the block size has nothing to do with parallelism; it is 1 in your case 
because you have a scalar PDE (pressure) that you are solving.

> If so what do I set the d_nz and o_nz as? Right now I allocate memory for 10 
> non-zero elements per row that are local to the processor and 5 non-zero 
> elements that are non-local. So my understanding was that 
> 
> MatCreateSBAIJ(PETSC_COMM_WORLD, 1, nCellsCurProc, nCellsCurProc, 
> nTotalCells, nTotalCells, 10, NULL, 5, NULL, &A);
> 
> should become
> 
> MatCreateSBAIJ(PETSC_COMM_WORLD, nCellsCurProc, nCellsCurProc, nCellsCurProc, 
> nTotalCells, nTotalCells, 10*nCellsCurProc, NULL, 5*nCellsCurProc, NULL, &A);
> 
> 
> But PETSC doesn't seem to like this. It complains that it's out of memory and 
> throws a whole lot of error messages. Clearly something's wrong. Could you 
> please tell me what is.
> 
> > 2. Do I need PCFactorSetShiftType(pc,MAT_SHIFT_POSITIVE_DEFINITE); ?
> 
>   I hope not. But you might.
> 
> Ok. I tried with and without it... doesn't seem to make a difference. So off 
> for now. Will turn it on if necessary.
> 
> > 3. What does KSPSetReusePreconditioner(ksp, PETSC_TRUE) do? Should I use it?
> 
>   Not at first. What it does it not build a new preconditioner for each 
> solve. If the matrix is changing "slowly" you can often get away with setting 
> this for some number of linear solvers, then set it back to false for the 
> next solve then set it to true again for some number of linear solvers.  You 
> could try it with hypre, say keeping it the same for 10, 50, 100 solves and 
> see what happens time wise.
> 
> This was most useful. I did two things. First I shifted the creation of the 
> KSP object to the initialization stage. So no more creation and deletion of 
> KSP objects. Second I set ReusePreconditioner to true when the matrix changes 
> and false when it doesn't. All of this got my execution time down from 250s 
> to about 103s! I think that's great. Thanks again.
> 
> ganesh

Reply via email to