Re: [petsc-users] HPCToolKit/HPCViewer on OS X

Griffith, Boyce Eugene Thu, 14 Jan 2016 12:32:25 -0800

On Jan 14, 2016, at 3:09 PM, Barry Smith 
<[email protected]<mailto:[email protected]>> wrote:



On Jan 14, 2016, at 2:01 PM, Griffith, Boyce Eugene 
<[email protected]<mailto:[email protected]>> wrote:


On Jan 14, 2016, at 2:24 PM, Barry Smith 
<[email protected]<mailto:[email protected]>> wrote:


Matt is right, there is a lot of "missing" time from the output. Please send 
the output from -ksp_view so we can see exactly what solver is being used.

>From the output we have:

 Nonlinear solver 78 % of the time (so your "setup code" outside of PETSC is 
taking about 22% of the time)
 Linear solver 77 % of the time (this is reasonable pretty much the entire cost 
of the nonlinear solve is the linear solve)
 Time to set up the preconditioner is 19%  (10 + 9)
 Time of iteration in KSP 35 % (this is the sum of the vector operations and 
MatMult() and MatSolve())

  So 77 - (19 + 35) = 23 % unexplained time inside the linear solver (custom 
preconditioner???)

 Also getting the results with Instruments or HPCToolkit would be useful (so 
long as we don't need to install HPCTool ourselves to see the results).

Thanks, Barry (& Matt & Dave) --- This is a solver that is mixing some 
matrix-based stuff implemented using PETSc along with some matrix-free stuff 
that is built on top of SAMRAI. Amneet and I should take a look at performance 
off-list first.

  Just put an PetscLogEvent() in (or several) to track that part. Plus put an 
event or two outside the SNESSolve to track the outside PETSc setup time.

  The PETSc time looks reasonable at most I can only image any optimizations we 
could do bringing it down a small percentage.

Here is a bit more info about what we are trying to do:

This is a Vanka-type MG preconditioner for a Stokes-like system on a structured 
grid. (Currently just uniform grids, but hopefully soon with AMR.) For the 
smoother, we are using damped Richardson + ASM with relatively small block 
subdomains --- e.g., all DOFs associated with 8x8 cells in 2D (~300 DOFs), or 
8x8x8 in 3D (~2500 DOFs). Unfortunately, MG iteration counts really tank when 
using smaller subdomains.

I can't remember whether we have quantified this carefully, but PCASM seems to 
bog down with smaller subdomains. A question is whether there are different 
implementation choices that could make the case of "lots of little subdomains" 
run faster. But before we get to that, Amneet and I should take a more careful 
look at overall solver performance.

(We are also starting to play around with PCFIELDSPLIT for this problem too, 
although we don't have many ideas about how to handle the Schur complement.)

Thanks,

-- Boyce



  Barry


-- Boyce



Barry

On Jan 14, 2016, at 1:26 AM, Bhalla, Amneet Pal S 
<[email protected]<mailto:[email protected]>> wrote:



On Jan 13, 2016, at 9:17 PM, Griffith, Boyce Eugene 
<[email protected]<mailto:[email protected]>> wrote:

I see one hot spot:


Here is with opt build

************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r 
-fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: 
----------------------------------------------

./main2d on a linux-opt named aorta with 1 processor, by amneetb Thu Jan 14 
02:24:43 2016
Using Petsc Development GIT revision: v3.6.3-3098-ga3ecda2  GIT Date: 
2016-01-13 21:30:26 -0600

                      Max       Max/Min        Avg      Total
Time (sec):           1.018e+00      1.00000   1.018e+00
Objects:              2.935e+03      1.00000   2.935e+03
Flops:                4.957e+08      1.00000   4.957e+08  4.957e+08
Flops/sec:            4.868e+08      1.00000   4.868e+08  4.868e+08
MPI Messages:         0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00      0.00000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00      0.00000

Flop counting convention: 1 flop = 1 real number operation of type 
(multiply/divide/add/subtract)
                         e.g., VecAXPY() for real vectors of length N --> 2N 
flops
                         and VecAXPY() for complex vectors of length N --> 8N 
flops

Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- 
Message Lengths --  -- Reductions --
                     Avg     %Total     Avg     %Total   counts   %Total     
Avg         %Total   counts   %Total
0:      Main Stage: 1.0183e+00 100.0%  4.9570e+08 100.0%  0.000e+00   0.0%  
0.000e+00        0.0% 0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting 
output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
                Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and 
PetscLogStagePop().
   %T - percent time in this phase         %F - percent flops in this phase
   %M - percent messages in this phase     %L - percent message lengths in this 
phase
   %R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all 
processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                            
 --- Global ---  --- Stage ---   Total
                Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  
%T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecDot                 4 1.0 2.9564e-05 1.0 3.31e+04 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0 0  1120
VecDotNorm2          272 1.0 1.4565e-03 1.0 4.25e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   0  1 0  0  0  2920
VecMDot              624 1.0 8.4300e-03 1.0 5.29e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  1  1  0  0  0   1  1  0 0  0   627
VecNorm              565 1.0 3.8033e-03 1.0 4.38e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   0  1  0 0  0  1151
VecScale              86 1.0 5.5480e-04 1.0 1.55e+05 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0 0  0   279
VecCopy               28 1.0 5.2261e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0 0  0     0
VecSet             14567 1.0 1.2443e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   1  0  0 0  0     0
VecAXPY              903 1.0 4.2996e-03 1.0 6.66e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  0  1  0  0  0   0  1  0 0  0  1550
VecAYPX              225 1.0 1.2550e-03 1.0 8.55e+05 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0 0  0   681
VecAXPBYCZ            42 1.0 1.7118e-04 1.0 3.45e+05 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0 0  0  0  2014
VecWAXPY              70 1.0 1.9503e-04 1.0 2.98e+05 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0 0  0  0  1528
VecMAXPY             641 1.0 1.1136e-02 1.0 5.29e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  1  1  0  0  0   1  1 0  0  0   475
VecSwap              135 1.0 4.5896e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0 0  0     0
VecAssemblyBegin     745 1.0 4.9477e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0 0  0  0  0     0
VecAssemblyEnd       745 1.0 9.2411e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0 0  0  0  0     0
VecScatterBegin    40831 1.0 3.4502e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0   3 0  0  0  0     0
BuildTwoSidedF       738 1.0 2.6712e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0 0  0  0     0
MatMult              513 1.0 9.1235e-02 1.0 7.75e+07 1.0 0.0e+00 0.0e+00 
0.0e+00  9 16  0  0  0   9 16  0 0  0   849
MatSolve           13568 1.0 2.3605e-01 1.0 3.45e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 23 70  0  0  0  23 70 0  0  0  1460
MatLUFactorSym        84 1.0 3.7430e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  4  0  0  0  0   4  0 0  0  0     0
MatLUFactorNum        85 1.0 3.9623e-02 1.0 4.19e+07 1.0 0.0e+00 0.0e+00 
0.0e+00  4  8  0  0  0   4  8 0  0  0  1058
MatILUFactorSym        1 1.0 3.3617e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0 0  0  0     0
MatScale               4 1.0 2.5511e-04 1.0 2.51e+05 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0  0 0   984
MatAssemblyBegin     108 1.0 6.3658e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0 0  0  0  0     0
MatAssemblyEnd       108 1.0 2.9490e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0 0  0  0  0     0
MatGetRow          33120 1.0 2.0157e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  2  0  0  0  0   2  0 0  0  0     0
MatGetRowIJ           85 1.0 1.2145e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0 0  0  0     0
MatGetSubMatrice       4 1.0 8.4379e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   1  0 0  0  0     0
MatGetOrdering        85 1.0 7.7887e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   1  0 0  0  0     0
MatAXPY                4 1.0 4.9596e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  5  0  0  0  0   5  0  0 0  0     0
MatPtAP                4 1.0 4.4426e-02 1.0 4.99e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  4  1  0  0  0   4  1  0  0 0   112
MatPtAPSymbolic        4 1.0 2.7664e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  3  0  0  0  0   3  0 0  0  0     0
MatPtAPNumeric         4 1.0 1.6732e-02 1.0 4.99e+06 1.0 0.0e+00 0.0e+00 
0.0e+00  2  1  0  0  0   2  1 0  0  0   298
MatGetSymTrans         4 1.0 3.6621e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0 0  0  0     0
KSPGMRESOrthog        16 1.0 9.7778e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  1  0  0  0  0   1 0  0  0  0     0
KSPSetUp              90 1.0 5.7650e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0  0 0  0     0
KSPSolve               1 1.0 7.8831e-01 1.0 4.90e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 77 99  0  0  0  77 99 0  0  0   622
PCSetUp               90 1.0 9.9725e-02 1.0 4.19e+07 1.0 0.0e+00 0.0e+00 
0.0e+00 10  8  0  0  0  10  8  0 0  0   420
PCSetUpOnBlocks      112 1.0 8.7547e-02 1.0 4.19e+07 1.0 0.0e+00 0.0e+00 
0.0e+00  9  8  0  0  0   9 8  0  0  0   479
PCApply               16 1.0 7.1952e-01 1.0 4.89e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 71 99  0  0  0  71 99 0  0  0   680
SNESSolve              1 1.0 7.9225e-01 1.0 4.90e+08 1.0 0.0e+00 0.0e+00 
0.0e+00 78 99  0  0  0  78 99 0  0  0   619
SNESFunctionEval       2 1.0 3.2940e-03 1.0 4.68e+04 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0 0  0  0    14
SNESJacobianEval       1 1.0 4.7255e-04 1.0 4.26e+03 1.0 0.0e+00 0.0e+00 
0.0e+00  0  0  0  0  0   0  0 0  0  0     9
------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

           Vector   971            839     15573352     0.
   Vector Scatter   290            289       189584     0.
        Index Set  1171            823       951928     0.
IS L to G Mapping   110            109      2156656     0.
Application Order     6              6        99952     0.
          MatMFFD     1              1          776     0.
           Matrix   189            189     24083332     0.
Matrix Null Space     4              4         2432     0.
    Krylov Solver    90             90       122720     0.
  DMKSP interface     1              1          648     0.
   Preconditioner    90             90        89872     0.
             SNES     1              1         1328     0.
   SNESLineSearch     1              1          984     0.
           DMSNES     1              1          664     0.
 Distributed Mesh     2              2         9168     0.
Star Forest Bipartite Graph     4              4         3168     0.
  Discrete System     2              2         1712     0.
           Viewer     1              0            0     0.
========================================================================================================================
Average time to get PetscTime(): 9.53674e-07
#PETSc Option Table entries:
-ib_ksp_converged_reason
-ib_ksp_monitor_true_residual
-ib_snes_type ksponly
-log_summary
-stokes_ib_pc_level_0_sub_pc_factor_nonzeros_along_diagonal
-stokes_ib_pc_level_0_sub_pc_type ilu
-stokes_ib_pc_level_ksp_richardson_self_scale
-stokes_ib_pc_level_ksp_type richardson
-stokes_ib_pc_level_pc_asm_local_type additive
-stokes_ib_pc_level_sub_pc_factor_nonzeros_along_diagonal
-stokes_ib_pc_level_sub_pc_type lu
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --CC=mpicc --CXX=mpicxx --FC=mpif90 --with-default-arch=0 
--PETSC_ARCH=linux-opt --with-debugging=0 --with-c++-support=1 --with-hypre=1 
--download-hypre=1 --with-hdf5=yes --COPTFLAGS=-O3 --CXXOPTFLAGS=-O3 
--FOPTFLAGS=-O3
-----------------------------------------
Libraries compiled on Thu Jan 14 01:29:56 2016 on aorta
Machine characteristics: Linux-3.13.0-63-generic-x86_64-with-Ubuntu-14.04-trusty
Using PETSc directory: /not_backed_up/amneetb/softwares/PETSc-BitBucket/PETSc
Using PETSc arch: linux-opt
-----------------------------------------

Using C compiler: mpicc  -fPIC  -Wall -Wwrite-strings -Wno-strict-aliasing 
-Wno-unknown-pragmas -Qunused-arguments -O3  ${COPTFLAGS} ${CFLAGS}
Using Fortran compiler: mpif90  -fPIC -Wall -Wno-unused-variable 
-ffree-line-length-0 -Wno-unused-dummy-argument -O3   ${FOPTFLAGS} ${FFLAGS}
-----------------------------------------

Using include paths: 
-I/not_backed_up/amneetb/softwares/PETSc-BitBucket/PETSc/linux-opt/include 
-I/not_backed_up/amneetb/softwares/PETSc-BitBucket/PETSc/include 
-I/not_backed_up/amneetb/softwares/PETSc-BitBucket/PETSc/include 
-I/not_backed_up/amneetb/softwares/PETSc-BitBucket/PETSc/linux-opt/include 
-I/not_backed_up/softwares/MPICH/include
-----------------------------------------

Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: 
-Wl,-rpath,/not_backed_up/amneetb/softwares/PETSc-BitBucket/PETSc/linux-opt/lib 
-L/not_backed_up/amneetb/softwares/PETSc-BitBucket/PETSc/linux-opt/lib -lpetsc 
-Wl,-rpath,/not_backed_up/amneetb/softwares/PETSc-BitBucket/PETSc/linux-opt/lib 
-L/not_backed_up/amneetb/softwares/PETSc-BitBucket/PETSc/linux-opt/lib -lHYPRE 
-Wl,-rpath,/not_backed_up/softwares/MPICH/lib 
-L/not_backed_up/softwares/MPICH/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.8 
-L/usr/lib/gcc/x86_64-linux-gnu/4.8 -Wl,-rpath,/usr/lib/x86_64-linux-gnu 
-L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/lib/x86_64-linux-gnu 
-L/lib/x86_64-linux-gnu -lmpicxx -lstdc++ -llapack -lblas -lpthread 
-lhdf5hl_fortran -lhdf5_fortran -lhdf5_hl -lhdf5 -lX11 -lm -lmpifort -lgfortran 
-lm -lgfortran -lm -lquadmath -lm -lmpicxx -lstdc++ 
-Wl,-rpath,/not_backed_up/softwares/MPICH/lib 
-L/not_backed_up/softwares/MPICH/lib 
-Wl,-rpath,/usr/lib/gcc/x86_64-linux-gnu/4.8 
-L/usr/lib/gcc/x86_64-linux-gnu/4.8 -Wl,-rpath,/usr/lib/x86_64-linux-gnu 
-L/usr/lib/x86_64-linux-gnu -Wl,-rpath,/lib/x86_64-linux-gnu 
-L/lib/x86_64-linux-gnu -Wl,-rpath,/usr/lib/x86_64-linux-gnu 
-L/usr/lib/x86_64-linux-gnu -ldl -Wl,-rpath,/not_backed_up/softwares/MPICH/lib 
-lmpi -lgcc_s -ldl
-----------------------------------------

Re: [petsc-users] HPCToolKit/HPCViewer on OS X

Reply via email to