Hello,
I implemented the PETSc parallel linear solver in a program. The implementation
basically follows /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated
the MatMPIAIJ matrix and let PETSc partition it through MatGetOwnershipRange.
However, a few tests show that the parallel solver is always a little slower
than the serial solver (I have excluded the matrix-generation CPU time).
For the serial run I used PCILU as the preconditioner; for the parallel run I
used ASM with ILU(0) on each subblock (-sub_pc_type ilu -sub_ksp_type preonly
-ksp_type bcgs -pc_type asm). The number of unknowns is around 200,000.
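In case it helps, the matrix setup is roughly the following (a simplified
sketch of what the code does, following the ex2.c pattern; N is the global
size, d_nz/o_nz stand in for the actual preallocation counts, and CHKERRQ
error checking is omitted for brevity):

```c
#include <petscksp.h>

/* Sketch only: build the MPIAIJ matrix the way ex2.c does. */
static Mat build_matrix(PetscInt N, PetscInt d_nz, PetscInt o_nz)
{
  Mat      A;
  PetscInt i, rstart, rend;

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);
  MatSetFromOptions(A);
  /* preallocate for both the parallel and the serial case */
  MatMPIAIJSetPreallocation(A, d_nz, NULL, o_nz, NULL);
  MatSeqAIJSetPreallocation(A, d_nz, NULL);

  /* PETSc picks the row partition; each rank fills only its own rows */
  MatGetOwnershipRange(A, &rstart, &rend);
  for (i = rstart; i < rend; i++) {
    /* MatSetValues(A, 1, &i, ncols, cols, vals, INSERT_VALUES); */
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  return A;
}
```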
I have used -log_summary to print out the performance summaries, attached
below (log_summary_p1 for the serial run and log_summary_p2 for the run with 2
processes). It seems KSPSolve accounts for less than 20% of the global time (%T).
My questions are:
1. What is the bottleneck of the parallel run according to the summary?
2. Do you have any suggestions to improve the parallel performance?
Thanks a lot for your suggestions!
Regards,
Qin
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
/apps/pedev/bin/MODEL_2014.00.00.9918p.exe on a arch-64bit-with-hypre-release named hoepre1048 with 1 processor, by qlu Thu May 29 11:08:25 2014
Using Petsc Release Version 3.4.2, Jul, 02, 2013
Max Max/Min Avg Total
Time (sec): 1.060e+03 1.00000 1.060e+03
Objects: 9.453e+03 1.00000 9.453e+03
Flops: 2.498e+11 1.00000 2.498e+11 2.498e+11
Flops/sec: 2.357e+08 1.00000 2.357e+08 2.357e+08
MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Reductions: 2.538e+04 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flops
                          and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.0598e+03 100.0%  2.4983e+11 100.0%  0.000e+00   0.0%  0.000e+00      0.0%  2.538e+04 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
   %T - percent time in this phase         %f - percent flops in this phase
   %M - percent messages in this phase     %L - percent message lengths in this phase
   %R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
KSPSetUp          1349 1.0 2.7328e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve          1349 1.0 1.7646e+02 1.0 2.44e+11 1.0 0.0e+00 0.0e+00 1.6e+04 17 97  0  0 63  17 97  0  0 63  1380
PCSetUp           1349 1.0 2.6232e+01 1.0 6.31e+09 1.0 0.0e+00 0.0e+00 4.0e+03  2  3  0  0 16   2  3  0  0 16   241
PCApply          33197 1.0 8.9208e+01 1.0 8.59e+10 1.0 0.0e+00 0.0e+00 0.0e+00  8 34  0  0  0   8 34  0  0  0   963
VecDot           31848 1.0 4.8985e+00 1.0 1.36e+10 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  2768
VecDotNorm2      15924 1.0 3.1893e+00 1.0 1.36e+10 1.0 0.0e+00 0.0e+00 1.6e+04  0  5  0  0 63   0  5  0  0 63  4252
VecNorm          17273 1.0 9.3441e-01 1.0 7.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  7871
VecCopy           5396 1.0 9.6562e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet            4054 1.0 5.9545e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPBYCZ       31848 1.0 7.8011e+00 1.0 2.71e+10 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  3477
VecWAXPY         31848 1.0 6.8325e+00 1.0 1.36e+10 1.0 0.0e+00 0.0e+00 0.0e+00  1  5  0  0  0   1  5  0  0  0  1985
VecAssemblyBegin  2698 1.0 3.4332e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd    2698 1.0 1.2302e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMult          31848 1.0 6.1956e+01 1.0 8.24e+10 1.0 0.0e+00 0.0e+00 0.0e+00  6 33  0  0  0   6 33  0  0  0  1330
MatSolve         33197 1.0 8.9183e+01 1.0 8.59e+10 1.0 0.0e+00 0.0e+00 0.0e+00  8 34  0  0  0   8 34  0  0  0   963
MatLUFactorNum    1349 1.0 1.7988e+01 1.0 6.31e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2  3  0  0  0   2  3  0  0  0   351
MatILUFactorSym   1349 1.0 7.5739e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.3e+03  1  0  0  0  5   1  0  0  0  5     0
MatAssemblyBegin  2698 1.0 2.1935e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd    2698 1.0 7.8629e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
MatGetRowIJ       1349 1.0 2.0790e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering    1349 1.0 6.1149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.7e+03  0  0  0  0 11   0  0  0  0 11     0
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Krylov Solver 1 1 1152 0
Preconditioner 1 1 976 0
Vector 2705 2705 15926248 0
Matrix 2698 2698 28412036784 0
Index Set 4047 4047 1151927288 0
Viewer 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 0
#PETSc Option Table entries:
-log_summary
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
/apps/pedev/bin/MODEL_2014.00.00.9918p.exe on a arch-64bit-with-hypre-release named hoepre1085 with 2 processors, by qlu Thu May 29 11:05:04 2014
Using Petsc Release Version 3.4.2, Jul, 02, 2013
Max Max/Min Avg Total
Time (sec): 1.071e+03 1.00000 1.071e+03
Objects: 2.527e+04 1.00000 2.527e+04
Flops: 1.272e+11 1.03246 1.252e+11 2.505e+11
Flops/sec: 1.189e+08 1.03246 1.170e+08 2.340e+08
MPI Messages: 7.427e+04 1.00000 7.427e+04 1.485e+05
MPI Message Lengths: 1.466e+09 1.00000 1.973e+04 2.931e+09
MPI Reductions: 1.195e+05 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flops
                          and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.0705e+03 100.0%  2.5047e+11 100.0%  1.485e+05 100.0%  1.973e+04    100.0%  1.195e+05 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
   %T - percent time in this phase         %f - percent flops in this phase
   %M - percent messages in this phase     %L - percent message lengths in this phase
   %R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
KSPSetUp          2658 1.0 2.3267e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve          1329 1.0 1.4337e+02 1.0 1.27e+11 1.0 1.3e+05 1.1e+04 6.9e+04 13100 87 50 58  13100 87 50 58  1747
PCSetUp           2658 1.0 5.7349e+01 1.0 3.26e+09 1.1 1.3e+04 2.5e+04 1.3e+04  5  3  9 11 11   5  3  9 11 11   110
PCSetUpOnBlocks   1329 1.0 1.4165e+01 1.0 3.26e+09 1.1 0.0e+00 0.0e+00 4.0e+03  1  3  0  0  3   1  3  0  0  3   443
PCApply          33149 1.0 6.9017e+01 1.0 4.43e+10 1.0 6.6e+04 1.1e+04 0.0e+00  6 35 45 25  0   6 35 45 25  0  1257
MatMult          31820 1.0 4.3561e+01 1.0 4.21e+10 1.0 6.4e+04 1.1e+04 0.0e+00  4 33 43 24  0   4 33 43 24  0  1891
MatSolve         33149 1.0 5.2905e+01 1.0 4.43e+10 1.0 0.0e+00 0.0e+00 0.0e+00  5 35  0  0  0   5 35  0  0  0  1639
MatLUFactorNum    1329 1.0 9.1021e+00 1.0 3.26e+09 1.1 0.0e+00 0.0e+00 0.0e+00  1  3  0  0  0   1  3  0  0  0   690
MatILUFactorSym   1329 1.0 4.7301e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.3e+03  0  0  0  0  1   0  0  0  0  1     0
MatAssemblyBegin  2658 1.0 2.7476e-01 4.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.7e+03  0  0  0  0  2   0  0  0  0  2     0
MatAssemblyEnd    2658 1.0 1.4745e+01 1.0 0.00e+00 0.0 5.3e+03 2.8e+03 1.1e+04  1  0  4  1  9   1  0  4  1  9     0
MatGetRowIJ       1329 1.0 2.1029e-04 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetSubMatrice  1329 1.0 4.3235e+01 1.0 0.00e+00 0.0 1.3e+04 2.5e+04 9.3e+03  4  0  9 11  8   4  0  9 11  8     0
MatGetOrdering    1329 1.0 3.1114e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.7e+03  0  0  0  0  2   0  0  0  0  2     0
MatIncreaseOvrlp     1 1.0 4.4372e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecDot           31820 1.0 3.0673e+00 1.0 6.77e+09 1.0 0.0e+00 0.0e+00 3.2e+04  0  5  0  0 27   0  5  0  0 27  4417
VecDotNorm2      15910 1.0 1.8416e+00 1.1 6.77e+09 1.0 0.0e+00 0.0e+00 1.6e+04  0  5  0  0 13   0  5  0  0 13  7357
VecNorm          17239 1.0 1.0413e+00 1.6 3.67e+09 1.0 0.0e+00 0.0e+00 1.7e+04  0  3  0  0 14   0  3  0  0 14  7049
VecCopy           5316 1.0 6.2033e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet           74274 1.0 6.4018e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecAXPBYCZ       31820 1.0 6.4316e+00 1.1 1.35e+10 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  4213
VecWAXPY         31820 1.0 4.5442e+00 1.1 6.77e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  2982
VecAssemblyBegin  2658 1.0 9.4526e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+03  0  0  0  0  7   0  0  0  0  7     0
VecAssemblyEnd    2658 1.0 9.0122e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin  99447 1.0 1.0797e+01 1.0 0.00e+00 0.0 1.3e+05 2.0e+04 1.3e+03  1  0 87 88  1   1  0 87 88  1     0
VecScatterEnd    98118 1.0 3.7106e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Krylov Solver 2 2 2296 0
Preconditioner 2 2 1896 0
Matrix 6645 6645 254548117932 0
Vector 7984 6655 2297219696 0
Vector Scatter 2659 2659 2244404 0
Index Set 7977 7977 587320100 0
Viewer 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
Average time for MPI_Barrier(): 4.29153e-07
Average time for zero size MPI_Send(): 4.76837e-07
#PETSc Option Table entries:
-ksp_type bcgs
-log_summary
-pc_asm_type restrict
-pc_type asm
-sub_ksp_type preonly
-sub_pc_type ilu
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4