Hello,
I implemented the PETSc parallel linear solver in a program. The implementation
basically follows /src/ksp/ksp/examples/tutorials/ex2.c, i.e., I preallocated
the MatMPIAIJ matrix and let PETSc partition it through MatGetOwnershipRange.
However, a few tests show that the parallel solver is always a little slower
than the serial solver (I have excluded the matrix-generation CPU time).
For the serial run I used PCILU as the preconditioner; for the parallel run I
used ASM with ILU(0) on each subblock (-sub_pc_type ilu -sub_ksp_type preonly
-ksp_type bcgs -pc_type asm). The number of unknowns is around 200,000.
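In case it helps, the matrix setup is roughly the following (a simplified
sketch of what the code does, following the ex2.c pattern; N is the global
size, d_nz/o_nz stand in for the actual preallocation counts, and CHKERRQ
error checking is omitted for brevity):

```c
#include <petscksp.h>

/* Sketch only: build the MPIAIJ matrix the way ex2.c does. */
static Mat build_matrix(PetscInt N, PetscInt d_nz, PetscInt o_nz)
{
  Mat      A;
  PetscInt i, rstart, rend;

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);
  MatSetFromOptions(A);
  /* preallocate for both the parallel and the serial case */
  MatMPIAIJSetPreallocation(A, d_nz, NULL, o_nz, NULL);
  MatSeqAIJSetPreallocation(A, d_nz, NULL);

  /* PETSc picks the row partition; each rank fills only its own rows */
  MatGetOwnershipRange(A, &rstart, &rend);
  for (i = rstart; i < rend; i++) {
    /* MatSetValues(A, 1, &i, ncols, cols, vals, INSERT_VALUES); */
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
  return A;
}
```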
I have used -log_summary to print out the performance summaries, attached
below (log_summary_p1 for the serial run and log_summary_p2 for the run with 2
processes). It seems KSPSolve accounts for less than 20% of the global time (%T).
My questions are:
1. What is the bottleneck of the parallel run according to the summary?
2. Do you have any suggestions to improve the parallel performance?
Thanks a lot for your suggestions!
Regards,
Qin
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
/apps/pedev/bin/MODEL_2014.00.00.9918p.exe on a arch-64bit-with-hypre-release named hoepre1048 with 1 processor, by qlu Thu May 29 11:08:25 2014
Using Petsc Release Version 3.4.2, Jul, 02, 2013
Max Max/Min Avg Total
Time (sec): 1.060e+03 1.00000 1.060e+03
Objects: 9.453e+03 1.00000 9.453e+03
Flops: 2.498e+11 1.00000 2.498e+11 2.498e+11
Flops/sec: 2.357e+08 1.00000 2.357e+08 2.357e+08
MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Reductions: 2.538e+04 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flops
                          and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.0598e+03 100.0%  2.4983e+11 100.0%  0.000e+00   0.0%  0.000e+00      0.0%  2.538e+04 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
   %T - percent time in this phase         %f - percent flops in this phase
   %M - percent messages in this phase     %L - percent message lengths in this phase
   %R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
KSPSetUp          1349 1.0 2.7328e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve          1349 1.0 1.7646e+02 1.0 2.44e+11 1.0 0.0e+00 0.0e+00 1.6e+04 17 97  0  0 63  17 97  0  0 63  1380
PCSetUp           1349 1.0 2.6232e+01 1.0 6.31e+09 1.0 0.0e+00 0.0e+00 4.0e+03  2  3  0  0 16   2  3  0  0 16   241
PCApply          33197 1.0 8.9208e+01 1.0 8.59e+10 1.0 0.0e+00 0.0e+00 0.0e+00  8 34  0  0  0   8 34  0  0  0   963
VecDot           31848 1.0 4.8985e+00 1.0 1.36e+10 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  2768
VecDotNorm2      15924 1.0 3.1893e+00 1.0 1.36e+10 1.0 0.0e+00 0.0e+00 1.6e+04  0  5  0  0 63   0  5  0  0 63  4252
VecNorm          17273 1.0 9.3441e-01 1.0 7.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  3  0  0  0   0  3  0  0  0  7871
VecCopy           5396 1.0 9.6562e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet            4054 1.0 5.9545e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPBYCZ       31848 1.0 7.8011e+00 1.0 2.71e+10 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  3477
VecWAXPY         31848 1.0 6.8325e+00 1.0 1.36e+10 1.0 0.0e+00 0.0e+00 0.0e+00  1  5  0  0  0   1  5  0  0  0  1985
VecAssemblyBegin  2698 1.0 3.4332e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAssemblyEnd    2698 1.0 1.2302e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMult          31848 1.0 6.1956e+01 1.0 8.24e+10 1.0 0.0e+00 0.0e+00 0.0e+00  6 33  0  0  0   6 33  0  0  0  1330
MatSolve         33197 1.0 8.9183e+01 1.0 8.59e+10 1.0 0.0e+00 0.0e+00 0.0e+00  8 34  0  0  0   8 34  0  0  0   963
MatLUFactorNum    1349 1.0 1.7988e+01 1.0 6.31e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2  3  0  0  0   2  3  0  0  0   351
MatILUFactorSym   1349 1.0 7.5739e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.3e+03  1  0  0  0  5   1  0  0  0  5     0
MatAssemblyBegin  2698 1.0 2.1935e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd    2698 1.0 7.8629e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
MatGetRowIJ       1349 1.0 2.0790e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering    1349 1.0 6.1149e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.7e+03  0  0  0  0 11   0  0  0  0 11     0
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Krylov Solver 1 1 1152 0
Preconditioner 1 1 976 0
Vector 2705 2705 15926248 0
Matrix 2698 2698 28412036784 0
Index Set 4047 4047 1151927288 0
Viewer 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 0
#PETSc Option Table entries:
-log_summary
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
/apps/pedev/bin/MODEL_2014.00.00.9918p.exe on a arch-64bit-with-hypre-release named hoepre1085 with 2 processors, by qlu Thu May 29 11:05:04 2014
Using Petsc Release Version 3.4.2, Jul, 02, 2013
Max Max/Min Avg Total
Time (sec): 1.071e+03 1.00000 1.071e+03
Objects: 2.527e+04 1.00000 2.527e+04
Flops: 1.272e+11 1.03246 1.252e+11 2.505e+11
Flops/sec: 1.189e+08 1.03246 1.170e+08 2.340e+08
MPI Messages: 7.427e+04 1.00000 7.427e+04 1.485e+05
MPI Message Lengths: 1.466e+09 1.00000 1.973e+04 2.931e+09
MPI Reductions: 1.195e+05 1.00000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flops
                          and VecAXPY() for complex vectors of length N --> 8N flops
Summary of Stages:   ----- Time ------  ----- Flops -----  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total   counts   %Total     Avg         %Total   counts   %Total
 0:      Main Stage: 1.0705e+03 100.0%  2.5047e+11 100.0%  1.485e+05 100.0%  1.973e+04    100.0%  1.195e+05 100.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
   %T - percent time in this phase         %f - percent flops in this phase
   %M - percent messages in this phase     %L - percent message lengths in this phase
   %R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %f %M %L %R  %T %f %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
KSPSetUp          2658 1.0 2.3267e-03 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve          1329 1.0 1.4337e+02 1.0 1.27e+11 1.0 1.3e+05 1.1e+04 6.9e+04 13100 87 50 58  13100 87 50 58  1747
PCSetUp           2658 1.0 5.7349e+01 1.0 3.26e+09 1.1 1.3e+04 2.5e+04 1.3e+04  5  3  9 11 11   5  3  9 11 11   110
PCSetUpOnBlocks   1329 1.0 1.4165e+01 1.0 3.26e+09 1.1 0.0e+00 0.0e+00 4.0e+03  1  3  0  0  3   1  3  0  0  3   443
PCApply          33149 1.0 6.9017e+01 1.0 4.43e+10 1.0 6.6e+04 1.1e+04 0.0e+00  6 35 45 25  0   6 35 45 25  0  1257
MatMult          31820 1.0 4.3561e+01 1.0 4.21e+10 1.0 6.4e+04 1.1e+04 0.0e+00  4 33 43 24  0   4 33 43 24  0  1891
MatSolve         33149 1.0 5.2905e+01 1.0 4.43e+10 1.0 0.0e+00 0.0e+00 0.0e+00  5 35  0  0  0   5 35  0  0  0  1639
MatLUFactorNum    1329 1.0 9.1021e+00 1.0 3.26e+09 1.1 0.0e+00 0.0e+00 0.0e+00  1  3  0  0  0   1  3  0  0  0   690
MatILUFactorSym   1329 1.0 4.7301e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.3e+03  0  0  0  0  1   0  0  0  0  1     0
MatAssemblyBegin  2658 1.0 2.7476e-01 4.4 0.00e+00 0.0 0.0e+00 0.0e+00 2.7e+03  0  0  0  0  2   0  0  0  0  2     0
MatAssemblyEnd    2658 1.0 1.4745e+01 1.0 0.00e+00 0.0 5.3e+03 2.8e+03 1.1e+04  1  0  4  1  9   1  0  4  1  9     0
MatGetRowIJ       1329 1.0 2.1029e-04 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetSubMatrice  1329 1.0 4.3235e+01 1.0 0.00e+00 0.0 1.3e+04 2.5e+04 9.3e+03  4  0  9 11  8   4  0  9 11  8     0
MatGetOrdering    1329 1.0 3.1114e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.7e+03  0  0  0  0  2   0  0  0  0  2     0
MatIncreaseOvrlp     1 1.0 4.4372e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecDot           31820 1.0 3.0673e+00 1.0 6.77e+09 1.0 0.0e+00 0.0e+00 3.2e+04  0  5  0  0 27   0  5  0  0 27  4417
VecDotNorm2      15910 1.0 1.8416e+00 1.1 6.77e+09 1.0 0.0e+00 0.0e+00 1.6e+04  0  5  0  0 13   0  5  0  0 13  7357
VecNorm          17239 1.0 1.0413e+00 1.6 3.67e+09 1.0 0.0e+00 0.0e+00 1.7e+04  0  3  0  0 14   0  3  0  0 14  7049
VecCopy           5316 1.0 6.2033e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet           74274 1.0 6.4018e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0
VecAXPBYCZ       31820 1.0 6.4316e+00 1.1 1.35e+10 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   1 11  0  0  0  4213
VecWAXPY         31820 1.0 4.5442e+00 1.1 6.77e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  5  0  0  0   0  5  0  0  0  2982
VecAssemblyBegin  2658 1.0 9.4526e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+03  0  0  0  0  7   0  0  0  0  7     0
VecAssemblyEnd    2658 1.0 9.0122e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecScatterBegin  99447 1.0 1.0797e+01 1.0 0.00e+00 0.0 1.3e+05 2.0e+04 1.3e+03  1  0 87 88  1   1  0 87 88  1     0
VecScatterEnd    98118 1.0 3.7106e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Krylov Solver 2 2 2296 0
Preconditioner 2 2 1896 0
Matrix 6645 6645 254548117932 0
Vector 7984 6655 2297219696 0
Vector Scatter 2659 2659 2244404 0
Index Set 7977 7977 587320100 0
Viewer 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
Average time for MPI_Barrier(): 4.29153e-07
Average time for zero size MPI_Send(): 4.76837e-07
#PETSc Option Table entries:
-ksp_type bcgs
-log_summary
-pc_asm_type restrict
-pc_type asm
-sub_ksp_type preonly
-sub_pc_type ilu
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4