Hi Barry,
Thanks for your reply, I checkout to the git branch
barry/2023-09-15/fix-log-pcmpi but get some errors when configuring PETSc,
below is the error message,
=============================================================================================
Configuring PETSc to compile on your system
=============================================================================================
=============================================================================================
***** WARNING *****
Found environment variable: MAKEFLAGS=s -j14 --jobserver-auth=3,5. Ignoring
it! Use
"./configure MAKEFLAGS=$MAKEFLAGS" if you really want to use this value
=============================================================================================
=============================================================================================
Running configure on SOWING; this may take several minutes
=============================================================================================
*********************************************************************************************
UNABLE to CONFIGURE with GIVEN OPTIONS (see configure.log for
details):
---------------------------------------------------------------------------------------------
Error running configure on SOWING
*********************************************************************************************
My configuration is
./configure PETSC_ARCH=config-release --with-scalar-type=complex
--with-fortran-kernels=1 --with-debugging=0 COPTFLAGS=-O3 -march=native
CXXOPTFLAGS=-O3 -march=native FOPTFLAGS=-O3 -march=native --with-cxx=g++
--download-openmpi --download-superlu --download-opencascade
--with-openblas-include=${OPENBLAS_INC} --with-openblas-lib=${OPENBLAS_LIB}
--with-threadsafety --with-log=0 --with-openmp
I didn’t have this issue when I configured PETSc using tarball release download
version. Any suggestions on this?
Thanks and regards,
Yongzhong
From: Barry Smith <[email protected]>
Date: Saturday, April 27, 2024 at 12:54 PM
To: Yongzhong Li <[email protected]>
Cc: [email protected] <[email protected]>, [email protected]
<[email protected]>
Subject: Re: [petsc-maint] Inquiry about Multithreading Capabilities in PETSc's
KSPSolver
You should use the git branch barry/2023-09-15/fix-log-pcmpi It is still
work-in-progress but much better than what is currently in the main PETSc
branch.
By default, the MPI linear solver server requires 10,000 unknowns per MPI
process, so for smaller problems, it will only run on one MPI rank and list
Sequential in your output. In general you need on the order of at least
10,000 unknowns per MPI process to get good speedup. You can control it with
-mpi_linear_solver_server_minimum_count_per_rank <number_unknowns>
Regarding the report of 1 iteration, that is fixed in the branch listed above.
Barry
On Apr 26, 2024, at 5:11 PM, Yongzhong Li <[email protected]> wrote:
Hi Barry,
Thanks, I am interested in this PCMPI solution provided by PETSc!
I tried the src/ksp/ksp/tutorials/ex1.c which is configured in CMakelists as
follows:
./configure PETSC_ARCH=config-debug --with-scalar-type=complex
--with-fortran-kernels=1 --with-debugging=0 --with-logging=0 --with-cxx=g++
--download-mpich --download-superlu --download-opencascade
--with-openblas-include=${OPENBLAS_INC} --with-openblas-lib=${OPENBLAS_LIB}
In the linux terminal, my bash script is as follows,
mpiexec -n 4 ./ex1 -mpi_linear_solver_server -mpi_linear_solver_server_view
However, I found the ouput a bit strange
Norm of error 1.23629e-15, Iterations 1
MPI linear solver server statistics:
Ranks KSPSolve()s Mats KSPs Avg. Size Avg. Its
Sequential 3 2 10 1
Norm of error 1.23629e-15, Iterations 1
MPI linear solver server statistics:
Ranks KSPSolve()s Mats KSPs Avg. Size Avg. Its
Sequential 3 2 10 1
Norm of error 1.23629e-15, Iterations 1
MPI linear solver server statistics:
Ranks KSPSolve()s Mats KSPs Avg. Size Avg. Its
Sequential 3 2 10 1
Norm of error 1.23629e-15, Iterations 1
MPI linear solver server statistics:
Ranks KSPSolve()s Mats KSPs Avg. Size Avg. Its
Sequential 3 2 10 1
It seems that mpi started four processes, but they all did the same things, and
I am confused why the ranks showed sequential. Are these supposed to be the
desired output when the mpi_linear_solver_server is turned on?
And if I run ex1 without any hypen options, I got
Norm of error 2.47258e-15, Iterations 5
It looks like the KSPSolver use 5 iterations to reach convergence, but why when
mpi_linear_solver_server is enabled, it uses 1?
I hope to get some help on these issues, thank you!
Sincerely,
Yongzhong
From: Barry Smith <[email protected]<mailto:[email protected]>>
Date: Tuesday, April 23, 2024 at 5:15 PM
To: Yongzhong Li
<[email protected]<mailto:[email protected]>>
Cc: [email protected]<mailto:[email protected]>
<[email protected]<mailto:[email protected]>>,
[email protected]<mailto:[email protected]>
<[email protected]<mailto:[email protected]>>, Piero Triverio
<[email protected]<mailto:[email protected]>>
Subject: Re: [petsc-maint] Inquiry about Multithreading Capabilities in PETSc's
KSPSolver
Yes, only the routines that can explicitly use BLAS have multi-threading.
PETSc does support using nay MPI linear solvers from a sequential (or
OpenMP) main program using the
https://urldefense.us/v3/__https://petsc.org/release/manualpages/PC/PCMPI/*pcmpi__;Iw!!G_uCfscf7eWS!eirB1uKnj4uFPCED9aOb4X-qap5UjyksRvLvCOc0fmBBwdAgkeUKqDhF8C3Eq10bLaeGN5DDRvUKFmIh7NSFmVqNOdnLr9b-E48$
construct. I am finishing up better support in the branch
barry/2023-09-15/fix-log-pcmpi
Barry
On Apr 23, 2024, at 3:59 PM, Yongzhong Li
<[email protected]<mailto:[email protected]>> wrote:
Thanks Barry! Does this mean that the sparse matrix-vector products, which
actually constitute the majority of the computations in my GMRES routine in
PETSc, don’t utilize multithreading? Only basic vector operations such as
VecAXPY and VecDot or dense matrix operations in PETSc will benefit from
multithreading, is it correct?
Best,
Yongzhong
From: Barry Smith <[email protected]<mailto:[email protected]>>
Date: Tuesday, April 23, 2024 at 3:35 PM
To: Yongzhong Li
<[email protected]<mailto:[email protected]>>
Cc: [email protected]<mailto:[email protected]>
<[email protected]<mailto:[email protected]>>,
[email protected]<mailto:[email protected]>
<[email protected]<mailto:[email protected]>>, Piero Triverio
<[email protected]<mailto:[email protected]>>
Subject: Re: [petsc-maint] Inquiry about Multithreading Capabilities in PETSc's
KSPSolver
你通常不会收到来自 [email protected]<mailto:[email protected]>
的电子邮件。了解这一点为什么很重要<https://urldefense.us/v3/__https://aka.ms/LearnAboutSenderIdentification__;!!G_uCfscf7eWS!eirB1uKnj4uFPCED9aOb4X-qap5UjyksRvLvCOc0fmBBwdAgkeUKqDhF8C3Eq10bLaeGN5DDRvUKFmIh7NSFmVqNOdnLJKKkQH8$
>
Intel MKL or OpenBLAS are the best bet, but for vector operations they will
not be significant since the vector operations do not dominate the computations.
On Apr 23, 2024, at 3:23 PM, Yongzhong Li
<[email protected]<mailto:[email protected]>> wrote:
Hi Barry,
Thank you for the information provided!
Do you think different BLAS implementation will affect the multithreading
performance of some vectors operations in GMERS in PETSc?
I am now using OpenBLAS but didn’t see much improvement when theb
multithreading are enabled, do you think other implementation such as netlib
and intel-mkl will help?
Best,
Yongzhong
From: Barry Smith <[email protected]<mailto:[email protected]>>
Date: Monday, April 22, 2024 at 4:20 PM
To: Yongzhong Li
<[email protected]<mailto:[email protected]>>
Cc: [email protected]<mailto:[email protected]>
<[email protected]<mailto:[email protected]>>,
[email protected]<mailto:[email protected]>
<[email protected]<mailto:[email protected]>>, Piero Triverio
<[email protected]<mailto:[email protected]>>
Subject: Re: [petsc-maint] Inquiry about Multithreading Capabilities in PETSc's
KSPSolver
你通常不会收到来自 [email protected]<mailto:[email protected]>
的电子邮件。了解这一点为什么很重要<https://urldefense.us/v3/__https://aka.ms/LearnAboutSenderIdentification__;!!G_uCfscf7eWS!eirB1uKnj4uFPCED9aOb4X-qap5UjyksRvLvCOc0fmBBwdAgkeUKqDhF8C3Eq10bLaeGN5DDRvUKFmIh7NSFmVqNOdnLJKKkQH8$
>
PETSc provided solvers do not directly use threads.
The BLAS used by LAPACK and PETSc may use threads depending on what BLAS is
being used and how it was configured.
Some of the vector operations in GMRES in PETSc use BLAS that can use
threads, including axpy, dot, etc. For sufficiently large problems, the use of
threaded BLAS can help with these routines, but not significantly for the
solver.
Dense matrix-vector products MatMult() and dense matrix direct solvers PCLU
use BLAS and thus can benefit from threading. The benefit can be significant
for large enough problems with good hardware, especially with PCLU.
If you run with -blas_view PETSc tries to indicate information about the
threading of BLAS. You can also use -blas_num_threads <n> to set the number of
threads, equivalent to setting the environmental variable. For dense solvers
you can vary the number of threads and run with -log_view to see what it helps
to improve and what it does not effect.
On Apr 22, 2024, at 4:06 PM, Yongzhong Li
<[email protected]<mailto:[email protected]>> wrote:
This Message Is From an External Sender
This message came from outside your organization.
Hello all,
I am writing to ask if PETSc’s KSPSolver makes use of OpenMP/multithreading,
specifically when performing iterative solutions with the GMRES algorithm.
The questions appeared when I was running a large numerical program based on
boundary element method. I used the PETSc's GMRES algorithm in KSPSolve to
solve a shell matrix system iteratively. I observed that threads were being
utilized, controlled by the OPENBLAS_NUM_THREADS environment variable. However,
I noticed no significant performance difference between running the solver with
multiple threads versus a single thread.
Could you please confirm if GMRES in KSPSolve leverages multithreading, and
also whether it is influenced by the multithreadings of the low-level math
libraries such as BLAS and LAPACK? If so, how can I enable multithreading
effectively to see noticeable improvements in solution times when using GMRES?
If not, why do I observe that threads are being used during the GMERS solutions?
For reference, I am using PETSc version 3.16.0, configured in CMakelists as
follows:
./configure PETSC_ARCH=config-release --with-scalar-type=complex
--with-fortran-kernels=1 --with-debugging=0 COPTFLAGS=-O3 -march=native
CXXOPTFLAGS=-O3 -march=native FOPTFLAGS=-O3 -march=native --with-cxx=g++
--download-openmpi --download-superlu --download-opencascade
--with-openblas-include=${OPENBLAS_INC} --with-openblas-lib=${OPENBLAS_LIB}
--with-threadsafety --with-log=0 --with-openmp
To simplify the diagnosis of potential issues, I have also written a small
example program using GMRES to solve a sparse matrix system derived from a 2D
Poisson problem using the finite difference method. I found similar issues on
this piece of codes. The code is as follows:
#include <petscksp.h>
/* Monitor function to print iteration number and residual norm */
PetscErrorCode MyKSPMonitor(KSP ksp, PetscInt n, PetscReal rnorm, void *ctx) {
PetscErrorCode ierr;
ierr = PetscPrintf(PETSC_COMM_WORLD, "Iteration %D, Residual norm %g\n", n,
(double)rnorm);
CHKERRQ(ierr);
return 0;
}
int main(int argc, char **args) {
Vec x, b, x_true, e;
Mat A;
KSP ksp;
PetscErrorCode ierr;
PetscInt i, j, Ii, J, n = 500; // Size of the grid n x n
PetscInt Istart, Iend, ncols;
PetscScalar v;
PetscMPIInt rank;
PetscInitialize(&argc, &args, NULL, NULL);
PetscLogDouble t1, t2; // Variables for timing
MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
// Create vectors and matrix
ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, n*n, &x); CHKERRQ(ierr);
ierr = VecDuplicate(x, &b); CHKERRQ(ierr);
ierr = VecDuplicate(x, &x_true); CHKERRQ(ierr);
// Set true solution as all ones
ierr = VecSet(x_true, 1.0); CHKERRQ(ierr);
// Create and assemble matrix A for the 2D Laplacian using 5-point stencil
ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n*n, n*n); CHKERRQ(ierr);
ierr = MatSetFromOptions(A); CHKERRQ(ierr);
ierr = MatSetUp(A); CHKERRQ(ierr);
ierr = MatGetOwnershipRange(A, &Istart, &Iend); CHKERRQ(ierr);
for (Ii = Istart; Ii < Iend; Ii++) {
i = Ii / n; // Row index
j = Ii % n; // Column index
v = -4.0;
ierr = MatSetValues(A, 1, &Ii, 1, &Ii, &v, INSERT_VALUES);
CHKERRQ(ierr);
if (i > 0) { // South
J = Ii - n;
v = 1.0;
ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES);
CHKERRQ(ierr);
}
if (i < n - 1) { // North
J = Ii + n;
v = 1.0;
ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES);
CHKERRQ(ierr);
}
if (j > 0) { // West
J = Ii - 1;
v = 1.0;
ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES);
CHKERRQ(ierr);
}
if (j < n - 1) { // East
J = Ii + 1;
v = 1.0;
ierr = MatSetValues(A, 1, &Ii, 1, &J, &v, INSERT_VALUES);
CHKERRQ(ierr);
}
}
ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
// Compute the RHS corresponding to the true solution
ierr = MatMult(A, x_true, b); CHKERRQ(ierr);
// Set up and solve the linear system
ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
ierr = KSPSetType(ksp, KSPGMRES); CHKERRQ(ierr);
ierr = KSPSetTolerances(ksp, 1e-5, PETSC_DEFAULT, PETSC_DEFAULT,
PETSC_DEFAULT); CHKERRQ(ierr);
/* Set up the monitor */
ierr = KSPMonitorSet(ksp, MyKSPMonitor, NULL, NULL); CHKERRQ(ierr);
// Start timing
PetscTime(&t1);
ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
// Stop timing
PetscTime(&t2);
// Compute error
ierr = VecDuplicate(x, &e); CHKERRQ(ierr);
ierr = VecWAXPY(e, -1.0, x_true, x); CHKERRQ(ierr);
PetscReal norm_error, norm_true;
ierr = VecNorm(e, NORM_2, &norm_error); CHKERRQ(ierr);
ierr = VecNorm(x_true, NORM_2, &norm_true); CHKERRQ(ierr);
PetscReal relative_error = norm_error / norm_true;
if (rank == 0) { // Print only from the first MPI process
PetscPrintf(PETSC_COMM_WORLD, "Relative error ||x - x_true||_2 /
||x_true||_2: %g\n", (double)relative_error);
}
// Output the wall time taken for MatMult
PetscPrintf(PETSC_COMM_WORLD, "Time taken for KSPSolve: %f seconds\n", t2 -
t1);
// Cleanup
ierr = VecDestroy(&x); CHKERRQ(ierr);
ierr = VecDestroy(&b); CHKERRQ(ierr);
ierr = VecDestroy(&x_true); CHKERRQ(ierr);
ierr = VecDestroy(&e); CHKERRQ(ierr);
ierr = MatDestroy(&A); CHKERRQ(ierr);
ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
PetscFinalize();
return 0;
}
Here are some profiling results for GMERS solution.
OPENBLAS_NUM_THREADS = 1, iteration steps = 859, solution time = 16.1
OPENBLAS_NUM_THREADS = 2, iteration steps = 859, solution time = 16.3
OPENBLAS_NUM_THREADS = 4, iteration steps = 859, solution time = 16.7
OPENBLAS_NUM_THREADS = 8, iteration steps = 859, solution time = 16.8
OPENBLAS_NUM_THREADS = 16, iteration steps = 859, solution time = 17.8
I am using one workstation with Intel® Core™ i9-11900K Processor, 8 cores, 16
threads. Note that I am not using multiple MPI processes, such as
mpirun/mpiexec, the default number of MPI processes should be 1, correct if I
am wrong.
Thank you in advance!
Sincerely,
Yongzhong
-----------------------------------------------------------
Yongzhong Li
PhD student | Electromagnetics Group
Department of Electrical & Computer Engineering
University of Toronto
https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!eirB1uKnj4uFPCED9aOb4X-qap5UjyksRvLvCOc0fmBBwdAgkeUKqDhF8C3Eq10bLaeGN5DDRvUKFmIh7NSFmVqNOdnLTC2sULA$
<https://urldefense.us/v3/__http://www.modelics.org__;!!G_uCfscf7eWS!efVv_hPkRBEhyquXer2c8sFeGrjOtTjEGicYg2niCyfT9swzjLFyf6k4XrhKElaF-cX_Q02y9KSTRNFHPlKhXMtuzaTekCWcXgw$>