Hi Danyang,
> This does not make any difference. I have scaled up the matrix but the
> performance does not change. If I run with OpenMP, the iteration number
> is always the same no matter how many processors are used. This seems
> quite strange, as the iteration number usually increases as the number of
> processors increases when running with MPI. I think I should move to the
> Ubuntu system for further tests, to see if this is a Windows problem.
OpenMP and MPI are two different parallelization approaches:
- With MPI, we split the system matrix into strips, where each strip is
assigned to one MPI process. This leads (among other things) to
block-Jacobi preconditioner techniques, for which you usually see an
increase in iteration counts. In the ex2 case, however, this even leads
to a reduction in iteration counts.
- With OpenMP, the system matrix is contiguous in memory, so one still
computes preconditioners for the full matrix (as is, for example, the case
with ILU). Thus, the use of OpenMP is transparent with respect to the
algorithms employed, so you don't see any change in iteration counts
(see the runs below).
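As a quick check, your own commands with -ksp_monitor added (which prints
the residual norm at every Krylov iteration) should show the iteration
count changing with the number of MPI processes, but staying the same for
any number of OpenMP threads:

mpiexec -n 4 Petsc-windows-ex2f.exe -m 100 -n 100 -ksp_monitor
Petsc-windows-ex2f.exe -threadcomm_type openmp -threadcomm_nthreads 4 -m 100 -n 100 -ksp_monitor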
The typical vector operations like VecScale() (should) make use of
OpenMP, but apparently this is not the case here. I'm double-checking on
my machine (Linux Mint Maya, based on Ubuntu 12.04 LTS) and will let you
know.
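In case you want to poke at this yourself in the meantime, here is a
minimal sketch (the vector length and the repetition count are arbitrary
choices of mine) that isolates VecScale() so that its row in -log_summary
is easy to read:

static char help[] = "Calls VecScale() repeatedly to expose threading behavior in -log_summary.\n";

#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x;
  PetscInt       i, n = 10000000;  /* large enough that one VecScale() is not sub-millisecond */
  PetscErrorCode ierr;

  PetscInitialize(&argc, &argv, (char*)0, help);
  ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr);
  ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr);
  ierr = VecSetFromOptions(x);CHKERRQ(ierr);   /* allow the vector type to be set from the command line */
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  for (i = 0; i < 100; i++) {                  /* 100 repetitions, again an arbitrary choice */
    ierr = VecScale(x, 1.0000001);CHKERRQ(ierr);
  }
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  PetscFinalize();
  return 0;
}

Running this once with and once without -threadcomm_type openmp
-threadcomm_nthreads 4 (plus -log_summary) and comparing the VecScale row
should tell us quickly whether the threads are used at all.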
Best regards,
Karli
On 04/11/2013 6:51 AM, Karl Rupp wrote:
Hi,
> I have a question on the speedup of PETSc when using OpenMP. I can get
> good speedup when using MPI, but no speedup when using OpenMP.
> The example is ex2f with m=100 and n=100. The number of available
> processors is 16 (32 threads) and the OS is Windows Server 2012. The log
> files for 4 and 8 processors are attached.
> The commands I used to run with 4 processors are as follows:
> Run using MPI:
> mpiexec -n 4 Petsc-windows-ex2f.exe -m 100 -n 100 -log_summary log_100x100_mpi_p4.log
> Run using OpenMP:
> Petsc-windows-ex2f.exe -threadcomm_type openmp -threadcomm_nthreads 4 -m 100 -n 100 -log_summary log_100x100_openmp_p4.log
> The PETSc used for this test is PETSc for Windows
> (http://www.mic-tc.ch/downloads/PETScForWindows.zip), but I guess this is
> not the problem, because the same problem exists when I use PETSc-dev in
> Cygwin. I don't know whether this problem exists on Linux; could anybody
> help to test?
For the 100x100 case considered, the execution times per call are
somewhere in the millisecond to sub-millisecond range (e.g. 1.3 ms for
68 calls to VecScale with 4 processors). I'd say this is too small to
see any reasonable performance gain when running multiple threads;
consider problem sizes of about 1000x1000 instead.
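For example, simply scale up your own OpenMP run (the log file name below
is just adapted from yours):

Petsc-windows-ex2f.exe -threadcomm_type openmp -threadcomm_nthreads 4 -m 1000 -n 1000 -log_summary log_1000x1000_openmp_p4.log

and likewise for the MPI run, so that each vector and matrix operation is
long enough to amortize the threading overhead.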
Moreover, keep in mind that you typically won't get perfectly linear
scaling with the number of processor cores, because ultimately the
memory bandwidth is the limiting factor for standard vector operations.
Best regards,
Karli