Hello.
I am sorry for the long delay in responding. I decided to rewrite my
application in a different way and will send the -log_summary output when
the reimplementation is done.
As for the machine, I am using mpirun to run jobs on an 8-node cluster.
I modified the makefile in the streams folder so it would run using my
hostfile.
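My runs use a command of roughly the following form (the application and
hostfile names are just placeholders here, and the exact hostfile option
depends on the MPI implementation):

  mpirun -np 8 --hostfile hosts.txt ./myapp -log_summary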
The output is attached to this email. It seems reasonable for a cluster
with 8 machines. According to "lscpu", each machine's CPU has 4 cores and
1 socket.
Cheers,
Nelson
On 2015-07-24 16:50, Barry Smith wrote:
It would be very helpful if you ran the code on say 1, 2, 4, 8, 16
... processes with the option -log_summary and sent (as attachments)
the log summary information.
Also on the same machine run the streams benchmark; with recent
releases of PETSc you only need to do
cd $PETSC_DIR
make streams NPMAX=16 (or whatever your largest process count is)
and send the output.
I suspect that you are doing everything fine and it is more an issue
with the configuration of your machine. Also read the information at
http://www.mcs.anl.gov/petsc/documentation/faq.html#computers on
"binding"
Barry
On Jul 24, 2015, at 10:41 AM, Nelson Filipe Lopes da Silva
<[email protected]> wrote:
Hello,
I have been using PETSc for a few months now, and it truly is a
fantastic piece of software.
In my particular example I am working with a large, sparse, distributed
(MPIAIJ) matrix that we can refer to as 'G'.
G is a wide rectangular matrix (for example, 1.1 million rows by 2.1
million columns). This matrix is typically very sparse and not diagonally
'heavy' (for example, 5.2 million nonzeros, of which ~50% are in the
diagonal block of the MPIAIJ representation).
To work with this matrix, I also have a few parallel vectors (created
using MatCreateVecs), which we can refer to as 'm' and 'k'.
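The setup looks roughly like the sketch below (the global sizes and the
preallocation numbers are only illustrative, and error checking is
omitted):

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat      G;
  Vec      m, k, b;
  PetscInt nrows = 1100000, ncols = 2100000;   /* illustrative global sizes */

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* distributed sparse matrix (MPIAIJ when run on more than one process);
     the 3/2 nonzeros-per-row preallocation is only a rough guess */
  MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, nrows, ncols,
               3, NULL, 2, NULL, &G);
  /* ... fill with MatSetValues(), then ... */
  MatAssemblyBegin(G, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(G, MAT_FINAL_ASSEMBLY);

  /* vectors with layouts compatible with G: m matches the columns, k the rows */
  MatCreateVecs(G, &m, &k);
  VecDuplicate(k, &b);                         /* b = G*m + k lives with k */

  /* ... algorithm ... */

  VecDestroy(&m); VecDestroy(&k); VecDestroy(&b);
  MatDestroy(&G);
  PetscFinalize();
  return 0;
}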
I am trying to parallelize an iterative algorithm in which the most
computationally heavy operations are:
-> Matrix-vector multiplication, more precisely G * m + k = b
(MatMultAdd). From what I have been reading, to achieve a good speedup in
this operation G should be as diagonal as possible, so that communication
can overlap with computation. But even when using a G matrix in which the
diagonal block has ~95% of the nonzeros, I cannot get a decent speedup.
Most of the time, the performance even gets worse.
-> Matrix-matrix multiplication; in this case I need to compute
G * G' = A, where G' is the transpose of G and A is later used in the
linear solver. The speedup in this operation is not as bad, although it
is still not very good.
-> Linear system solving. Lastly, in this operation I solve "Ax = b"
using the A and b from the previous two operations. I tried to apply an
RCM permutation to A to make it more diagonal, for better performance.
However, the problem I faced was that the permutation is performed
locally on each processor, and thus the final result differs with
different numbers of processors. I assume this was intended to reduce
communication. The solution I found was:
1 - calculate A;
2 - calculate, locally on one machine, the RCM permutation IS from A
(sketched below);
3 - apply this permutation to the rows of G.
This works well, and A is generated as if it had been RCM permuted. It
is fine to do this operation on one machine because it is only done once,
while reading the input. However, the nonzeros of G become more spread
out and less diagonal, causing problems when calculating G * m + k = b.
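Step 2 looks roughly like the sketch below (it assumes Aseq is a
sequential, one-rank copy of A; how A is gathered onto one machine is not
shown, and error checking is omitted):

#include <petscmat.h>

/* Compute an RCM ordering of a sequential copy of A and return the row
   permutation IS, which is then applied to the rows of G while reading
   the input. */
static PetscErrorCode ComputeRCMRowPermutation(Mat Aseq, IS *rowperm)
{
  IS colperm;
  MatGetOrdering(Aseq, MATORDERINGRCM, rowperm, &colperm);
  ISDestroy(&colperm);           /* only the row ordering is needed here */
  return 0;
}

Since A = G * G', permuting the rows of G with this index set gives the
same A as applying the symmetric RCM permutation to A itself,
independently of the number of processes.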
These 3 operations (except the permutation) are performed in each
iteration of my algorithm; a rough sketch of the corresponding calls is
below.
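In code, one iteration looks roughly like the sketch below (variable
names are only illustrative, error checking is omitted, and the transpose
is formed explicitly as one way to obtain G * G'):

#include <petscksp.h>

/* one iteration, assuming G, m, k, b and x already exist with compatible
   layouts */
static PetscErrorCode OneIteration(Mat G, Vec m, Vec k, Vec b, Vec x)
{
  Mat Gt, A;
  KSP ksp;

  /* 1) b = G*m + k */
  MatMultAdd(G, m, k, b);

  /* 2) A = G*G', via an explicit transpose followed by MatMatMult */
  MatTranspose(G, MAT_INITIAL_MATRIX, &Gt);
  MatMatMult(G, Gt, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &A);

  /* 3) solve A x = b */
  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);        /* solver/preconditioner chosen at run time */
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp);
  MatDestroy(&Gt);
  MatDestroy(&A);
  return 0;
}

In the real code the transpose, A and the KSP would normally be reused
across iterations (MAT_REUSE_MATRIX and a single KSPCreate) rather than
rebuilt every time.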
So, my questions are:
- What are the characteristics of G that lead to a good speedup in the
operations I described? Am I missing something, or am I too obsessed with
the diagonal block?
- Is there a better way to permute A, without permuting G, and still get
the same result with 1 or N machines?
I have been avoiding asking for help for a while. I'm very sorry for
the long email.
Thank you very much for your time.
Best Regards,
Nelson
cd src/benchmarks/streams; /usr/bin/gmake --no-print-directory
PETSC_DIR=/ffs/u/u06189/petsc-3.6.1 PETSC_ARCH=arch-linux2-c-opt streams
/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/bin/mpicc -o MPIVersion.o -c -fPIC
-Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O
-I/ffs/u/u06189/petsc-3.6.1/include
-I/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/include `pwd`/MPIVersion.c
Number of MPI processes 1 Processor names g03
Triad: 4553.9730 Rate (MB/s)
Number of MPI processes 2 Processor names g03 g05
Triad: 8889.4844 Rate (MB/s)
Number of MPI processes 3 Processor names g03 g05 g06
Triad: 13226.0278 Rate (MB/s)
Number of MPI processes 4 Processor names g03 g05 g06 g07
Triad: 17988.5031 Rate (MB/s)
Number of MPI processes 5 Processor names g03 g05 g06 g07 g08
Triad: 22114.8242 Rate (MB/s)
Number of MPI processes 6 Processor names g03 g05 g06 g07 g08 g09
Triad: 26681.4045 Rate (MB/s)
Number of MPI processes 7 Processor names g03 g05 g06 g07 g08 g09 g10
Triad: 30928.1567 Rate (MB/s)
Number of MPI processes 8 Processor names g03 g05 g06 g07 g08 g09 g10 g11
Triad: 35280.7935 Rate (MB/s)
Number of MPI processes 9 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
Triad: 20933.0419 Rate (MB/s)
Number of MPI processes 10 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05
Triad: 23150.1922 Rate (MB/s)
Number of MPI processes 11 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06
Triad: 25409.3204 Rate (MB/s)
Number of MPI processes 12 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07
Triad: 27693.9999 Rate (MB/s)
Number of MPI processes 13 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08
Triad: 29870.9387 Rate (MB/s)
Number of MPI processes 14 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09
Triad: 32111.8358 Rate (MB/s)
Number of MPI processes 15 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10
Triad: 34454.1477 Rate (MB/s)
Number of MPI processes 16 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11
Triad: 36586.8003 Rate (MB/s)
Number of MPI processes 17 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03
Triad: 26209.6191 Rate (MB/s)
Number of MPI processes 18 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05
Triad: 27775.6575 Rate (MB/s)
Number of MPI processes 19 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06
Triad: 29163.9278 Rate (MB/s)
Number of MPI processes 20 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07
Triad: 30690.5290 Rate (MB/s)
Number of MPI processes 21 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08
Triad: 32141.5457 Rate (MB/s)
Number of MPI processes 22 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09
Triad: 33624.8884 Rate (MB/s)
Number of MPI processes 23 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10
Triad: 35163.7506 Rate (MB/s)
Number of MPI processes 24 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11
Triad: 36706.8438 Rate (MB/s)
Number of MPI processes 25 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03
Triad: 28884.2680 Rate (MB/s)
Number of MPI processes 26 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05
Triad: 29888.3408 Rate (MB/s)
Number of MPI processes 27 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06
Triad: 30968.2941 Rate (MB/s)
Number of MPI processes 28 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07
Triad: 32060.3097 Rate (MB/s)
Number of MPI processes 29 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08
Triad: 33142.2832 Rate (MB/s)
Number of MPI processes 30 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08
g09
Triad: 34215.7163 Rate (MB/s)
Number of MPI processes 31 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08
g09 g10
Triad: 35377.3336 Rate (MB/s)
Number of MPI processes 32 Processor names g03 g05 g06 g07 g08 g09 g10 g11 g03
g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08
g09 g10 g11
Triad: 36624.4362 Rate (MB/s)
------------------------------------------------
np speedup
1 1.0
2 1.95
3 2.9
4 3.95
5 4.86
6 5.86
7 6.79
8 7.75
9 4.6
10 5.08
11 5.58
12 6.08
13 6.56
14 7.05
15 7.57
16 8.03
17 5.76
18 6.1
19 6.4
20 6.74
21 7.06
22 7.38
23 7.72
24 8.06
25 6.34
26 6.56
27 6.8
28 7.04
29 7.28
30 7.51
31 7.77
32 8.04
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 8 node(s)
Unable to open matplotlib to plot speedup