Hello.

I am sorry for the long delay in responding. I decided to rewrite my application in a different way and will send the -log_summary output when I am done reimplementing it.

As for the machine, I am using mpirun to run jobs on an 8-node cluster. I modified the makefile in the streams folder so it would run using my hostfile. The output is attached to this email. It seems reasonable for a cluster with 8 machines. According to "lscpu", each machine has 1 socket with 4 cores.
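
For reference, the change amounts to launching the benchmark binary through a hostfile, along these lines (Open MPI syntax; "myhostfile" is a placeholder name for my file listing the 8 nodes):

mpirun --hostfile myhostfile -np 16 ./MPIVersion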

Cheers,
Nelson


On 2015-07-24 16:50, Barry Smith wrote:
It would be very helpful if you ran the code on, say, 1, 2, 4, 8, 16,
... processes with the option -log_summary and sent (as attachments)
the log summary information.

   Also on the same machine run the streams benchmark; with recent
releases of PETSc you only need to do

cd $PETSC_DIR
make streams NPMAX=16 (or whatever your largest process count is)

and send the output.

I suspect that you are doing everything fine and it is more an issue
with the configuration of your machine. Also read the information at
http://www.mcs.anl.gov/petsc/documentation/faq.html#computers on
"binding"

  Barry

On Jul 24, 2015, at 10:41 AM, Nelson Filipe Lopes da Silva <[email protected]> wrote:

Hello,

I have been using PETSc for a few months now, and it truly is a fantastic piece of software.

In my particular case I am working with a large, sparse, distributed (MPIAIJ) matrix, which we can refer to as 'G'. G is a horizontally rectangular matrix (for example, 1.1 million rows by 2.1 million columns). This matrix is usually very sparse and not diagonally 'heavy' (for example, 5.2 million nonzeros, of which ~50% are in the diagonal block of the MPIAIJ representation). To work with this matrix, I also have a few parallel vectors (created using MatCreateVecs), which we can refer to as 'm' and 'k'. I am trying to parallelize an iterative algorithm in which the most computationally heavy operations are:
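
For concreteness, a minimal sketch of this setup (the sizes and preallocation counts are illustrative placeholders, not my real figures):

#include <petscmat.h>

Mat      G;
Vec      m, b;
PetscInt M = 1100000, N = 2100000;              /* illustrative global sizes */

MatCreate(PETSC_COMM_WORLD, &G);
MatSetSizes(G, PETSC_DECIDE, PETSC_DECIDE, M, N);
MatSetType(G, MATMPIAIJ);
MatMPIAIJSetPreallocation(G, 3, NULL, 2, NULL); /* placeholder nnz per row */
/* ... MatSetValues() + MatAssemblyBegin()/MatAssemblyEnd() ... */
MatCreateVecs(G, &m, &b);  /* m matches G's columns; b (and k) match G's rows */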

->Matrix-Vector Multiplication, more precisely G * m + k = b (MatMultAdd). From what I have been reading, to achive a good speedup in this operation, G should be as much diagonal as possible, due to overlapping communication and computation. But even when using a G matrix in which the diagonal block has ~95% of the nnz, I cannot get a decent speedup. Most of the times, the performance even gets worse.

->Matrix-Matrix Multiplication, in this case I need to perform G * G' = A, where A is later used on the linear solver and G' is transpose of G. The speedup in this operation is not worse, although is not very good.

->Linear problem solving. Lastly, In this operation I compute "Ax=b" from the last two operations. I tried to apply a RCM permutation to A to make it more diagonal, for better performance. However, the problem I faced was that, the permutation is performed locally in each processor and thus, the final result is different with different number of processors. I assume this was intended to reduce communication. The solution I found was
1-calculate A
2-calculate, localy to 1 machine, the RCM permutation IS using A
3-apply this permutation to the lines of G.
This works well, and A is generated as if RCM permuted. It is fine to do this operation in one machine because it is only done once while reading the input. The nnz of G become more spread and less diagonal, causing problems when calculating G * m + k = b.
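
A rough sketch of steps 2-3 (the variable names are hypothetical; it assumes Aseq is a sequential copy of A gathered onto one process, and that the resulting ordering has already been turned into a parallel IS matching G's row layout):

IS rperm, cperm;
MatGetOrdering(Aseq, MATORDERINGRCM, &rperm, &cperm);  /* step 2: RCM on one machine */
/* ... scatter rperm into rpermParallel; cpermIdentity is an identity IS over the columns ... */
MatPermute(G, rpermParallel, cpermIdentity, &Gperm);   /* step 3: permute G's rows only */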

These 3 operations (except the permutation) are performed in each iteration of my algorithm.

So, my questions are:
-What characteristics of G lead to a good speedup in the operations I described? Am I missing something, or am I too obsessed with the diagonal block?

-Is there a better way to permute A, without permuting G, and still get the same result on 1 or N machines?


I have been avoiding asking for help for a while. I'm very sorry for the long email.
Thank you very much for your time.
Best Regards,
Nelson
cd src/benchmarks/streams; /usr/bin/gmake --no-print-directory PETSC_DIR=/ffs/u/u06189/petsc-3.6.1 PETSC_ARCH=arch-linux2-c-opt streams
/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/bin/mpicc -o MPIVersion.o -c -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O -I/ffs/u/u06189/petsc-3.6.1/include -I/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/include `pwd`/MPIVersion.c
Number of MPI processes 1 Processor names  g03 
Triad:         4553.9730   Rate (MB/s) 
Number of MPI processes 2 Processor names  g03 g05 
Triad:         8889.4844   Rate (MB/s) 
Number of MPI processes 3 Processor names  g03 g05 g06 
Triad:        13226.0278   Rate (MB/s) 
Number of MPI processes 4 Processor names  g03 g05 g06 g07 
Triad:        17988.5031   Rate (MB/s) 
Number of MPI processes 5 Processor names  g03 g05 g06 g07 g08 
Triad:        22114.8242   Rate (MB/s) 
Number of MPI processes 6 Processor names  g03 g05 g06 g07 g08 g09 
Triad:        26681.4045   Rate (MB/s) 
Number of MPI processes 7 Processor names  g03 g05 g06 g07 g08 g09 g10 
Triad:        30928.1567   Rate (MB/s) 
Number of MPI processes 8 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 
Triad:        35280.7935   Rate (MB/s) 
Number of MPI processes 9 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 
Triad:        20933.0419   Rate (MB/s) 
Number of MPI processes 10 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 
Triad:        23150.1922   Rate (MB/s) 
Number of MPI processes 11 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 
Triad:        25409.3204   Rate (MB/s) 
Number of MPI processes 12 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 
Triad:        27693.9999   Rate (MB/s) 
Number of MPI processes 13 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 
Triad:        29870.9387   Rate (MB/s) 
Number of MPI processes 14 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 
Triad:        32111.8358   Rate (MB/s) 
Number of MPI processes 15 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 
Triad:        34454.1477   Rate (MB/s) 
Number of MPI processes 16 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 
Triad:        36586.8003   Rate (MB/s) 
Number of MPI processes 17 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 
Triad:        26209.6191   Rate (MB/s) 
Number of MPI processes 18 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 
Triad:        27775.6575   Rate (MB/s) 
Number of MPI processes 19 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 
Triad:        29163.9278   Rate (MB/s) 
Number of MPI processes 20 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 
Triad:        30690.5290   Rate (MB/s) 
Number of MPI processes 21 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 
Triad:        32141.5457   Rate (MB/s) 
Number of MPI processes 22 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 
Triad:        33624.8884   Rate (MB/s) 
Number of MPI processes 23 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 
Triad:        35163.7506   Rate (MB/s) 
Number of MPI processes 24 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 
Triad:        36706.8438   Rate (MB/s) 
Number of MPI processes 25 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 
Triad:        28884.2680   Rate (MB/s) 
Number of MPI processes 26 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 
Triad:        29888.3408   Rate (MB/s) 
Number of MPI processes 27 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 
Triad:        30968.2941   Rate (MB/s) 
Number of MPI processes 28 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 
Triad:        32060.3097   Rate (MB/s) 
Number of MPI processes 29 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 
Triad:        33142.2832   Rate (MB/s) 
Number of MPI processes 30 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 
Triad:        34215.7163   Rate (MB/s) 
Number of MPI processes 31 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 
Triad:        35377.3336   Rate (MB/s) 
Number of MPI processes 32 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 
Triad:        36624.4362   Rate (MB/s) 
------------------------------------------------
np  speedup
1 1.0
2 1.95
3 2.9
4 3.95
5 4.86
6 5.86
7 6.79
8 7.75
9 4.6
10 5.08
11 5.58
12 6.08
13 6.56
14 7.05
15 7.57
16 8.03
17 5.76
18 6.1
19 6.4
20 6.74
21 7.06
22 7.38
23 7.72
24 8.06
25 6.34
26 6.56
27 6.8
28 7.04
29 7.28
30 7.51
31 7.77
32 8.04
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 8 node(s)
Unable to open matplotlib to plot speedup
