Hello.

I am sorry for the long delay in responding. I decided to rewrite my application in a different way and will send the -log_summary output when I am done reimplementing it.

As for the machine, I am using mpirun to run jobs on an 8-node cluster. I modified the makefile in the streams folder so it would run using my hostfile. The output is attached to this email. It seems reasonable for a cluster with 8 machines. According to "lscpu", each machine has 1 socket with 4 cores.
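
For reference, the change amounts to launching the benchmark binary through a hostfile, along these lines (Open MPI syntax; "myhostfile" is a placeholder name for my file listing the 8 nodes):

mpirun --hostfile myhostfile -np 16 ./MPIVersion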

Cheers,
Nelson


On 2015-07-24 16:50, Barry Smith wrote:
It would be very helpful if you ran the code on, say, 1, 2, 4, 8, 16,
... processes with the option -log_summary and sent (as attachments)
the log summary information.

   Also on the same machine run the streams benchmark; with recent
releases of PETSc you only need to do

cd $PETSC_DIR
make streams NPMAX=16 (or whatever your largest process count is)

and send the output.

I suspect that you are doing everything fine and it is more an issue
with the configuration of your machine. Also read the information at
http://www.mcs.anl.gov/petsc/documentation/faq.html#computers on
"binding"

  Barry

On Jul 24, 2015, at 10:41 AM, Nelson Filipe Lopes da Silva <[email protected]> wrote:

Hello,

I have been using PETSc for a few months now, and it truly is a fantastic piece of software.

In my particular case I am working with a large, sparse, distributed (MPIAIJ) matrix, which we can refer to as 'G'. G is a horizontally rectangular matrix (for example, 1.1 million rows by 2.1 million columns). This matrix is usually very sparse and not diagonally 'heavy' (for example, 5.2 million nonzeros, of which ~50% are in the diagonal block of the MPIAIJ representation). To work with this matrix, I also have a few parallel vectors (created using MatCreateVecs), which we can refer to as 'm' and 'k'. I am trying to parallelize an iterative algorithm in which the most computationally heavy operations are:
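
For concreteness, a minimal sketch of this setup (the sizes and preallocation counts are illustrative placeholders, not my real figures):

#include <petscmat.h>

Mat      G;
Vec      m, b;
PetscInt M = 1100000, N = 2100000;              /* illustrative global sizes */

MatCreate(PETSC_COMM_WORLD, &G);
MatSetSizes(G, PETSC_DECIDE, PETSC_DECIDE, M, N);
MatSetType(G, MATMPIAIJ);
MatMPIAIJSetPreallocation(G, 3, NULL, 2, NULL); /* placeholder nnz per row */
/* ... MatSetValues() + MatAssemblyBegin()/MatAssemblyEnd() ... */
MatCreateVecs(G, &m, &b);  /* m matches G's columns; b (and k) match G's rows */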

->Matrix-Vector Multiplication, more precisely G * m + k = b (MatMultAdd). From what I have been reading, to achive a good speedup in this operation, G should be as much diagonal as possible, due to overlapping communication and computation. But even when using a G matrix in which the diagonal block has ~95% of the nnz, I cannot get a decent speedup. Most of the times, the performance even gets worse.

->Matrix-Matrix Multiplication, in this case I need to perform G * G' = A, where A is later used on the linear solver and G' is transpose of G. The speedup in this operation is not worse, although is not very good.

->Linear problem solving. Lastly, In this operation I compute "Ax=b" from the last two operations. I tried to apply a RCM permutation to A to make it more diagonal, for better performance. However, the problem I faced was that, the permutation is performed locally in each processor and thus, the final result is different with different number of processors. I assume this was intended to reduce communication. The solution I found was
1-calculate A
2-calculate, localy to 1 machine, the RCM permutation IS using A
3-apply this permutation to the lines of G.
This works well, and A is generated as if RCM permuted. It is fine to do this operation in one machine because it is only done once while reading the input. The nnz of G become more spread and less diagonal, causing problems when calculating G * m + k = b.
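
A rough sketch of steps 2-3 (the variable names are hypothetical; it assumes Aseq is a sequential copy of A gathered onto one process, and that the resulting ordering has already been turned into a parallel IS matching G's row layout):

IS rperm, cperm;
MatGetOrdering(Aseq, MATORDERINGRCM, &rperm, &cperm);  /* step 2: RCM on one machine */
/* ... scatter rperm into rpermParallel; cpermIdentity is an identity IS over the columns ... */
MatPermute(G, rpermParallel, cpermIdentity, &Gperm);   /* step 3: permute G's rows only */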

These 3 operations (except the permutation) are performed in each iteration of my algorithm.

So, my questions are:
-What characteristics of G lead to a good speedup in the operations I described? Am I missing something, or am I too obsessed with the diagonal block?

-Is there a better way to permute A, without permuting G, and still get the same result on 1 or N machines?


I have been avoiding asking for help for a while. I'm very sorry for the long email.
Thank you very much for your time.
Best Regards,
Nelson
cd src/benchmarks/streams; /usr/bin/gmake --no-print-directory PETSC_DIR=/ffs/u/u06189/petsc-3.6.1 PETSC_ARCH=arch-linux2-c-opt streams
/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/bin/mpicc -o MPIVersion.o -c -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O -I/ffs/u/u06189/petsc-3.6.1/include -I/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/include `pwd`/MPIVersion.c
Number of MPI processes 1 Processor names  g03 
Triad:         4553.9730   Rate (MB/s) 
Number of MPI processes 2 Processor names  g03 g05 
Triad:         8889.4844   Rate (MB/s) 
Number of MPI processes 3 Processor names  g03 g05 g06 
Triad:        13226.0278   Rate (MB/s) 
Number of MPI processes 4 Processor names  g03 g05 g06 g07 
Triad:        17988.5031   Rate (MB/s) 
Number of MPI processes 5 Processor names  g03 g05 g06 g07 g08 
Triad:        22114.8242   Rate (MB/s) 
Number of MPI processes 6 Processor names  g03 g05 g06 g07 g08 g09 
Triad:        26681.4045   Rate (MB/s) 
Number of MPI processes 7 Processor names  g03 g05 g06 g07 g08 g09 g10 
Triad:        30928.1567   Rate (MB/s) 
Number of MPI processes 8 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 
Triad:        35280.7935   Rate (MB/s) 
Number of MPI processes 9 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 
Triad:        20933.0419   Rate (MB/s) 
Number of MPI processes 10 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 
Triad:        23150.1922   Rate (MB/s) 
Number of MPI processes 11 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 
Triad:        25409.3204   Rate (MB/s) 
Number of MPI processes 12 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 
Triad:        27693.9999   Rate (MB/s) 
Number of MPI processes 13 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 
Triad:        29870.9387   Rate (MB/s) 
Number of MPI processes 14 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 
Triad:        32111.8358   Rate (MB/s) 
Number of MPI processes 15 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 
Triad:        34454.1477   Rate (MB/s) 
Number of MPI processes 16 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 
Triad:        36586.8003   Rate (MB/s) 
Number of MPI processes 17 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 
Triad:        26209.6191   Rate (MB/s) 
Number of MPI processes 18 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 
Triad:        27775.6575   Rate (MB/s) 
Number of MPI processes 19 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 
Triad:        29163.9278   Rate (MB/s) 
Number of MPI processes 20 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 
Triad:        30690.5290   Rate (MB/s) 
Number of MPI processes 21 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 
Triad:        32141.5457   Rate (MB/s) 
Number of MPI processes 22 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 
Triad:        33624.8884   Rate (MB/s) 
Number of MPI processes 23 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 
Triad:        35163.7506   Rate (MB/s) 
Number of MPI processes 24 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 
Triad:        36706.8438   Rate (MB/s) 
Number of MPI processes 25 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 
Triad:        28884.2680   Rate (MB/s) 
Number of MPI processes 26 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 
Triad:        29888.3408   Rate (MB/s) 
Number of MPI processes 27 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 
Triad:        30968.2941   Rate (MB/s) 
Number of MPI processes 28 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 
Triad:        32060.3097   Rate (MB/s) 
Number of MPI processes 29 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 
Triad:        33142.2832   Rate (MB/s) 
Number of MPI processes 30 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 
Triad:        34215.7163   Rate (MB/s) 
Number of MPI processes 31 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 
Triad:        35377.3336   Rate (MB/s) 
Number of MPI processes 32 Processor names  g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 g03 g05 g06 g07 g08 g09 g10 g11 
Triad:        36624.4362   Rate (MB/s) 
------------------------------------------------
np  speedup
1 1.0
2 1.95
3 2.9
4 3.95
5 4.86
6 5.86
7 6.79
8 7.75
9 4.6
10 5.08
11 5.58
12 6.08
13 6.56
14 7.05
15 7.57
16 8.03
17 5.76
18 6.1
19 6.4
20 6.74
21 7.06
22 7.38
23 7.72
24 8.06
25 6.34
26 6.56
27 6.8
28 7.04
29 7.28
30 7.51
31 7.77
32 8.04
Estimation of possible speedup of MPI programs based on Streams benchmark.
It appears you have 8 node(s)
Unable to open matplotlib to plot speedup
