Hello.

Thank you very much for your time. I understood the idea, and it works very well. I also noticed that my algorithm performs a different number of iterations with different numbers of machines. The stop conditions are calculated using PETSc "MatMultAdd". I strongly suspect there is a bug in my code, or could it be something in PETSc? I also need to figure out why those VecMax ratios are so high. The VecSet ratio is understandable, as I am distributing the initial information from the root machine sequentially.
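For reference, the row interlacing Barry proposed in the quoted message (rows ordered 0, n/2, 1, n/2+1, ...) can be sketched in a few lines of Python. This only illustrates the ordering and is not the code used for these runs; the function name `interlace_rows` is hypothetical, and giving the first half the extra row when n is odd is an assumption.

```python
def interlace_rows(n):
    """Return the row order 0, n/2, 1, n/2+1, ... for a matrix with n rows.

    Assumption: when n is odd, the first half holds the extra row.
    """
    half = (n + 1) // 2
    order = []
    for i in range(half):
        order.append(i)             # row taken from the first half
        if half + i < n:
            order.append(half + i)  # matching row from the second half
    return order


print(interlace_rows(8))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

The resulting list could then serve as the row permutation applied to G (for instance as the indices of a PETSc IS), with the column ordering left unchanged.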
These are the new values:

1 machine
[0] Matrix diagonal_nnz:16800000 (100.00 %)
[0] Matrix local nnz: 16800000 (100.00 %), local rows: 800000 (100.00 %)
ExecTime: 4min47sec
Iterations: 236

2 machines
[0] Matrix diagonal_nnz:8000000 (95.24 %)
[1] Matrix diagonal_nnz:7600000 (90.48 %)
[0] Matrix local nnz: 8400000 (50.00 %), local rows: 400000 (50.00 %)
[1] Matrix local nnz: 8400000 (50.00 %), local rows: 400000 (50.00 %)
ExecTime: 5min26sec
Iterations: 330

3 machines
[0] Matrix diagonal_nnz:5333340 (95.24 %)
[1] Matrix diagonal_nnz:4800012 (85.71 %)
[2] Matrix diagonal_nnz:4533332 (80.95 %)
[0] Matrix local nnz: 5600007 (33.33 %), local rows: 266667 (33.33 %)
[1] Matrix local nnz: 5600007 (33.33 %), local rows: 266667 (33.33 %)
[2] Matrix local nnz: 5599986 (33.33 %), local rows: 266666 (33.33 %)
ExecTime: 5min25sec
Iterations: 346

The suggested permutation worked very well compared with the original matrix structure. The lack of speedup may be related to the different number of iterations.
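To separate the effect of the iteration count from the parallel performance, the times above can be divided by the number of iterations. The sketch below does only that arithmetic, using the ExecTime and Iterations values reported in this email. It assumes the whole wall time is spent in the iterations, which overstates the per-iteration cost a little because setup and input reading are included.

```python
# (total seconds, iterations) per machine count, taken from the values above
runs = {
    1: (4 * 60 + 47, 236),
    2: (5 * 60 + 26, 330),
    3: (5 * 60 + 25, 346),
}


def seconds_per_iteration(total_seconds, iterations):
    """Average wall time of one iteration (setup time is not subtracted)."""
    return total_seconds / iterations


base = seconds_per_iteration(*runs[1])
for machines, (secs, iters) in sorted(runs.items()):
    per_iter = seconds_per_iteration(secs, iters)
    print(f"{machines} machine(s): {per_iter:.3f} s/iteration, "
          f"per-iteration speedup {base / per_iter:.2f}x")
```

Under that assumption one iteration takes roughly 1.22 s on 1 machine, 0.99 s on 2 and 0.94 s on 3, so there is a modest per-iteration gain that the extra iterations hide in the total time.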
Once again, thank you very much for your time.

Cheers,
Nelson

On 2015-08-23 20:19, Barry Smith wrote:
A suggestion: take your second ordering and now interlace the second half of the rows with the first half of the rows (keeping the same column ordering). That is, order the rows 0, n/2, 1, n/2+1, 2, n/2+2, etc. This will take the two separate "diagonal" bands and form a single "diagonal band". This will increase the "diagonal block weight" to be pretty high, and the only scatters will need to be for the final rows of the input vector that all processes need to do their part of the multiply. Generate the image to make sure what I suggest makes sense and then run this ordering with 1, 2, and 3 processes. Send the logs.

Barry

On Aug 23, 2015, at 10:12 AM, Nelson Filipe Lopes da Silva <[email protected]> wrote:

Thank you for the fast response!

Yes. The last rows of the matrix are indeed denser than the remaining ones. For this example, concerning load balance between machines, the last process had 46% of the matrix nonzero entries. A few weeks ago I suspected this problem and wrote a little function that could permute the matrix rows based on their number of nonzeros. However, the matrix would become less pleasant regarding "diagonal block weight", and I stopped using it as I thought I was making things worse.

Also, due to this problem, I thought I could have a complete vector copy in each processor, instead of a distributed vector. I tried to implement this idea, but had no luck with the results. However, even if this solution worked, the communication for the vector update would still be unavoidable once per iteration of my algorithm. Since this is a rectangular matrix, I cannot apply RCM or similar permutations; however, I can permute rows and columns.

More specifically, the problem I'm trying to solve is one of balancing the best guess and uncertainty estimates of a set of Input-Output subject to linear constraints and ancillary information. The matrix is called an aggregation matrix, and each entry can be 1, 0 or -1. I don't know the cause of its nonzero structure.
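The "diagonal block weight" mentioned above can be made concrete with a small sketch. It assumes the default contiguous layout in which both rows and columns are split as evenly as possible across ranks (the first `n % size` ranks get one extra index), and it counts an entry as "diagonal" when its column falls in the column range owned by the rank that owns its row. The names `ownership_ranges` and `diagonal_fraction` are hypothetical helpers, not PETSc calls.

```python
def ownership_ranges(n, size):
    """Contiguous, near-even split of n indices over `size` ranks."""
    ranges, start = [], 0
    for rank in range(size):
        count = n // size + (1 if rank < n % size else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def diagonal_fraction(entries, n_rows, n_cols, size):
    """Return a (diagonal_nnz, local_nnz) pair per rank.

    entries: iterable of (row, col) positions of the nonzeros.
    """
    row_ranges = ownership_ranges(n_rows, size)
    col_ranges = ownership_ranges(n_cols, size)
    diag = [0] * size
    local = [0] * size
    for r, c in entries:
        # rank that owns row r
        rank = next(k for k, (lo, hi) in enumerate(row_ranges) if lo <= r < hi)
        local[rank] += 1
        lo, hi = col_ranges[rank]
        if lo <= c < hi:  # column owned by the same rank: diagonal block
            diag[rank] += 1
    return list(zip(diag, local))


# Toy 4 x 8 matrix on 2 ranks
toy = [(0, 0), (0, 5), (1, 1), (2, 4), (2, 0), (3, 7)]
print(diagonal_fraction(toy, 4, 8, 2))  # [(2, 3), (2, 3)]
```

Dividing the first number of each pair by the second gives a per-rank percentage of the kind quoted in this thread.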
I'm addressing this problem using a weighted least-squares algorithm.

I ran the code with a different, more friendly problem topology, logging the load of nonzero entries and the "diagonal load" per processor. I'm sending images of both matrices' nonzero structure. The example in the last email used matrix1; the example in this email uses matrix2. Matrix1 (last email's example) is 1.098.939 rows x 2.039.681 columns with 5.171.901 nnz. Matrix2 (this email's example) is 800.000 rows x 8.800.000 columns with 16.800.000 nnz.

With 1, 2 and 3 machines, I have these distributions of nonzeros (using matrix2). I'm sending the logs in this email.

1 machine
[0] Matrix diagonal_nnz:16800000 (100.00 %)
[0] Matrix local nnz: 16800000 (100.00 %), local rows: 800000 (100.00 %)
ExecTime: 4min47sec

2 machines
[0] Matrix diagonal_nnz:4400000 (52.38 %)
[1] Matrix diagonal_nnz:4000000 (47.62 %)
[0] Matrix local nnz: 8400000 (50.00 %), local rows: 400000 (50.00 %)
[1] Matrix local nnz: 8400000 (50.00 %), local rows: 400000 (50.00 %)
ExecTime: 13min23sec

3 machines
[0] Matrix diagonal_nnz:2933334 (52.38 %)
[1] Matrix diagonal_nnz:533327 (9.52 %)
[2] Matrix diagonal_nnz:2399999 (42.86 %)
[0] Matrix local nnz: 5600007 (33.33 %), local rows: 266667 (33.33 %)
[1] Matrix local nnz: 5600007 (33.33 %), local rows: 266667 (33.33 %)
[2] Matrix local nnz: 5599986 (33.33 %), local rows: 266666 (33.33 %)
ExecTime: 20min26sec

As for the network, I ran "make streams NPMAX=3" again. I'm also sending it in this email.

I too think that these bad results are caused by a combination of bad matrix structure, especially the "diagonal weight", and maybe the network.

I really should find a way to permute these matrices to a more friendly structure.

Thank you very much for the help.
Nelson

On 2015-08-22 22:49, Barry Smith wrote:

On Aug 22, 2015, at 4:17 PM, Nelson Filipe Lopes da Silva <[email protected]> wrote:

Hi.

I managed to finish the re-implementation. I ran the program with 1, 2, 3, 4, 5 and 6 machines and saved the summary.
I send each of them in this email. In these executions, the program performs Matrix-Vector (MatMult, MatMultAdd) products and Vector-Vector operations. From what I understand while reading the logs, the program spends most of the time in "VecScatterEnd". In this example, the matrix taking part in the Matrix-Vector products is not very "diagonal heavy". The following numbers are the percentages of nnz values in the matrix diagonal block for each machine, and each execution time.

NMachines   %NNZ               ExecTime
1           machine0 100%      16min08sec
2           machine0 91.1%     24min58sec
            machine1 69.2%
3           machine0 90.9%     25min42sec
            machine1 82.8%
            machine2 51.6%
4           machine0 91.9%     26min27sec
            machine1 82.4%
            machine2 73.1%
            machine3 39.9%
5           machine0 93.2%     39min23sec
            machine1 82.8%
            machine2 74.4%
            machine3 64.6%
            machine4 31.6%
6           machine0 94.2%     54min54sec
            machine1 82.6%
            machine2 73.1%
            machine3 65.2%
            machine4 55.9%
            machine5 25.4%

Based on this I am guessing the last rows of the matrix have a lot of nonzeros away from the diagonal? There is a big load imbalance in something: for example with 2 processes you have

VecMax 10509 1.0 2.0602e+02 4.2 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+04 9 0 0 0 72 9 0 0 0 72 0
VecScatterEnd 18128 1.0 8.9404e+02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 53 0 0 0 0 53 0 0 0 0 0
MatMult 10505 1.0 6.5591e+02 1.4 3.16e+10 1.4 2.1e+04 1.2e+06 0.0e+00 37 33 58 38 0 37 33 58 38 0 83
MatMultAdd 7624 1.0 7.0028e+02 2.3 3.26e+10 2.1 1.5e+04 2.8e+06 0.0e+00 34 29 42 62 0 34 29 42 62 0 69

The 5th column has the imbalance between the slowest and fastest process. It is 4.2 for VecMax, 1.4 for MatMult and 2.3 for MatMultAdd; to get good speedups these need to be much closer to 1.

How many nonzeros in the matrix are there per process? Is it very different for different processes? You really need to have each process have a similar number of matrix nonzeros. Do you have a picture of the nonzero structure of the matrix? Where does the matrix come from, why does it have this structure?
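One possible way to act on the point about nonzeros per process is to choose the number of contiguous rows per process from the row nonzero counts instead of splitting the rows evenly. The greedy sketch below is an illustration under that assumption, not something from the original application; `balanced_row_counts` is a hypothetical helper. It closes the block of a process as soon as the running nonzero total reaches that process's proportional share.

```python
def balanced_row_counts(row_nnz, size):
    """Split rows into `size` contiguous blocks with roughly equal nnz.

    row_nnz: number of nonzeros in each row, in row order.
    Returns the number of rows assigned to each process.
    """
    total = sum(row_nnz)
    counts, start, running, rank = [], 0, 0, 1
    for i, nnz in enumerate(row_nnz):
        running += nnz
        # close blocks whose cumulative share of the nonzeros is reached
        while rank < size and running * size >= rank * total:
            counts.append(i + 1 - start)
            start = i + 1
            rank += 1
    while len(counts) < size - 1:  # only needed when there are no rows
        counts.append(0)
    counts.append(len(row_nnz) - start)
    return counts


# Six sparse rows followed by two denser rows, on 2 processes
print(balanced_row_counts([1, 1, 1, 1, 1, 1, 3, 3], 2))  # [6, 2]
```

In this toy case an even split of 4 + 4 rows would give 4 and 8 nonzeros, while the 6 + 2 split gives 6 and 6. Counts like these could be passed as the local row sizes in MatSetSizes instead of PETSC_DECIDE, provided the vectors are created with a matching layout.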
Also likely there are just too many vector entries that need to be scattered to the last process for the matmults.

In this implementation I'm using MatCreate and VecCreate. I'm also leaving the partition sizes as PETSC_DECIDE.

Finally, to run the application, I'm using mpirun.hydra from mpich, downloaded by the PETSc configure script. I'm checking the process assignment as suggested in the last email.

Am I missing anything?

Your network is very poor; likely ethernet. It is hard to get much speedup with such slow reductions and sends and receives.

Average time to get PetscTime(): 1.19209e-07
Average time for MPI_Barrier(): 0.000215769
Average time for zero size MPI_Send(): 5.94854e-05

I think you are seeing such bad results due to an unkind matrix nonzero structure giving poor load balance and too much communication, and a very poor computer network that just makes all the needed communication totally dominate.

Regards,
Nelson

On 2015-08-20 16:17, Matthew Knepley wrote:

On Thu, Aug 20, 2015 at 6:30 AM, Nelson Filipe Lopes da Silva <[email protected]> wrote:

Hello.

I am sorry for the long time without a response. I decided to rewrite my application in a different way and will send the log_summary output when done reimplementing.

As for the machine, I am using mpirun to run jobs in an 8 node cluster. I modified the makefile in the streams folder so it would run using my hostfile. The output is attached to this email. It seems reasonable for a cluster with 8 machines. From "lscpu", each machine's CPU has 4 cores and 1 socket.

1) Your launcher is placing processes haphazardly. I would figure out how to assign them to certain nodes.
2) Each node has enough bandwidth for 1 core, so it does not make much sense to use more than 1.

Thanks,
Matt

Cheers,
Nelson

On 2015-07-24 16:50, Barry Smith wrote:

It would be very helpful if you ran the code on say 1, 2, 4, 8, 16 ... processes with the option -log_summary and send (as attachments) the log summary information.
Also on the same machine run the streams benchmark; with recent releases of PETSc you only need to do

cd $PETSC_DIR
make streams NPMAX=16 (or whatever your largest process count is)

and send the output.

I suspect that you are doing everything fine and it is more an issue with the configuration of your machine. Also read the information at http://www.mcs.anl.gov/petsc/documentation/faq.html#computers on "binding".

Barry

On Jul 24, 2015, at 10:41 AM, Nelson Filipe Lopes da Silva <[email protected]> wrote:

Hello,

I have been using PETSc for a few months now, and it truly is a fantastic piece of software.

In my particular example I am working with a large, sparse distributed (MPI AIJ) matrix we can refer to as 'G'. G is a horizontal, rectangular matrix (for example, 1,1 million rows by 2,1 million columns). This matrix is commonly very sparse and not diagonal 'heavy' (for example 5,2 million nnz, of which ~50% are in the diagonal block of the MPI AIJ representation). To work with this matrix, I also have a few parallel vectors (created using MatCreateVecs), which we can refer to as 'm' and 'k'.

I am trying to parallelize an iterative algorithm in which the most computationally heavy operations are:

-> Matrix-Vector Multiplication, more precisely G * m + k = b (MatMultAdd). From what I have been reading, to achieve a good speedup in this operation, G should be as diagonal as possible, due to overlapping communication and computation. But even when using a G matrix in which the diagonal block has ~95% of the nnz, I cannot get a decent speedup. Most of the time, the performance even gets worse.

-> Matrix-Matrix Multiplication. In this case I need to perform G * G' = A, where A is later used in the linear solver and G' is the transpose of G. The speedup in this operation is not worse, although it is not very good.

-> Linear problem solving. Lastly, in this operation I compute "Ax=b" from the last two operations. I tried to apply an RCM permutation to A to make it more diagonal, for better performance.
However, the problem I faced was that the permutation is performed locally in each processor and thus the final result is different with different numbers of processors. I assume this was intended to reduce communication. The solution I found was:

1- calculate A
2- calculate, locally on 1 machine, the RCM permutation IS using A
3- apply this permutation to the rows of G.

This works well, and A is generated as if RCM permuted. It is fine to do this operation on one machine because it is only done once while reading the input. The nnz of G become more spread out and less diagonal, causing problems when calculating G * m + k = b.

These 3 operations (except the permutation) are performed in each iteration of my algorithm.

So, my questions are:

- What are the characteristics of G that lead to a good speedup in the operations I described? Am I missing something, and am I too obsessed with the diagonal block?
- Is there a better way to permute A without permuting G and still get the same result using 1 or N machines?

I have been avoiding asking for help for a while. I'm very sorry for the long email.

Thank you very much for your time.

Best Regards,
Nelson

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener

<Log01P.txt><Log02P.txt><Log03P.txt><Log04P.txt><Log05P.txt><Log06P.txt><Log01P.txt><Log02P.txt><Log03P.txt><matrix1.png><matrix2.png><streams.out>
---------------------------------------------- PETSc Performance Summary:
----------------------------------------------
./bin/balance on a arch-linux2-c-opt named g03 with 1 processor, by u06189 Mon
Aug 24 12:24:40 2015
Using Petsc Release Version 3.6.1, Jul, 22, 2015
Max Max/Min Avg Total
Time (sec): 2.792e+02 1.00000 2.792e+02
Objects: 4.300e+01 1.00000 4.300e+01
Flops: 4.452e+10 1.00000 4.452e+10 4.452e+10
Flops/sec: 1.595e+08 1.00000 1.595e+08 1.595e+08
MPI Messages: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.00000 0.000e+00 0.000e+00
MPI Reductions: 0.000e+00 0.00000
Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N
flops
and VecAXPY() for complex vectors of length N -->
8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- --
Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total
Avg %Total counts %Total
0: Main Stage: 2.7920e+02 100.0% 4.4521e+10 100.0% 0.000e+00 0.0%
0.000e+00 0.0% 0.000e+00 0.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting
output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and
PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in
this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all
processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flops
--- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct
%T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecMax 475 1.0 6.1628e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 2 0 0 0 0 2 0 0 0 0 0
VecScale 1180 1.0 1.5178e+01 1.0 4.64e+09 1.0 0.0e+00 0.0e+00
0.0e+00 5 10 0 0 0 5 10 0 0 0 306
VecSet 19 1.0 3.5936e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAYPX 236 1.0 8.7991e+00 1.0 4.15e+09 1.0 0.0e+00 0.0e+00
0.0e+00 3 9 0 0 0 3 9 0 0 0 472
VecSwap 472 1.0 1.5221e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 5 0 0 0 0 5 0 0 0 0 0
VecAssemblyBegin 5 1.0 1.9073e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAssemblyEnd 5 1.0 7.1526e-07 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecPointwiseMult 474 1.0 2.2135e+01 1.0 4.16e+09 1.0 0.0e+00 0.0e+00
0.0e+00 8 9 0 0 0 8 9 0 0 0 188
VecScatterBegin 947 1.0 3.8370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMult 472 1.0 3.0491e+01 1.0 1.36e+10 1.0 0.0e+00 0.0e+00
0.0e+00 11 31 0 0 0 11 31 0 0 0 446
MatMultAdd 473 1.0 3.0051e+01 1.0 1.59e+10 1.0 0.0e+00 0.0e+00
0.0e+00 11 36 0 0 0 11 36 0 0 0 529
MatConvert 1 1.0 1.4064e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 4 1.0 1.0300e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 4 1.0 1.1529e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRow 800000 1.0 1.8402e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTranspose 1 1.0 1.9155e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 1 0 0 0 0 1 0 0 0 0 0
SFSetGraph 1 1.0 1.2875e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFReduceBegin 1 1.0 1.1086e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFReduceEnd 1 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Vector 21 21 601632048 0
Vector Scatter 5 5 3240 0
Index Set 6 6 4608 0
Matrix 9 9 1040023568 0
Star Forest Bipartite Graph 1 1 840 0
Viewer 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 9.53674e-08
#PETSc Option Table entries:
-log_summary
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-fc=0 --with-cxx=0 --with-debugging=0
--download-mpich=1 --download-f2cblaslapack=1
-----------------------------------------
Libraries compiled on Thu Jul 30 15:55:55 2015 on g03
Machine characteristics: Linux-3.16.7-21-desktop-x86_64-with-SuSE-13.2-x86_64
Using PETSc directory: /ffs/u/u06189/petsc-3.6.1
Using PETSc arch: arch-linux2-c-opt
-----------------------------------------
Using C compiler: /ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/bin/mpicc -fPIC
-Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O
${COPTFLAGS} ${CFLAGS}
-----------------------------------------
Using include paths: -I/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/include
-I/ffs/u/u06189/petsc-3.6.1/include -I/ffs/u/u06189/petsc-3.6.1/include
-I/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/include
-----------------------------------------
Using C linker: /ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/bin/mpicc
Using libraries: -Wl,-rpath,/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/lib
-L/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/lib -lpetsc
-Wl,-rpath,/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/lib
-L/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/lib -lf2clapack -lf2cblas -lm
-lX11 -lpthread -lssl -lcrypto -lm -ldl
-----------------------------------------
---------------------------------------------- PETSc Performance Summary:
----------------------------------------------
./bin/balance on a arch-linux2-c-opt named g03 with 2 processors, by u06189 Mon
Aug 24 12:30:08 2015
Using Petsc Release Version 3.6.1, Jul, 22, 2015
Max Max/Min Avg Total
Time (sec): 3.269e+02 1.00000 3.269e+02
Objects: 4.300e+01 1.00000 4.300e+01
Flops: 3.102e+10 1.01732 3.076e+10 6.151e+10
Flops/sec: 9.488e+07 1.01732 9.407e+07 1.881e+08
MPI Messages: 1.341e+03 1.00000 1.341e+03 2.682e+03
MPI Message Lengths: 3.937e+09 1.00041 2.935e+06 7.872e+09
MPI Reductions: 1.384e+03 1.00000
Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N
flops
and VecAXPY() for complex vectors of length N -->
8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- --
Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total
Avg %Total counts %Total
0: Main Stage: 3.2694e+02 100.0% 6.1511e+10 100.0% 2.682e+03 100.0%
2.935e+06 100.0% 1.383e+03 99.9%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting
output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and
PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in
this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all
processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flops
--- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct
%T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecMax 663 1.0 1.1879e+01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00
6.6e+02 3 0 0 0 48 3 0 0 0 48 0
VecScale 1650 1.0 8.8260e+00 1.0 2.75e+09 1.0 0.0e+00 0.0e+00
0.0e+00 3 9 0 0 0 3 9 0 0 0 623
VecSet 5 1.0 4.1822e-02 18.7 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAYPX 330 1.0 6.1460e+00 1.0 2.90e+09 1.0 0.0e+00 0.0e+00
0.0e+00 2 9 0 0 0 2 9 0 0 0 945
VecSwap 660 1.0 1.0808e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 3 0 0 0 0 3 0 0 0 0 0
VecAssemblyBegin 5 1.0 3.8655e-01 1.1 0.00e+00 0.0 8.0e+00 2.0e+07
1.5e+01 0 0 0 2 1 0 0 0 2 1 0
VecAssemblyEnd 5 1.0 2.1213e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecPointwiseMult 662 1.0 1.9822e+01 1.3 2.91e+09 1.0 0.0e+00 0.0e+00
0.0e+00 5 9 0 0 0 5 9 0 0 0 293
VecScatterBegin 1323 1.0 3.2569e+00 1.2 0.00e+00 0.0 2.6e+03 2.9e+06
2.0e+00 1 0 99 96 0 1 0 99 96 0 0
VecScatterEnd 1321 1.0 8.6896e+01 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 25 0 0 0 0 25 0 0 0 0 0
MatMult 660 1.0 6.5790e+01 1.2 9.90e+09 1.1 1.3e+03 2.5e+06
0.0e+00 18 31 49 42 0 18 31 49 42 0 293
MatMultAdd 661 1.0 7.4513e+01 1.1 1.11e+10 1.0 1.3e+03 3.2e+06
0.0e+00 22 36 49 54 0 22 36 49 54 0 298
MatConvert 1 1.0 7.7143e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 4 1.0 1.9853e+00 6.8 0.00e+00 0.0 9.0e+00 1.7e+07
8.0e+00 0 0 0 2 1 0 0 0 2 1 0
MatAssemblyEnd 4 1.0 3.6846e+00 1.0 0.00e+00 0.0 8.0e+00 6.2e+05
1.6e+01 1 0 0 0 1 1 0 0 0 1 0
MatGetRow 400000 1.0 9.1753e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTranspose 1 1.0 1.4510e+00 1.0 0.00e+00 0.0 1.5e+01 1.8e+06
1.2e+01 0 0 1 0 1 0 0 1 0 1 0
SFSetGraph 1 1.0 1.0504e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFReduceBegin 1 1.0 4.6370e-02 1.0 0.00e+00 0.0 5.0e+00 1.3e+06
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFReduceEnd 1 1.0 4.2277e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Vector 21 21 342752944 0
Vector Scatter 5 5 4488 0
Index Set 6 6 1764608 0
Matrix 9 9 528183568 0
Star Forest Bipartite Graph 1 1 840 0
Viewer 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 1.19209e-07
Average time for MPI_Barrier(): 7.30038e-05
Average time for zero size MPI_Send(): 5.05447e-05
#PETSc Option Table entries:
-log_summary
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-fc=0 --with-cxx=0 --with-debugging=0
--download-mpich=1 --download-f2cblaslapack=1
-----------------------------------------
Libraries compiled on Thu Jul 30 15:55:55 2015 on g03
Machine characteristics: Linux-3.16.7-21-desktop-x86_64-with-SuSE-13.2-x86_64
Using PETSc directory: /ffs/u/u06189/petsc-3.6.1
Using PETSc arch: arch-linux2-c-opt
-----------------------------------------
Using C compiler: /ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/bin/mpicc -fPIC
-Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O
${COPTFLAGS} ${CFLAGS}
-----------------------------------------
Using include paths: -I/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/include
-I/ffs/u/u06189/petsc-3.6.1/include -I/ffs/u/u06189/petsc-3.6.1/include
-I/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/include
-----------------------------------------
Using C linker: /ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/bin/mpicc
Using libraries: -Wl,-rpath,/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/lib
-L/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/lib -lpetsc
-Wl,-rpath,/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/lib
-L/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/lib -lf2clapack -lf2cblas -lm
-lX11 -lpthread -lssl -lcrypto -lm -ldl
-----------------------------------------
---------------------------------------------- PETSc Performance Summary:
----------------------------------------------
./bin/balance on a arch-linux2-c-opt named g03 with 3 processors, by u06189 Mon
Aug 24 12:35:35 2015
Using Petsc Release Version 3.6.1, Jul, 22, 2015
Max Max/Min Avg Total
Time (sec): 3.252e+02 1.00002 3.252e+02
Objects: 4.300e+01 1.00000 4.300e+01
Flops: 2.176e+10 1.02655 2.154e+10 6.462e+10
Flops/sec: 6.692e+07 1.02657 6.624e+07 1.987e+08
MPI Messages: 2.102e+03 1.48534 1.873e+03 5.620e+03
MPI Message Lengths: 5.380e+09 1.89077 2.362e+06 1.328e+10
MPI Reductions: 1.448e+03 1.00000
Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N
flops
and VecAXPY() for complex vectors of length N -->
8N flops
Summary of Stages: ----- Time ------ ----- Flops ----- --- Messages --- --
Message Lengths -- -- Reductions --
Avg %Total Avg %Total counts %Total
Avg %Total counts %Total
0: Main Stage: 3.2521e+02 100.0% 6.4624e+10 100.0% 5.620e+03 100.0%
2.362e+06 100.0% 1.447e+03 99.9%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting
output.
Phase summary info:
Count: number of times phase was executed
Time and Flops: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
Avg. len: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and
PetscLogStagePop().
%T - percent time in this phase %F - percent flops in this phase
%M - percent messages in this phase %L - percent message lengths in
this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flops over all processors)/(max time over all
processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flops
--- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct
%T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecMax 695 1.0 2.8613e+01 8.4 0.00e+00 0.0 0.0e+00 0.0e+00
7.0e+02 4 0 0 0 48 4 0 0 0 48 0
VecScale 1730 1.0 5.7747e+00 1.0 1.90e+09 1.0 0.0e+00 0.0e+00
0.0e+00 2 9 0 0 0 2 9 0 0 0 988
VecSet 5 1.0 4.2759e-02 21.1 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAYPX 346 1.0 5.1136e+00 1.2 2.03e+09 1.0 0.0e+00 0.0e+00
0.0e+00 1 9 0 0 0 1 9 0 0 0 1191
VecSwap 692 1.0 7.5595e+00 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 2 0 0 0 0 2 0 0 0 0 0
VecAssemblyBegin 5 1.0 1.0432e+00 2.6 0.00e+00 0.0 1.6e+01 1.4e+07
1.5e+01 0 0 0 2 1 0 0 0 2 1 0
VecAssemblyEnd 5 1.0 2.8044e+00 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 1 0 0 0 0 1 0 0 0 0 0
VecPointwiseMult 694 1.0 1.2777e+01 1.2 2.03e+09 1.0 0.0e+00 0.0e+00
0.0e+00 3 9 0 0 0 3 9 0 0 0 477
VecScatterBegin 1387 1.0 4.1250e+00 1.7 0.00e+00 0.0 5.5e+03 2.3e+06
2.0e+00 1 0 99 97 0 1 0 99 97 0 0
VecScatterEnd 1385 1.0 1.2850e+02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 35 0 0 0 0 35 0 0 0 0 0
MatMult 692 1.0 7.4612e+01 1.1 7.02e+09 1.1 2.8e+03 1.9e+06
0.0e+00 22 32 49 41 0 22 32 49 41 0 273
MatMultAdd 693 1.0 9.1343e+01 1.5 7.76e+09 1.0 2.8e+03 2.7e+06
0.0e+00 25 36 49 56 0 25 36 49 56 0 255
MatConvert 1 1.0 6.6977e-02 1.3 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 4 1.0 1.9312e+00 3.6 0.00e+00 0.0 1.8e+01 1.2e+07
8.0e+00 0 0 0 2 1 0 0 0 2 1 0
MatAssemblyEnd 4 1.0 4.1075e+00 1.0 0.00e+00 0.0 1.6e+01 4.9e+05
1.6e+01 1 0 0 0 1 1 0 0 0 1 0
MatGetRow 266667 1.0 6.2255e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatTranspose 1 1.0 1.3992e+00 1.0 0.00e+00 0.0 3.0e+01 1.6e+06
1.2e+01 0 0 1 0 1 0 0 1 0 1 0
SFSetGraph 1 1.0 1.6020e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFReduceBegin 1 1.0 5.7359e-02 1.1 0.00e+00 0.0 1.0e+01 1.1e+06
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFReduceEnd 1 1.0 5.2850e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
0.0e+00 0 0 0 0 0 0 0 0 0 0 0
------------------------------------------------------------------------------------------------------------------------
Memory usage is given in bytes:
Object Type Creations Destructions Memory Descendants' Mem.
Reports information only for process 0.
--- Event Stage 0: Main Stage
Vector 21 21 251979776 0
Vector Scatter 5 5 4488 0
Index Set 6 6 1177996 0
Matrix 9 9 352130560 0
Star Forest Bipartite Graph 1 1 840 0
Viewer 1 0 0 0
========================================================================================================================
Average time to get PetscTime(): 1.19209e-07
Average time for MPI_Barrier(): 8.8501e-05
Average time for zero size MPI_Send(): 4.20411e-05
#PETSc Option Table entries:
-log_summary
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-fc=0 --with-cxx=0 --with-debugging=0
--download-mpich=1 --download-f2cblaslapack=1
-----------------------------------------
Libraries compiled on Thu Jul 30 15:55:55 2015 on g03
Machine characteristics: Linux-3.16.7-21-desktop-x86_64-with-SuSE-13.2-x86_64
Using PETSc directory: /ffs/u/u06189/petsc-3.6.1
Using PETSc arch: arch-linux2-c-opt
-----------------------------------------
Using C compiler: /ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/bin/mpicc -fPIC
-Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -O
${COPTFLAGS} ${CFLAGS}
-----------------------------------------
Using include paths: -I/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/include
-I/ffs/u/u06189/petsc-3.6.1/include -I/ffs/u/u06189/petsc-3.6.1/include
-I/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/include
-----------------------------------------
Using C linker: /ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/bin/mpicc
Using libraries: -Wl,-rpath,/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/lib
-L/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/lib -lpetsc
-Wl,-rpath,/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/lib
-L/ffs/u/u06189/petsc-3.6.1/arch-linux2-c-opt/lib -lf2clapack -lf2cblas -lm
-lX11 -lpthread -lssl -lcrypto -lm -ldl
-----------------------------------------
matrix-after.png
matrix-before.png
