It would be very helpful if you ran the code on, say, 1, 2, 4, 8, and 16 
processes with the option -log_summary (for example, mpiexec -n 8 
./yourprogram -log_summary) and sent the log summary output as 
attachments.

   Also on the same machine run the streams benchmark; with recent releases of 
PETSc you only need to do 

cd $PETSC_DIR
make streams NPMAX=16 (or whatever your largest process count is)

and send the output.

I suspect that you are doing everything fine and it is more an issue with the 
configuration of your machine. Also read the information at 
http://www.mcs.anl.gov/petsc/documentation/faq.html#computers on "binding"

  Barry

> On Jul 24, 2015, at 10:41 AM, Nelson Filipe Lopes da Silva 
> <[email protected]> wrote:
> 
> Hello,
> 
> I have been using PETSc for a few months now, and it truly is a fantastic 
> piece of software.
> 
> In my particular example I am working with a large, sparse, distributed 
> (MPI AIJ) matrix we can refer to as 'G'.
> G is a horizontal, rectangular matrix (for example, 1.1 million rows by 
> 2.1 million columns). This matrix is commonly very sparse and not 
> diagonally 'heavy' (for example, 5.2 million nonzeros, of which ~50% are 
> in the diagonal block of the MPI AIJ representation).
> To work with this matrix, I also have a few parallel vectors (created 
> using MatCreateVecs), which we can refer to as 'm' and 'k'.
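> 
> For reference, a minimal sketch of this setup (the sizes and 
> preallocation numbers are placeholders, and error checking is omitted):
> 
> #include <petscmat.h>
> int main(int argc, char **argv)
> {
>   Mat      G;
>   Vec      m, k, b;
>   PetscInt M = 1100000, N = 2100000;               /* placeholder sizes */
> 
>   PetscInitialize(&argc, &argv, NULL, NULL);
>   MatCreate(PETSC_COMM_WORLD, &G);
>   MatSetSizes(G, PETSC_DECIDE, PETSC_DECIDE, M, N);
>   MatSetType(G, MATMPIAIJ);
>   MatMPIAIJSetPreallocation(G, 3, NULL, 2, NULL);  /* guessed nnz/row */
>   /* ... fill G with MatSetValues(), then assemble ... */
>   MatAssemblyBegin(G, MAT_FINAL_ASSEMBLY);
>   MatAssemblyEnd(G, MAT_FINAL_ASSEMBLY);
>   MatCreateVecs(G, &m, &b);   /* m: column layout of G; b: row layout */
>   VecDuplicate(b, &k);
> 
>   VecDestroy(&m); VecDestroy(&k); VecDestroy(&b);
>   MatDestroy(&G);
>   PetscFinalize();
>   return 0;
> }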
> I am trying to parallelize an iterative algorithm in which the most 
> computationally heavy operations are:
> 
> ->Matrix-Vector Multiplication, more precisely G * m + k = b 
> (MatMultAdd). From what I have been reading, to achieve a good speedup 
> in this operation G should be as diagonal as possible, so that 
> communication can be overlapped with computation. But even when using a 
> G matrix in which the diagonal block has ~95% of the nonzeros, I cannot 
> get a decent speedup. Most of the time, the performance even gets worse.
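> 
> In code this is a single call; a minimal sketch reusing the names above:
> 
> /* b = G*m + k. PETSc overlaps the communication of the off-process 
>    entries of m with the multiply by the local diagonal block, which is 
>    why a large diagonal block is normally what you want here. */
> MatMultAdd(G, m, k, b);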
> 
> ->Matrix-Matrix Multiplication; in this case I need to perform G * G' = 
> A, where A is later used in the linear solver and G' is the transpose of 
> G. The speedup of this operation is not as bad, although it is still not 
> very good.
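> 
> One way to form A (a sketch; depending on the PETSc version and matrix 
> type, MatMatTransposeMult() may also do this in one call):
> 
> Mat Gt, A;
> MatTranspose(G, MAT_INITIAL_MATRIX, &Gt);                  /* Gt = G' */
> MatMatMult(G, Gt, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &A);  /* A = G*G' */
> MatDestroy(&Gt);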
> 
> ->Linear problem solving. Lastly, in this operation I solve "Ax=b" using 
> the results of the last two operations. I tried to apply an RCM 
> permutation to A to make it more diagonal, for better performance. 
> However, the problem I faced was that the permutation is performed 
> locally on each processor, and thus the final result differs with 
> different numbers of processors. I assume this was intended to reduce 
> communication. The solution I found was (see the sketch after this 
> list):
> 1. calculate A;
> 2. calculate, locally on one machine, the RCM permutation IS using A;
> 3. apply this permutation to the rows of G.
> This works well, and A is generated as if RCM-permuted. It is fine to do 
> this operation on one machine because it is only done once, while 
> reading the input. However, the nonzeros of G become more spread out and 
> less diagonal, causing problems when calculating G * m + k = b.
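> 
> A sketch of steps 2 and 3 with the PETSc calls involved (assuming A has 
> been gathered onto one process as a sequential matrix; the MatSetValues 
> line is illustration only):
> 
> IS              rcm, ic, inv;
> const PetscInt *newrow;
> 
> /* step 2: RCM ordering of A; rcm[i] is the old row placed at new 
>    position i, so invert it to map old row -> new row */
> MatGetOrdering(A, MATORDERINGRCM, &rcm, &ic);
> ISInvertPermutation(rcm, PETSC_DECIDE, &inv);
> ISGetIndices(inv, &newrow);
> 
> /* step 3: while reading the input, insert what was row i of G at row 
>    newrow[i], e.g. 
>    MatSetValues(G, 1, &newrow[i], ncols, cols, vals, INSERT_VALUES); */
> 
> ISRestoreIndices(inv, &newrow);
> ISDestroy(&rcm); ISDestroy(&ic); ISDestroy(&inv);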
> 
> These 3 operations (except the permutation) are performed in each iteration 
> of my algorithm.
> 
> So, my questions are:
> -What are the characteristics of G that lead to a good speedup in the 
> operations I described? Am I missing something and too obsessed with the 
> diagonal block?
> 
> -Is there a better way to permute A without permuting G and still get 
> the same result using 1 or N machines?
> 
> 
> I have been avoiding asking for help for a while. I'm very sorry for the long 
> email.
> Thank you very much for your time.
> Best Regards,
> Nelson
