Hello,

I have been using PETSc for a few months now, and it truly is fantastic piece of software.

In my particular example I am working with a large, sparse distributed (MPI AIJ) matrix we can refer as 'G'. G is a horizontal - retangular matrix (for example, 1,1 Million rows per 2,1 Million columns). This matrix is commonly very sparse and not diagonal 'heavy' (for example 5,2 Million nnz in which ~50% are on the diagonal block of MPI AIJ representation). To work with this matrix, I also have a few parallel vectors (created using MatCreate Vec), we can refer as 'm' and 'k'. I am trying to parallelize an iterative algorithm in which the most computational heavy operations are:

->Matrix-Vector Multiplication, more precisely G * m + k = b (MatMultAdd). From what I have been reading, to achive a good speedup in this operation, G should be as much diagonal as possible, due to overlapping communication and computation. But even when using a G matrix in which the diagonal block has ~95% of the nnz, I cannot get a decent speedup. Most of the times, the performance even gets worse.

->Matrix-Matrix Multiplication, in this case I need to perform G * G' = A, where A is later used on the linear solver and G' is transpose of G. The speedup in this operation is not worse, although is not very good.

->Linear problem solving. Lastly, In this operation I compute "Ax=b" from the last two operations. I tried to apply a RCM permutation to A to make it more diagonal, for better performance. However, the problem I faced was that, the permutation is performed locally in each processor and thus, the final result is different with different number of processors. I assume this was intended to reduce communication. The solution I found was
1-calculate A
2-calculate, localy to 1 machine, the RCM permutation IS using A
3-apply this permutation to the lines of G.
This works well, and A is generated as if RCM permuted. It is fine to do this operation in one machine because it is only done once while reading the input. The nnz of G become more spread and less diagonal, causing problems when calculating G * m + k = b.

These 3 operations (except the permutation) are performed in each iteration of my algorithm.

So, my questions are.
-What are the characteristics of G that lead to a good speedup in the operations I described? Am I missing something and too much obsessed with the diagonal block?

-Is there a better way to permute A without permute G and still get the same result using 1 or N machines?


I have been avoiding asking for help for a while. I'm very sorry for the long email.
Thank you very much for your time.
Best Regards,
Nelson

Reply via email to