[petsc-users] Scalability issue

Nelson Filipe Lopes da Silva Fri, 24 Jul 2015 08:42:08 -0700

Hello,

I have been using PETSc for a few months now, and it truly is a fantastic piece of software.

In my particular example I am working with a large, sparse, distributed (MPI AIJ) matrix that we can refer to as 'G'. G is a horizontal, rectangular matrix (for example, 1.1 million rows by 2.1 million columns). This matrix is typically very sparse and not diagonal 'heavy' (for example, 5.2 million nnz, of which ~50% are in the diagonal block of the MPI AIJ representation).

To work with this matrix, I also have a few parallel vectors (created using MatCreateVecs), which we can refer to as 'm' and 'k'.

I am trying to parallelize an iterative algorithm in which the most computationally heavy operations are:

->Matrix-Vector Multiplication, more precisely G * m + k = b (MatMultAdd). From what I have been reading, to achieve a good speedup in this operation, G should be as diagonal as possible, so that communication can overlap with computation. But even when using a G matrix in which the diagonal block has ~95% of the nnz, I cannot get a decent speedup. Most of the time, the performance even gets worse.

->Matrix-Matrix Multiplication. In this case I need to perform G * G' = A, where A is later used in the linear solver and G' is the transpose of G. The performance of this operation does not get worse, although the speedup is not very good.

->Linear problem solving. Lastly, in this operation I solve "Ax=b" using the results of the last two operations. I tried to apply an RCM permutation to A to make it more diagonal, for better performance. However, the problem I faced was that the permutation is performed locally on each processor, and thus the final result differs with the number of processors. I assume this was intended to reduce communication. The solution I found was:

1-calculate A
2-calculate, locally on 1 machine, the RCM permutation IS using A
3-apply this permutation to the rows of G.

This works well, and A is generated as if it had been RCM permuted. It is fine to do this operation on one machine because it is only done once, while reading the input. However, the nnz of G become more spread out and less diagonal, causing problems when calculating G * m + k = b.
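To illustrate why permuting only the rows of G is enough to obtain the permuted A, here is a minimal sketch in plain Python (not PETSc code). The small matrix G, the permutation `perm`, and all helper names are made up for this example; it only checks the identity (P G)(P G)' = P (G G') P' on dense lists.

```python
def transpose(X):
    """Return the transpose of a dense matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    """Dense matrix product X * Y using plain Python lists."""
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def permute_rows(X, perm):
    """Row i of the result is row perm[i] of X (this is P * X)."""
    return [X[p] for p in perm]

def permute_symmetric(A, perm):
    """Apply the same permutation to rows and columns of A (this is P * A * P')."""
    return [[A[p][q] for q in perm] for p in perm]

# Hypothetical 3 x 5 "horizontal" matrix and a hypothetical row permutation.
G = [[1, 0, 2, 0, 0],
     [0, 3, 0, 0, 1],
     [4, 0, 0, 5, 0]]
perm = [2, 0, 1]

A = matmul(G, transpose(G))             # A = G * G'
PG = permute_rows(G, perm)              # permute only the rows of G
A_from_PG = matmul(PG, transpose(PG))   # (P G) * (P G)'

# Both routes give the same matrix.
same = (A_from_PG == permute_symmetric(A, perm))
```

Note that the row permutation leaves the columns of G untouched, which is consistent with the observation above that the nonzeros of the permuted G can end up further from the diagonal block.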

These 3 operations (except the permutation) are performed in each iteration of my algorithm.
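As a rough way to quantify the "diagonal block" share discussed above, the following plain Python sketch (not PETSc code) counts how many nonzeros of a rectangular matrix fall in the diagonal block for a given number of processes. It assumes contiguous row and column ownership ranges split as evenly as possible, with the first ranks taking one extra index, which is my understanding of the default layout; the function names and the tiny example pattern are hypothetical.

```python
from bisect import bisect_right

def ownership_starts(n, nprocs):
    """Start offsets of nprocs contiguous ranges covering n indices.

    Assumption: the first (n % nprocs) ranks own one extra index.
    """
    base, extra = divmod(n, nprocs)
    starts = [0]
    for rank in range(nprocs):
        starts.append(starts[-1] + base + (1 if rank < extra else 0))
    return starts

def owner(index, starts):
    """Rank whose range [starts[r], starts[r + 1]) contains index."""
    return bisect_right(starts, index) - 1

def diagonal_block_fraction(entries, n_rows, n_cols, nprocs):
    """Fraction of nonzeros (i, j) whose column range belongs to the row's owner."""
    row_starts = ownership_starts(n_rows, nprocs)
    col_starts = ownership_starts(n_cols, nprocs)
    local = sum(1 for i, j in entries
                if owner(i, row_starts) == owner(j, col_starts))
    return local / len(entries)

# Hypothetical 4 x 8 sparsity pattern given as (row, column) pairs.
entries = [(0, 0), (0, 5), (1, 1), (2, 6), (3, 7), (3, 2)]

frac_1 = diagonal_block_fraction(entries, 4, 8, 1)  # single process: everything is local
frac_2 = diagonal_block_fraction(entries, 4, 8, 2)  # two processes: 4 of 6 entries are local
```

For a rectangular matrix the column ranges follow the layout of the input vector (here 'm'), not the row layout, so the fraction depends on both splits and changes with the number of processes.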


So, my questions are:

-What are the characteristics of G that lead to a good speedup in the operations I described? Am I missing something, or am I too obsessed with the diagonal block?

-Is there a better way to permute A without permuting G and still get the same result using 1 or N machines?

I have been avoiding asking for help for a while. I'm very sorry for the long email.

Thank you very much for your time.
Best Regards,
Nelson
