It would be very helpful if you ran the code on, say, 1, 2, 4, 8, and 16 processes with the option -log_summary and sent us (as attachments) the log summary output.
Also, on the same machine, run the streams benchmark; with recent releases of PETSc you only need to do

   cd $PETSC_DIR
   make streams NPMAX=16

(or whatever your largest process count is) and send the output. I suspect that you are doing everything fine and it is more an issue with the configuration of your machine. Also read the information at http://www.mcs.anl.gov/petsc/documentation/faq.html#computers on "binding".

  Barry

> On Jul 24, 2015, at 10:41 AM, Nelson Filipe Lopes da Silva <[email protected]> wrote:
>
> Hello,
>
> I have been using PETSc for a few months now, and it truly is a fantastic piece of software.
>
> In my particular example I am working with a large, sparse, distributed (MPIAIJ) matrix we can refer to as 'G'.
> G is a horizontal, rectangular matrix (for example, 1.1 million rows by 2.1 million columns). This matrix is commonly very sparse and not diagonally 'heavy' (for example, 5.2 million nnz, of which ~50% are in the diagonal block of the MPIAIJ representation).
> To work with this matrix, I also have a few parallel vectors (created using MatCreateVecs), which we can refer to as 'm' and 'k'.
> I am trying to parallelize an iterative algorithm in which the most computationally heavy operations are:
>
> -> Matrix-vector multiplication, more precisely G * m + k = b (MatMultAdd; see the first sketch after this message). From what I have been reading, to achieve good speedup in this operation, G should be as diagonal as possible, due to overlapping communication and computation. But even when using a G matrix in which the diagonal block has ~95% of the nnz, I cannot get a decent speedup. Most of the time, the performance even gets worse.
>
> -> Matrix-matrix multiplication; in this case I need to perform G * G' = A, where A is later used in the linear solver and G' is the transpose of G (see the second sketch after this message). The speedup in this operation does not get worse, although it is not very good.
>
> -> Linear problem solving. Lastly, in this operation I solve "Ax=b", using the A and b computed by the last two operations (see the last sketch after this message). I tried to apply an RCM permutation to A to make it more diagonal, for better performance. However, the problem I faced was that the permutation is performed locally on each processor, and thus the final result differs with different numbers of processors. I assume this was intended to reduce communication. The solution I found was
> 1. calculate A;
> 2. calculate, locally on one machine, the RCM permutation IS using A (see the ordering sketch after this message);
> 3. apply this permutation to the rows of G.
> This works well, and A is generated as if RCM-permuted. It is fine to do this operation on one machine because it is only done once, while reading the input. The nnz of G become more spread out and less diagonal, causing problems when calculating G * m + k = b.
>
> These 3 operations (except the permutation) are performed in each iteration of my algorithm.
>
> So, my questions are:
> - What are the characteristics of G that lead to good speedup in the operations I described? Am I missing something, and am I too obsessed with the diagonal block?
>
> - Is there a better way to permute A, without permuting G, and still get the same result using 1 or N machines?
>
> I have been avoiding asking for help for a while. I'm very sorry for the long email.
> Thank you very much for your time.
> Best Regards,
> Nelson
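
[Sketch 1] For concreteness, a minimal sketch of the matrix-vector step described above. The names G, m, k, and b come from the message; the assembly of G and the creation of the vectors (e.g. with MatCreateVecs) are assumed to have happened already, so this illustrates the call rather than reproducing Nelson's actual code.

    #include <petscmat.h>

    /* Sketch: b = G*m + k via MatMultAdd. G is an assembled MPIAIJ
       matrix; m conforms to the columns of G, k and b to its rows. */
    PetscErrorCode MultStep(Mat G, Vec m, Vec k, Vec b)
    {
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      /* MatMultAdd(A,x,y,z) computes z = A*x + y. For MPIAIJ, the
         off-process entries of m are scattered while the diagonal
         block is multiplied; this is the communication/computation
         overlap mentioned in the message. */
      ierr = MatMultAdd(G, m, k, b); CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }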
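
[Sketch 2] For the product A = G*G', one route that works for MPIAIJ is an explicit transpose followed by MatMatMult; this is a hedged sketch, not necessarily what Nelson runs (MatMatTransposeMult also exists, but its support depends on matrix format and PETSc version).

    #include <petscmat.h>

    /* Sketch: A = G*G' via an explicit transpose. */
    PetscErrorCode FormProduct(Mat G, Mat *A)
    {
      Mat            Gt;
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      ierr = MatTranspose(G, MAT_INITIAL_MATRIX, &Gt); CHKERRQ(ierr);
      ierr = MatMatMult(G, Gt, MAT_INITIAL_MATRIX, PETSC_DEFAULT, A); CHKERRQ(ierr);
      ierr = MatDestroy(&Gt); CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

If G keeps the same nonzero pattern across iterations, the MAT_REUSE_MATRIX variants of both calls avoid repeating the symbolic phase every iteration.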
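
[Ordering sketch] Step 2 of the permutation workflow might look like the following. The sketch assumes Aseq is a *sequential* AIJ copy of A assembled on one process (how that copy is gathered is not shown), since for AIJ matrices MatGetOrdering computes the ordering locally.

    #include <petscmat.h>

    /* Sketch: RCM ordering of a sequential copy of A. The resulting
       IS lists the original row indices in their new order and can be
       used to renumber the rows of G while reading the input. */
    PetscErrorCode ComputeRCM(Mat Aseq, IS *rowperm)
    {
      IS             rperm, cperm;
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      ierr = MatGetOrdering(Aseq, MATORDERINGRCM, &rperm, &cperm); CHKERRQ(ierr);
      /* RCM is a symmetric ordering, so the row and column
         permutations coincide; keep one and discard the other. */
      ierr = ISDestroy(&cperm); CHKERRQ(ierr);
      *rowperm = rperm;
      PetscFunctionReturn(0);
    }

This trick works because permuting only the rows of G by P turns the product into (PG)(PG)' = P(GG')P' = PAP', i.e. exactly the symmetrically RCM-permuted A.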
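
[Solver sketch] Finally, a minimal sketch of the Ax=b step, with the Krylov method and preconditioner left to the options database. In practice one would create the KSP once outside the iteration loop and only reset the operators each time, rather than rebuilding it per solve as shown here; x is assumed to be a conforming parallel vector.

    #include <petscksp.h>

    /* Sketch: solve A x = b. A changes every outer iteration, so the
       operators are (re)set before each solve. */
    PetscErrorCode SolveStep(Mat A, Vec b, Vec x)
    {
      KSP            ksp;
      PetscErrorCode ierr;

      PetscFunctionBeginUser;
      ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
      ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
      ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr); /* e.g. -ksp_type cg -pc_type bjacobi */
      ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
      ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }

Since A = GG' is symmetric positive semidefinite, a CG-type method is a natural first thing to try at the command line.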
