Hi Jed,

Moving to the optimized version of PETSc (built without debugging) basically removed the issue. Thanks a lot!

Benjamin
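For reference, here is a minimal, self-contained sketch of the MatGetVecs/MatMultAdd pattern discussed in the quoted thread below. It is written against the PETSc C interface (the thread uses the Fortran calls), the matrix is only a stand-in with the same shape and sparsity as the one described (1000 x 900 AIJ, at most 2 nonzeros per row; it is not the original operator), and the repeated products are wrapped in a log stage so -log_summary reports them separately from setup. MatGetVecs is the name from the petsc-3.2/3.3 era of this thread; newer releases call it MatCreateVecs (and -log_summary became -log_view).

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, y, z;
  PetscInt       i, rstart, rend, cols[2];
  PetscScalar    vals[2] = {1.0, -1.0};
  PetscLogStage  stage;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Stand-in matrix with the same shape and sparsity as in the thread:
     1000 x 900 AIJ, at most 2 nonzeros per row (not the original operator) */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 1000, 900);CHKERRQ(ierr);
  ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);
  ierr = MatSeqAIJSetPreallocation(A, 2, NULL);CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(A, 2, NULL, 2, NULL);CHKERRQ(ierr);

  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    if (i < 900) {                         /* rows 900..999 stay empty */
      cols[0] = i; cols[1] = (i + 1) % 900;
      ierr = MatSetValues(A, 1, &i, 2, cols, vals, INSERT_VALUES);CHKERRQ(ierr);
    }
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* Vectors with layouts compatible with A: x gets 900 entries, y and z 1000 */
  ierr = MatGetVecs(A, &x, &y);CHKERRQ(ierr);      /* MatCreateVecs in newer PETSc */
  ierr = MatGetVecs(A, NULL, &z);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  ierr = VecSet(y, 2.0);CHKERRQ(ierr);

  /* Time the products in their own stage so -log_summary separates them
     from the assembly above */
  ierr = PetscLogStageRegister("MatMultAdd", &stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  for (i = 0; i < 100; i++) {
    ierr = MatMultAdd(A, x, y, z);CHKERRQ(ierr);   /* z = A*x + y */
  }
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = VecDestroy(&z);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Run with something like mpiexec -n 2 ./a.out -log_summary (or -log_view on current releases) to see the MatMultAdd stage reported separately in the summary.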
On 30 May 2012, at 13:43, Jed Brown wrote:

> On Wed, May 30, 2012 at 2:23 AM, Benjamin Sanderse <B.Sanderse at cwi.nl> wrote:
>
> Sorry for forgetting -log_summary. Attached are log_summary for 1 and 2
> processors, for both a problem with about 1000 unknowns and one with 125000
> unknowns. The summary is for a run of the entire code, which involves many
> MatMults. I hope this still provides insight into what is going on.
> As you can see there is an extraordinary use of MatGetRow - I am working to
> change this - but it should not influence the speed of the MatMults. Any
> thoughts?
>
> 1. What computer is this running on? Specifically, how is its memory
> hierarchy laid out?
> http://www.mcs.anl.gov/petsc/documentation/faq.html#computers
> Can you run the benchmarks in src/benchmarks/streams/?
>
> 2. It's worth heeding this message; the performance will look significantly
> different. If the parallel version is still much slower, please send that
> -log_summary.
>
>   ##########################################################
>   #                                                        #
>   #                       WARNING!!!                       #
>   #                                                        #
>   #   This code was compiled with a debugging option,      #
>   #   To get timing results run ./configure                #
>   #   using --with-debugging=no, the performance will      #
>   #   be generally two or three times faster.              #
>   #                                                        #
>   ##########################################################
>
> Benjamin
>
> ----- Original Message -----
> From: "Jed Brown" <jedbrown at mcs.anl.gov>
> To: "PETSc users list" <petsc-users at mcs.anl.gov>
> Sent: Tuesday, May 29, 2012 5:56:51 PM
> Subject: Re: [petsc-users] MatMult
>
> On Tue, May 29, 2012 at 10:52 AM, Benjamin Sanderse <B.Sanderse at cwi.nl> wrote:
>
> > Hello all,
> >
> > I have a simple question about using MatMult (or MatMultAdd) in parallel.
> >
> > I am performing the matrix-vector multiplication
> >
> > z = A*x + y
> >
> > in my code by using
> >
> > call MatMultAdd(A,x,y,z,ierr); CHKERRQ(ierr)
> >
> > A is a sparse matrix, type MPIAIJ, and x, y, and z have been obtained using
> >
> > call MatGetVecs(A,x,y,ierr); CHKERRQ(ierr)
> > call MatGetVecs(A,PETSC_NULL_OBJECT,z,ierr); CHKERRQ(ierr)
> >
> > x, y, and z are vecs of type mpi.
> >
> > The problem is that in the sequential case the MatMultAdd is MUCH faster
> > than in the parallel case (at least a factor of 100 difference).
>
> 1. Send output of -log_summary
>
> 2. This matrix is tiny (1000x1000) and very sparse (at most 2 nonzeros per
> row), so you should not expect speedup from running in parallel.
>
> > As an example, here is the output with some properties of A when using
> > -mat_view_info and -info:
> >
> > 2 processors:
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374781
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> > [0] MatStashScatterBegin_Private(): No of messages: 0
> > [1] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
> > [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100 unneeded,900 used
> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 450; storage space: 100 unneeded,900 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
> > [0] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode routines
> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
> > [1] Mat_CheckInode(): Found 500 nodes out of 500 rows. Not using Inode routines
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> > [0] MatSetUpMultiply_MPIAIJ(): Using block index set to define scatter
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
> > [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
> > [0] VecScatterCreate(): General case: MPI to Seq
> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0 unneeded,0 used
> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 500 X 0; storage space: 0 unneeded,0 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 0
> > Matrix Object: 2 MPI processes
> >   type: mpiaij
> >   rows=1000, cols=900
> >   total: nonzeros=1800, allocated nonzeros=2000
> >   total number of mallocs used during MatSetValues calls =0
> >     not using I-node (on process 0) routines
> >
> > 1 processor:
> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374783
> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 1000 X 900; storage space: 200 unneeded,1800 used
> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 2
> > [0] Mat_CheckInode(): Found 1000 nodes out of 1000 rows. Not using Inode routines
> > Matrix Object: 1 MPI processes
> >   type: seqaij
> >   rows=1000, cols=900
> >   total: nonzeros=1800, allocated nonzeros=2000
> >   total number of mallocs used during MatSetValues calls =0
> >     not using I-node routines
> >
> > When I look at the partitioning of the vectors, I have the following for
> > the parallel case:
> > x:
> > 0 450
> > 450 900
> > y:
> > 0 500
> > 500 1000
> > z:
> > 0 500
> > 500 1000
> >
> > This seems OK to me.
> >
> > Certainly I am missing something in performing this matrix-vector
> > multiplication efficiently. Any ideas?
> >
> > Best regards,
> >
> > Benjamin

--
Ir. B. Sanderse

Centrum Wiskunde en Informatica
Science Park 123
1098 XG Amsterdam

t: +31 20 592 4161
e: sanderse at cwi.nl
