Happy to (though Piz Daint goes down for an extended upgrade on Oct 17 so would need to be run before then)!
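In case it is useful as a starting point, below is a rough, untested sketch of the kind of PETSc-free test Barry describes in the quoted thread: duplicate a communicator, attach an attribute to it once, and then time a large number of attribute lookups on every rank. The iteration count, the attribute value, and the use of the non-deprecated MPI_Comm_get_attr() name (MPI_Attr_get() is the older alias mentioned below) are my own placeholder assumptions, not anything prescribed in the thread.

    /* attrtest.c -- rough sketch, not from the thread: time repeated MPI
     * attribute lookups on a duplicated communicator, with no PETSc at all. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      MPI_Comm dup;
      int      keyval, rank, flag, i, n = 100000;   /* placeholder loop count */
      int      value = 42, *attr;                   /* placeholder attribute  */
      double   t0, t1;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Do once what a communicator-duplication routine would do the first
         time: duplicate the communicator and attach an attribute via a keyval. */
      MPI_Comm_dup(MPI_COMM_WORLD, &dup);
      MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, MPI_COMM_NULL_DELETE_FN, &keyval, NULL);
      MPI_Comm_set_attr(dup, keyval, &value);

      /* Time many attribute lookups, the operation under suspicion. */
      MPI_Barrier(MPI_COMM_WORLD);
      t0 = MPI_Wtime();
      for (i = 0; i < n; i++) {
        /* MPI_Attr_get() is the older name for this call. */
        MPI_Comm_get_attr(dup, keyval, &attr, &flag);
        if (!flag || *attr != 42) {
          fprintf(stderr, "[%d] attribute lookup failed at iteration %d\n", rank, i);
          break;
        }
      }
      t1 = MPI_Wtime();
      printf("[%d] %d attribute lookups took %g s\n", rank, n, t1 - t0);

      MPI_Comm_free_keyval(&keyval);
      MPI_Comm_free(&dup);
      MPI_Finalize();
      return 0;
    }

If something along these lines reproduces the root/non-root asymmetry, it could be built with the Cray compiler wrapper and run at the same process counts as in the logs below; if it does not, the slow path is presumably elsewhere.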
On Thu, Oct 6, 2016 at 6:03 PM, Matthew Knepley <[email protected]> wrote:
> On Thu, Oct 6, 2016 at 10:55 AM, Barry Smith <[email protected]> wrote:
>>
>>   Matt,
>>
>>     Thanks for this information. It sure looks like there is something
>> seriously wrong with the MPI_Attr_get() on the Cray for non-root processes.
>> Does any PETSc developer have access to such a machine? We need to write a
>> test program that just calls MPI_Attr_get a bunch of times (no PETSc) to see
>> if we can reproduce the problem and report it to Cray.
>
> Barry, if you write it, we can give it to Patrick Sanan to run.
>
> Thanks,
>
>    Matt
>
>>
>>   Barry
>>
>> On Oct 6, 2016, at 10:45 AM, Matthew Overholt <[email protected]> wrote:
>> >
>> > Matthew and Barry,
>> >
>> > 1) I did a direct measurement of PetscCommDuplicate() time by tracing just
>> > that call (using CrayPat), and confirmed the sampling results. For 8
>> > processes (n=8), tracing counted a total of 101 calls, taking ~0 time on the
>> > root process but taking 11.78 seconds (6.3% of 188 total seconds) on each of
>> > the other 7 processes. For 16 processes (n=16, still only 1 node), tracing
>> > counted 102 total calls for a total of 18.42 seconds (13.2% of 139.6 total
>> > seconds) on every process except the root.
>> >
>> > 2) Copied below is a section of the log view for the first two solutions for
>> > n=2, which shows the same calls as for n=8. (I can send the entire log files
>> > if desired.) In each case I count about 44 PetscCommDuplicate() calls per
>> > process during initialization and meshing, 7 calls during setup, 9 calls for
>> > the first solution, then 3 calls for each subsequent solution (fixed-point
>> > iteration), and 3 calls to write out the solution, for 75 total.
>> >
>> > 3) I would expect that the administrators of this machine have configured
>> > PETSc appropriately. I am using their current default install, which is 3.7.2.
>> > https://www.nersc.gov/users/software/programming-libraries/math-libraries/petsc/
>> >
>> > 4) Yes, I just gave the MUMPS time as a comparison.
>> >
>> > 5) As to where it is spending time, perhaps the timing results in the log
>> > files will be helpful. The "Solution took ..." printouts give the total
>> > solution time for that iteration; the others are incremental times. (As an
>> > aside, I have been wondering why the solution times do not scale well with
>> > process count, even though that work is entirely done in parallel PETSc
>> > routines.)
>> >
>> > Thanks,
>> > Matt Overholt
>> >
>> > ********** -log_view -info results for n=2 : the first solution and subsequent fixed-point iteration ***********
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374779
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374779
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374781
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > Matrix setup took 0.108 s
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374779
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374781
>> > KSP PC setup took 0.079 s
>> > [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
>> > [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
>> > [0] MatStashScatterBegin_Ref(): No of messages: 0
>> > [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>> > [1] MatStashScatterBegin_Ref(): No of messages: 1
>> > [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
>> > [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 5 mallocs.
>> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage space: 1050106 unneeded,15128672 used
>> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>> > [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 599214) < 0.6. Do not use CompressedRow routines.
>> > [1] MatSeqAIJCheckInode(): Found 599214 nodes out of 599214 rows. Not using Inode routines
>> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage space: 1237634 unneeded,15545404 used
>> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>> > [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 621594) < 0.6. Do not use CompressedRow routines.
>> > [0] MatSeqAIJCheckInode(): Found 621594 nodes out of 621594 rows. Not using Inode routines
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
>> > [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
>> > [0] VecScatterCreate(): General case: MPI to Seq
>> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space: 5257543 unneeded,136718 used
>> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 89
>> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
>> > [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
>> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space: 5464978 unneeded,136718 used
>> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 490
>> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
>> > [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
>> > K and q SetValues took 26.426 s
>> > [0] PCSetUp(): Setting up PC for first time
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [0] VecScatterCreate(): Special case: processor zero gets entire parallel vector, rest get none
>> > ** Max-trans not allowed because matrix is distributed
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [0] PCSetUp(): Leaving PC with identical preconditioner since operator is unchanged
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
>> > [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
>> > [0] VecScatterCreate(): General case: Seq to MPI
>> > [1] VecScatterCreate(): General case: Seq to MPI
>> > Solution took 102.21 s
>> >
>> > NL iteration 0: delta = 32.0488 67.6279.
>> > Error delta calc took 0.045 s
>> > Node and Element temps update took 0.017 s
>> > [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
>> > [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
>> > [0] MatStashScatterBegin_Ref(): No of messages: 0
>> > [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>> > [1] MatStashScatterBegin_Ref(): No of messages: 1
>> > [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
>> > [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 0 mallocs.
>> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage space: 0 unneeded,15128672 used
>> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>> > [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 599214) < 0.6. Do not use CompressedRow routines.
>> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage space: 0 unneeded,15545404 used
>> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>> > [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 621594) < 0.6. Do not use CompressedRow routines.
>> > [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space: 0 unneeded,136718 used
>> > [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
>> > [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
>> > [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space: 0 unneeded,136718 used
>> > [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>> > [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
>> > [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
>> > K and q SetValues took 2.366 s
>> > [0] PCSetUp(): Setting up PC with same nonzero pattern
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>> > [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>> > [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
>> > [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
>> > [0] VecScatterCreate(): General case: Seq to MPI
>> > [1] VecScatterCreate(): General case: Seq to MPI
>> > Solution took 82.156 s
>> >
>> > -----Original Message-----
>> > From: Barry Smith [mailto:[email protected]]
>> > Sent: Wednesday, October 05, 2016 4:42 PM
>> > To: [email protected]
>> > Cc: [email protected]
>> > Subject: Re: [petsc-users] large PetscCommDuplicate overhead
>> >
>> >
>> >> On Oct 5, 2016, at 2:30 PM, Matthew Overholt <[email protected]> wrote:
>> >>
>> >> Hi Petsc-Users,
>> >>
>> >> I am trying to understand an issue where PetscCommDuplicate() calls are
>> >> taking an increasing percentage of time as I run a fixed-sized problem on
>> >> more processes.
>> >>
>> >> I am using the FEM to solve the steady-state heat transfer equation
>> >> (K.x = q) using a PC direct solver, like MUMPS.
>> >>
>> >> I am running on the NERSC Cray XC30, which has two Xeons per node with
>> >> 12 cores each, and profiling the code using CrayPat sampling.
>> >>
>> >> On a typical problem (1E+6 finite elements), running on a single node:
>> >> -for 2 cores (1 on each Xeon), about 1% of time is PetscCommDuplicate (on
>> >> process 1, but on the root it is less), and (for reference) 9% of total
>> >> time is for MUMPS.
>> >> -for 8 cores (4 on each Xeon), over 6% of time is PetscCommDuplicate (on
>> >> every process except the root, where it is <1%), and 9-10% of total time
>> >> is for MUMPS.
>> >
>> >    What does PetscCommDuplicate() have to do with MUMPS? Nothing at all, you
>> > are just giving its time for comparison?
>> >
>> >>
>> >> What is the large PetscCommDuplicate time connected to, an increasing
>> >> number of messages (tags)? Would using fewer MatSetValues() and
>> >> VecSetValues() calls (with longer message lengths) alleviate this?
>> >
>> >    No, PetscCommDuplicate() won't increase with more messages or calls to
>> > XXXSetValues(). PetscCommDuplicate() is only called essentially on the
>> > creation of new PETSc objects. It should also be fast since it basically
>> > needs to do just an MPI_Attr_get(). With more processes but the same problem
>> > size and code there should be pretty much the same number of objects created.
>> >
>> >    PetscSpinlockLock() does nothing if you are not using threads so it won't
>> > take any time.
>> >
>> >    Is there a way to see where it is spending its time inside the
>> > PetscCommDuplicate()? Perhaps the Cray MPI_Attr_get() has issues.
>> >
>> >    Barry
>> >
>> >
>> >>
>> >> For reference, the PETSc calling sequence in the code is as follows.
>> >>     // Create the solution and RHS vectors
>> >>     ierr = VecCreate(petscData->mpicomm,&mesh->hpx);
>> >>     ierr = PetscObjectSetName((PetscObject) mesh->hpx, "Solution");
>> >>     ierr = VecSetSizes(mesh->hpx,mesh->lxN,mesh->neqns); // size = # of equations; distribution to match mesh
>> >>     ierr = VecSetFromOptions(mesh->hpx);   // allow run time options
>> >>     ierr = VecDuplicate(mesh->hpx,&q);     // create the RHS vector
>> >>     // Create the stiffness matrix
>> >>     ierr = MatCreate(petscData->mpicomm,&K);
>> >>     ierr = MatSetSizes(K,mesh->lxN,mesh->lxN,mesh->neqns,mesh->neqns);
>> >>     ierr = MatSetType(K,MATAIJ);           // default sparse type
>> >>     // Do preallocation
>> >>     ierr = MatMPIAIJSetPreallocation(K,d_nz,NULL,o_nz,NULL);
>> >>     ierr = MatSeqAIJSetPreallocation(K,d_nz,NULL);
>> >>     ierr = MatSetOption(K,MAT_NEW_NONZERO_ALLOCATION_ERR,PETSC_FALSE);
>> >>     ierr = MatSetUp(K);
>> >>     // Create and set up the KSP context as a PreConditioner Only (Direct) Solution
>> >>     ierr = KSPCreate(petscData->mpicomm,&ksp);
>> >>     ierr = KSPSetOperators(ksp,K,K);
>> >>     ierr = KSPSetType(ksp,KSPPREONLY);
>> >>     // Set the temperature vector
>> >>     ierr = VecSet(mesh->hpx,mesh->Tmin);
>> >>     // Set the default PC method as MUMPS
>> >>     ierr = KSPGetPC(ksp,&pc);    // extract the preconditioner
>> >>     ierr = PCSetType(pc,PCLU);   // set pc options
>> >>     ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS);
>> >>     ierr = KSPSetFromOptions(ksp);
>> >>
>> >>     // Set the values for the K matrix and q vector,
>> >>     // which involves a lot of these calls
>> >>     ierr = MatSetValues(K,mrows,idxm,ncols,idxn,pKe,ADD_VALUES);  // 1 call per matrix row (equation)
>> >>     ierr = VecSetValues(q,nqe,ixn,pqe,ADD_VALUES);                // 1 call per element
>> >>     ierr = VecAssemblyBegin(q);
>> >>     ierr = MatAssemblyBegin(K,MAT_FINAL_ASSEMBLY);
>> >>     ierr = VecAssemblyEnd(q);
>> >>     ierr = MatAssemblyEnd(K,MAT_FINAL_ASSEMBLY);
>> >>
>> >>     // Solve
>> >>     //////////////////////////////////////
>> >>     ierr = KSPSolve(ksp,q,mesh->hpx);
>> >>     ...
>> >> *Note that the code evenly divides the finite elements over the total
>> >> number of processors, and I am using ghosting of the FE vertices vector to
>> >> handle the vertices that are needed on more than 1 process.
>> >>
>> >> Thanks in advance for your help,
>> >> Matt Overholt
>> >> CapeSym, Inc.
>> >>
>> >> Virus-free. www.avast.com
>> >
>> > ---
>> > This email has been checked for viruses by Avast antivirus software.
>> > https://www.avast.com/antivirus
>> >
>
>
> --
> What most experimenters take for granted before they begin their experiments
> is infinitely more interesting than any results to which their experiments
> lead.
> -- Norbert Wiener
