Attached is a simple test code. I had no luck getting it to behave badly.
Barry
Attachment: ex3.c (binary data)
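For reference, a standalone test of the kind requested in the quoted thread below (timing repeated MPI_Attr_get() lookups of a cached communicator attribute, with no PETSc involved) might look roughly like the sketch that follows. It is an illustration only, not the attached ex3.c; the loop count and cached value are arbitrary.

/* Illustrative sketch only (not the attached ex3.c): time many
 * MPI_Attr_get() lookups of a communicator attribute, which is
 * essentially what PetscCommDuplicate() pays for on each object
 * creation. Build with the system MPI compiler wrapper (e.g. cc on
 * the Cray, or mpicc). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int    rank, keyval, flag, i, n = 100000;   /* arbitrary repeat count */
  int   *val, dummy = 42;                     /* arbitrary cached value */
  double t0, t1;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Cache an attribute on the communicator, as PETSc does for its inner
     communicator. (MPI_Comm_create_keyval()/MPI_Comm_get_attr() are the
     non-deprecated equivalents of these calls.) */
  MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, &keyval, NULL);
  MPI_Attr_put(MPI_COMM_WORLD, keyval, &dummy);

  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (i = 0; i < n; i++) {
    MPI_Attr_get(MPI_COMM_WORLD, keyval, &val, &flag);
    if (!flag) { fprintf(stderr, "[%d] attribute not found\n", rank); break; }
  }
  t1 = MPI_Wtime();

  printf("[%d] %d MPI_Attr_get() calls took %g seconds\n", rank, n, t1 - t0);

  MPI_Keyval_free(&keyval);
  MPI_Finalize();
  return 0;
}

If the Cray MPI_Attr_get() were at fault, one would expect the time reported by the non-root ranks to grow with the number of ranks per node, which is the pattern described in the thread below.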
> On Oct 6, 2016, at 12:06 PM, Patrick Sanan <[email protected]> wrote:
>
> Happy to (though Piz Daint goes down for an extended upgrade on Oct 17,
> so it would need to be run before then)!
>
> On Thu, Oct 6, 2016 at 6:03 PM, Matthew Knepley <[email protected]> wrote:
>> On Thu, Oct 6, 2016 at 10:55 AM, Barry Smith <[email protected]> wrote:
>>>
>>> Matt,
>>>
>>> Thanks for this information. It sure looks like there is something
>>> seriously wrong with MPI_Attr_get() on the Cray for non-root processes.
>>> Does any PETSc developer have access to such a machine? We need to write
>>> a test program that just calls MPI_Attr_get() a bunch of times (no PETSc)
>>> to see if we can reproduce the problem and report it to Cray.
>>
>> Barry, if you write it, we can give it to Patrick Sanan to run.
>>
>> Thanks,
>>
>> Matt
>>
>>> Barry
>>>
>>> On Oct 6, 2016, at 10:45 AM, Matthew Overholt <[email protected]> wrote:
>>>>
>>>> Matthew and Barry,
>>>>
>>>> 1) I did a direct measurement of PetscCommDuplicate() time by tracing just
>>>> that call (using CrayPat), and confirmed the sampling results. For 8
>>>> processes (n=8), tracing counted a total of 101 calls, taking ~0 time on the
>>>> root process but taking 11.78 seconds (6.3% of 188 total seconds) on each of
>>>> the other 7 processes. For 16 processes (n=16, still only 1 node), tracing
>>>> counted 102 total calls for a total of 18.42 seconds (13.2% of 139.6 total
>>>> seconds) on every process except the root.
>>>>
>>>> 2) Copied below is a section of the log view for the first two solutions
>>>> for n=2, which shows the same calls as for n=8. (I can send the entire log
>>>> files if desired.) In each case I count about 44 PetscCommDuplicate (PCD)
>>>> calls per process during initialization and meshing, 7 calls during setup,
>>>> 9 calls for the first solution, then 3 calls for each subsequent solution
>>>> (fixed-point iteration), and 3 calls to write out the solution, for 75 total.
>>>>
>>>> 3) I would expect that the administrators of this machine have configured
>>>> PETSc appropriately. I am using their current default install, which is 3.7.2.
>>>>
>>>> https://www.nersc.gov/users/software/programming-libraries/math-libraries/petsc/
>>>>
>>>> 4) Yes, I just gave the MUMPS time as a comparison.
>>>>
>>>> 5) As to where it is spending time, perhaps the timing results in the log
>>>> files will be helpful. The "Solution took ..." printouts give the total
>>>> solution time for that iteration; the others are incremental times. (As an
>>>> aside, I have been wondering why the solution times do not scale well with
>>>> process count, even though that work is entirely done in parallel PETSc
>>>> routines.)
>>>>
>>>> Thanks,
>>>> Matt Overholt
>>>>
>>>> ********** -log_view -info results for n=2: the first solution and
>>>> subsequent fixed-point iteration ***********
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374779
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374779
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374781
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> Matrix setup took 0.108 s
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374779
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850688 -2080374781
>>>> KSP PC setup took 0.079 s
>>>> [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
>>>> [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
>>>> [0] MatStashScatterBegin_Ref(): No of messages: 0
>>>> [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>>>> [1] MatStashScatterBegin_Ref(): No of messages: 1
>>>> [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
>>>> [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 5 mallocs.
>>>> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage space: 1050106 unneeded,15128672 used
>>>> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>>>> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>>>> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 599214) < 0.6. Do not use CompressedRow routines.
>>>> [1] MatSeqAIJCheckInode(): Found 599214 nodes out of 599214 rows. Not using Inode routines
>>>> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage space: 1237634 unneeded,15545404 used
>>>> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>>>> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>>>> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 621594) < 0.6. Do not use CompressedRow routines.
>>>> [0] MatSeqAIJCheckInode(): Found 621594 nodes out of 621594 rows. Not using Inode routines
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
>>>> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
>>>> [0] VecScatterCreate(): General case: MPI to Seq
>>>> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space: 5257543 unneeded,136718 used
>>>> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 89
>>>> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
>>>> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
>>>> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space: 5464978 unneeded,136718 used
>>>> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 490
>>>> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
>>>> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
>>>> K and q SetValues took 26.426 s
>>>> [0] PCSetUp(): Setting up PC for first time
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [0] VecScatterCreate(): Special case: processor zero gets entire parallel vector, rest get none
>>>> ** Max-trans not allowed because matrix is distributed
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [0] PCSetUp(): Leaving PC with identical preconditioner since operator is unchanged
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
>>>> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
>>>> [0] VecScatterCreate(): General case: Seq to MPI
>>>> [1] VecScatterCreate(): General case: Seq to MPI
>>>> Solution took 102.21 s
>>>>
>>>> NL iteration 0: delta = 32.0488 67.6279.
>>>> Error delta calc took 0.045 s
>>>> Node and Element temps update took 0.017 s
>>>> [0] VecAssemblyBegin_MPI(): Stash has 0 entries, uses 0 mallocs.
>>>> [0] VecAssemblyBegin_MPI(): Block-Stash has 0 entries, uses 0 mallocs.
>>>> [0] MatStashScatterBegin_Ref(): No of messages: 0
>>>> [0] MatAssemblyBegin_MPIAIJ(): Stash has 0 entries, uses 0 mallocs.
>>>> [1] MatStashScatterBegin_Ref(): No of messages: 1
>>>> [1] MatStashScatterBegin_Ref(): Mesg_to: 0: size: 7713792 bytes
>>>> [1] MatAssemblyBegin_MPIAIJ(): Stash has 482112 entries, uses 0 mallocs.
>>>> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 599214; storage space: 0 unneeded,15128672 used
>>>> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>>>> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>>>> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 599214) < 0.6. Do not use CompressedRow routines.
>>>> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 621594; storage space: 0 unneeded,15545404 used
>>>> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>>>> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 27
>>>> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 621594) < 0.6. Do not use CompressedRow routines.
>>>> [1] MatAssemblyEnd_SeqAIJ(): Matrix size: 599214 X 15700; storage space: 0 unneeded,136718 used
>>>> [1] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>>>> [1] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 19
>>>> [1] MatCheckCompressedRow(): Found the ratio (num_zerorows 582718)/(num_localrows 599214) > 0.6. Use CompressedRow routines.
>>>> [0] MatAssemblyEnd_SeqAIJ(): Matrix size: 621594 X 16496; storage space: 0 unneeded,136718 used
>>>> [0] MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
>>>> [0] MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 16
>>>> [0] MatCheckCompressedRow(): Found the ratio (num_zerorows 605894)/(num_localrows 621594) > 0.6. Use CompressedRow routines.
>>>> K and q SetValues took 2.366 s
>>>> [0] PCSetUp(): Setting up PC with same nonzero pattern
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [0] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374780
>>>> [1] PetscCommDuplicate(): Using internal PETSc communicator 1140850689 -2080374782
>>>> [0] VecScatterCreateCommon_PtoS(): Using MPI_Alltoallv() for scatter
>>>> [0] VecScatterCreateCommon_PtoS(): Using blocksize 1 scatter
>>>> [0] VecScatterCreate(): General case: Seq to MPI
>>>> [1] VecScatterCreate(): General case: Seq to MPI
>>>> Solution took 82.156 s
>>>>
>>>> -----Original Message-----
>>>> From: Barry Smith [mailto:[email protected]]
>>>> Sent: Wednesday, October 05, 2016 4:42 PM
>>>> To: [email protected]
>>>> Cc: [email protected]
>>>> Subject: Re: [petsc-users] large PetscCommDuplicate overhead
>>>>
>>>>> On Oct 5, 2016, at 2:30 PM, Matthew Overholt <[email protected]> wrote:
>>>>>
>>>>> Hi Petsc-Users,
>>>>>
>>>>> I am trying to understand an issue where PetscCommDuplicate() calls are
>>>>> taking an increasing percentage of time as I run a fixed-sized problem on
>>>>> more processes.
>>>>>
>>>>> I am using the FEM to solve the steady-state heat transfer equation
>>>>> (K.x = q) using a PC direct solver, like MUMPS.
>>>>>
>>>>> I am running on the NERSC Cray X30, which has two Xeons per node with 12
>>>>> cores each, and profiling the code using CrayPat sampling.
>>>>>
>>>>> On a typical problem (1E+6 finite elements), running on a single node:
>>>>> - for 2 cores (1 on each Xeon), about 1% of time is PetscCommDuplicate (on
>>>>> process 1, but on the root it is less), and (for reference) 9% of total
>>>>> time is for MUMPS.
>>>>> - for 8 cores (4 on each Xeon), over 6% of time is PetscCommDuplicate (on
>>>>> every process except the root, where it is <1%), and 9-10% of total time
>>>>> is for MUMPS.
>>>>
>>>> What does PetscCommDuplicate() have to do with MUMPS? Nothing at all; you
>>>> are just giving its time for comparison?
>>>>
>>>>> What is the large PetscCommDuplicate time connected to, an increasing
>>>>> number of messages (tags)? Would using fewer MatSetValues() and
>>>>> VecSetValues() calls (with longer message lengths) alleviate this?
>>>>
>>>> No, PetscCommDuplicate won't increase with more messages or calls to
>>>> XXXSetValues(). PetscCommDuplicate() is essentially only called on the
>>>> creation of new PETSc objects. It should also be fast since it basically
>>>> needs to do just an MPI_Attr_get(). With more processes but the same
>>>> problem size and code there should be pretty much the same number of
>>>> objects created.
>>>>
>>>> PetscSpinlockLock() does nothing if you are not using threads, so it won't
>>>> take any time.
>>>>
>>>> Is there a way to see where it is spending its time inside
>>>> PetscCommDuplicate()? Perhaps the Cray MPI_Attr_get() has issues.
>>>>
>>>> Barry
>>>>
>>>>> For reference, the PETSc calling sequence in the code is as follows.
>>>>> // Create the solution and RHS vectors
>>>>> ierr = VecCreate(petscData->mpicomm,&mesh->hpx);
>>>>> ierr = PetscObjectSetName((PetscObject) mesh->hpx, "Solution");
>>>>> ierr = VecSetSizes(mesh->hpx,mesh->lxN,mesh->neqns); // size = # of equations; distribution to match mesh
>>>>> ierr = VecSetFromOptions(mesh->hpx); // allow run time options
>>>>> ierr = VecDuplicate(mesh->hpx,&q);   // create the RHS vector
>>>>> // Create the stiffness matrix
>>>>> ierr = MatCreate(petscData->mpicomm,&K);
>>>>> ierr = MatSetSizes(K,mesh->lxN,mesh->lxN,mesh->neqns,mesh->neqns);
>>>>> ierr = MatSetType(K,MATAIJ);         // default sparse type
>>>>> // Do preallocation
>>>>> ierr = MatMPIAIJSetPreallocation(K,d_nz,NULL,o_nz,NULL);
>>>>> ierr = MatSeqAIJSetPreallocation(K,d_nz,NULL);
>>>>> ierr = MatSetOption(K,MAT_NEW_NONZERO_ALLOCATION_ERR,PETSC_FALSE);
>>>>> ierr = MatSetUp(K);
>>>>> // Create and set up the KSP context as a PreConditioner Only (Direct) Solution
>>>>> ierr = KSPCreate(petscData->mpicomm,&ksp);
>>>>> ierr = KSPSetOperators(ksp,K,K);
>>>>> ierr = KSPSetType(ksp,KSPPREONLY);
>>>>> // Set the temperature vector
>>>>> ierr = VecSet(mesh->hpx,mesh->Tmin);
>>>>> // Set the default PC method as MUMPS
>>>>> ierr = KSPGetPC(ksp,&pc);            // extract the preconditioner
>>>>> ierr = PCSetType(pc,PCLU);           // set pc options
>>>>> ierr = PCFactorSetMatSolverPackage(pc,MATSOLVERMUMPS);
>>>>> ierr = KSPSetFromOptions(ksp);
>>>>>
>>>>> // Set the values for the K matrix and q vector,
>>>>> // which involves a lot of these calls
>>>>> ierr = MatSetValues(K,mrows,idxm,ncols,idxn,pKe,ADD_VALUES); // 1 call per matrix row (equation)
>>>>> ierr = VecSetValues(q,nqe,ixn,pqe,ADD_VALUES);               // 1 call per element
>>>>> ierr = VecAssemblyBegin(q);
>>>>> ierr = MatAssemblyBegin(K,MAT_FINAL_ASSEMBLY);
>>>>> ierr = VecAssemblyEnd(q);
>>>>> ierr = MatAssemblyEnd(K,MAT_FINAL_ASSEMBLY);
>>>>>
>>>>> // Solve
>>>>> //////////////////////////////////////
>>>>> ierr = KSPSolve(ksp,q,mesh->hpx);
>>>>> ...
>>>>> *Note that the code evenly divides the finite elements over the total
>>>>> number of processors, and I am using ghosting of the FE vertices vector
>>>>> to handle the vertices that are needed on more than 1 process.
>>>>>
>>>>> Thanks in advance for your help,
>>>>> Matt Overholt
>>>>> CapeSym, Inc.
>>
>> --
>> What most experimenters take for granted before they begin their experiments
>> is infinitely more interesting than any results to which their experiments
>> lead.
>> -- Norbert Wiener
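As a footnote to Barry's point in the quoted thread that PetscCommDuplicate() is essentially only called when new PETSc objects are created and should amount to little more than a cached-attribute lookup, the pattern can be sketched roughly as below. This is a schematic only, not PETSc's actual implementation; the function name CommDuplicateSketch and the omitted error handling are invented for illustration (the real routine also manages tag counters and reference counts).

#include <mpi.h>
#include <stdlib.h>

static int inner_keyval = MPI_KEYVAL_INVALID;

/* Schematic of the attribute-caching pattern: the first call on a user
 * communicator duplicates it and caches the inner communicator as an
 * attribute; every later call should only pay for an MPI_Attr_get(). */
int CommDuplicateSketch(MPI_Comm user_comm, MPI_Comm *inner_comm)
{
  MPI_Comm *cached;
  int       flag;

  if (inner_keyval == MPI_KEYVAL_INVALID)
    MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, &inner_keyval, NULL);

  MPI_Attr_get(user_comm, inner_keyval, &cached, &flag);
  if (!flag) {                        /* first use: duplicate and cache */
    cached = (MPI_Comm *)malloc(sizeof(MPI_Comm));
    MPI_Comm_dup(user_comm, cached);
    MPI_Attr_put(user_comm, inner_keyval, cached);
  }
  *inner_comm = *cached;              /* subsequent uses: lookup only */
  return 0;
}

With the same problem size and code, the number of objects created (and hence the number of such lookups) should be nearly independent of the process count, which is why the growing per-process cost reported above points at the MPI_Attr_get() path itself.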
