"MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 73239"
The preallocation is VERY wrong. This is why the computation is so slow; this number should be zero.

> On Dec 12, 2022, at 10:20 PM, 김성익 <[email protected]> wrote:
>
> Following your comments, I checked by using '-info'.
>
> As you suspected, most elements are being computed on the wrong MPI rank. Also, there are a lot of stashed entries.
>
> Should I partition the domain at the problem-definition stage, or is proper preallocation sufficient?
>
> [0] <sys> PetscCommDuplicate(): Duplicating a communicator 139687279637472 94370404729840 max tags = 2147483647
> [1] <sys> PetscCommDuplicate(): Duplicating a communicator 139620736898016 94891084133376 max tags = 2147483647
> [0] <mat> MatSetUp(): Warning not preallocating matrix storage
> [1] <sys> PetscCommDuplicate(): Duplicating a communicator 139620736897504 94891083133744 max tags = 2147483647
> [0] <sys> PetscCommDuplicate(): Duplicating a communicator 139687279636960 94370403730224 max tags = 2147483647
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736898016 94891084133376
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279637472 94370404729840
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736898016 94891084133376
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279637472 94370404729840
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736898016 94891084133376
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279637472 94370404729840
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279637472 94370404729840
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736898016 94891084133376
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736898016 94891084133376
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279637472 94370404729840
> TIME0 : 0.000000
> TIME0 : 0.000000
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 8 mallocs.
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 mallocs.
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 mallocs.
> [0] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 460416 entries, uses 5 mallocs.
> [1] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 461184 entries, uses 5 mallocs.
> [0] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13892 X 13892; storage space: 180684 unneeded,987406 used
> [0] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 73242
> [0] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 81
> [0] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 13892) < 0.6. Do not use CompressedRow routines.
> [0] <mat> MatSeqAIJCheckInode(): Found 4631 nodes of 13892. Limit used: 5. Using Inode routines
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13891 X 13891; storage space: 180715 unneeded,987325 used
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 73239
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 81
> [1] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 13891) < 0.6. Do not use CompressedRow routines.
> [1] <mat> MatSeqAIJCheckInode(): Found 4631 nodes of 13891. Limit used: 5. Using Inode routines
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [0] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13892 X 1390; storage space: 72491 unneeded,34049 used
> [0] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 2472
> [0] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 40
> [0] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows 12501)/(num_localrows 13892) > 0.6. Use CompressedRow routines.
> Assemble Time : 174.079366sec
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13891 X 1391; storage space: 72441 unneeded,34049 used
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 2469
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 41
> [1] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows 12501)/(num_localrows 13891) > 0.6. Use CompressedRow routines.
> Assemble Time : 174.141234sec
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 13891 entries, uses 8 mallocs.
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 mallocs.
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13891 X 13891; storage space: 0 unneeded,987325 used
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 0
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 81
> [1] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows 0)/(num_localrows 13891) < 0.6. Do not use CompressedRow routines.
> [0] <pc> PCSetUp(): Setting up PC for first time
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator is unchanged
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator is unchanged
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator is unchanged
> Solving Time : 5.085394sec
> [0] <ksp> KSPConvergedDefault(): Linear solver has converged. Residual norm 1.258030470407e-17 is less than relative tolerance 1.000000000000e-05 times initial right hand side norm 2.579617304779e-03 at iteration 1
> Solving Time : 5.089733sec
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 mallocs.
> [0] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 460416 entries, uses 0 mallocs.
> [1] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 461184 entries, uses 0 mallocs.
> Assemble Time : 5.242508sec
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> Assemble Time : 5.240863sec
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 13891 entries, uses 0 mallocs.
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 mallocs.
> TIME : 1.000000, TIME_STEP : 1.000000, ITER : 2, RESIDUAL : 2.761615e-03
> TIME : 1.000000, TIME_STEP : 1.000000, ITER : 2, RESIDUAL : 2.761615e-03
> [0] <pc> PCSetUp(): Setting up PC with same nonzero pattern
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator is unchanged
> [0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator is unchanged
> [0] <ksp> KSPConvergedDefault(): Linear solver has converged. Residual norm 1.539725065974e-19 is less than relative tolerance 1.000000000000e-05 times initial right hand side norm 8.015104666105e-06 at iteration 1
> Solving Time : 4.662785sec
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> Solving Time : 4.664515sec
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 mallocs.
> [1] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 461184 entries, uses 0 mallocs.
> [0] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 460416 entries, uses 0 mallocs.
> Assemble Time : 5.238257sec
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139620736897504 94891083133744
> Assemble Time : 5.236535sec
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 139687279636960 94370403730224
> TIME : 1.000000, TIME_STEP : 1.000000, ITER : 3, RESIDUAL : 3.705062e-08
> TIME0 : 1.000000
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 13891 entries, uses 0 mallocs.
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 mallocs.
> TIME : 1.000000, TIME_STEP : 1.000000, ITER : 3, RESIDUAL : 3.705062e-08
> TIME0 : 1.000000
> [1] <sys> PetscFinalize(): PetscFinalize() called
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 mallocs.
> [0] <sys> PetscFinalize(): PetscFinalize() called
>
> On Tue, Dec 13, 2022 at 12:50 AM, Barry Smith <[email protected]> wrote:
>>
>> The problem is possibly due to most elements being computed on the "wrong" MPI rank and thus requiring almost all the matrix entries to be "stashed" when computed and then sent off to the owning MPI rank. Please send ALL the output of a parallel run with -info so we can see how much communication is done in the matrix assembly.
>>
>> Barry
>>
>> > On Dec 12, 2022, at 6:16 AM, 김성익 <[email protected]> wrote:
>> >
>> > Hello,
>> >
>> > I need some keywords or some examples for parallelizing the matrix assembly process.
>> >
>> > My current state is as below.
>> > - Finite element analysis code for structural mechanics.
>> > - Problem size: 3D solid hexahedral elements (number of elements: 125,000), number of degrees of freedom: 397,953.
>> > - Matrix type: seqaij; matrix preallocation set by using MatSeqAIJSetPreallocation.
>> > - Matrix assembly time using 1 core: 120 sec
>> >     for (int i = 0; i < 125000; i++) {
>> >       // element matrix calculation
>> >     }
>> >     MatAssemblyBegin
>> >     MatAssemblyEnd
>> > - Matrix assembly time using 8 cores: 70,234 sec
>> >     int start, end;
>> >     VecGetOwnershipRange(element_vec, &start, &end);
>> >     for (int i = start; i < end; i++) {
>> >       // element matrix calculation
>> >     }
>> >     MatAssemblyBegin
>> >     MatAssemblyEnd
>> >
>> > As you can see, the parallel case takes far longer than the sequential case.
>> > How can I speed this up?
>> > Can I get some keywords or examples for parallelizing the assembly of a matrix in finite element analysis?
>> >
>> > Thanks,
>> > Hyung Kim
>>
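To illustrate Barry's point about the malloc count, here is a minimal sketch (not the poster's code) of creating an MPIAIJ matrix with exact per-row preallocation. The function name CreatePreallocatedMatrix and the arrays d_nnz/o_nnz, which would be filled by a first counting pass over the element connectivity, are assumptions made for illustration; only standard PETSc calls are used.

#include <petscmat.h>

/* Minimal sketch: create an AIJ matrix and preallocate it so that
   MatSetValues() never reallocates.  ndof_local/ndof_global are the
   local and global row counts; d_nnz[i]/o_nnz[i] are the numbers of
   nonzeros of locally owned row i that fall in the diagonal block
   (columns owned by this rank) and the off-diagonal block, assumed to
   be computed beforehand from the element connectivity. */
PetscErrorCode CreatePreallocatedMatrix(MPI_Comm comm, PetscInt ndof_local,
                                        PetscInt ndof_global,
                                        const PetscInt d_nnz[],
                                        const PetscInt o_nnz[], Mat *A)
{
  PetscFunctionBeginUser;
  PetscCall(MatCreate(comm, A));
  PetscCall(MatSetSizes(*A, ndof_local, ndof_local, ndof_global, ndof_global));
  PetscCall(MatSetFromOptions(*A)); /* AIJ by default: MPIAIJ on >1 rank */
  /* Exact per-row counts; the scalar arguments are ignored when the
     arrays are given.  The SeqAIJ call covers the 1-rank case; each
     call is a no-op for the other matrix type. */
  PetscCall(MatMPIAIJSetPreallocation(*A, 0, d_nnz, 0, o_nnz));
  PetscCall(MatSeqAIJSetPreallocation(*A, 0, d_nnz));
  /* Optional: turn any insertion outside the preallocation into an
     error instead of a silent malloc. */
  PetscCall(MatSetOption(*A, MAT_NEW_NONZERO_ALLOCATION_ERR, PETSC_TRUE));
  PetscFunctionReturn(0);
}

With exact counts, the -info line "Number of mallocs during MatSetValues() is ..." should report 0 on the first assembly, as it already does later in the log once the nonzero structure is in place.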

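Barry's other point, that most element contributions are computed on a rank that does not own the corresponding matrix rows, can be sketched as below. The element range elem_start..elem_end, the helpers GetElementDofs and ComputeElementStiffness, and the 8-node, 3-dof-per-node element size are assumptions for illustration; the idea is that each rank loops only over elements assigned to it by a partition that matches the matrix row distribution, adds the element matrix with global indices and ADD_VALUES, and calls MatAssemblyBegin/End once, so only rows on the partition boundary have to be communicated.

#include <petscmat.h>

#define NEN  8                /* nodes per hexahedral element (assumed) */
#define NDPN 3                /* dofs per node (assumed) */
#define NED  (NEN * NDPN)     /* dofs per element */

/* Placeholders for the application's connectivity lookup and element
   stiffness routine (hypothetical, not PETSc functions). */
extern void GetElementDofs(PetscInt e, PetscInt dofs[NED]);
extern void ComputeElementStiffness(PetscInt e, PetscScalar Ke[NED * NED]);

/* Minimal sketch of the assembly loop over locally owned elements. */
PetscErrorCode AssembleStiffness(Mat A, PetscInt elem_start, PetscInt elem_end)
{
  PetscInt    dofs[NED];
  PetscScalar Ke[NED * NED];

  PetscFunctionBeginUser;
  for (PetscInt e = elem_start; e < elem_end; e++) {
    GetElementDofs(e, dofs);        /* global row/column indices */
    ComputeElementStiffness(e, Ke); /* dense element matrix, row-major */
    PetscCall(MatSetValues(A, NED, dofs, NED, dofs, Ke, ADD_VALUES));
  }
  /* One assembly per filled matrix, outside the element loop. */
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
  PetscFunctionReturn(0);
}

Per Barry's diagnosis, the large stash reported by -info (roughly 460,000 entries per rank) is the cost of an element split that does not match the matrix row ownership; splitting elements with VecGetOwnershipRange on a vector of elements gives no such guarantee.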