"MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 73239"

The preallocation is VERY wrong. This is why the computation is so slow; this 
number should be zero. 
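
As a rough sketch of what correct preallocation looks like (this is not your 
code: the counts 81 and 40 below are only illustrative bounds taken from the 
"Maximum nonzeros in any row" lines in your -info output, and the real per-row 
counts should come from your element connectivity):

    Mat      A;
    PetscInt nlocal = 13892;   /* rows owned by this rank; placeholder value */

    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, nlocal, nlocal, PETSC_DETERMINE, PETSC_DETERMINE);
    MatSetType(A, MATAIJ);
    /* Preallocate; PETSc ignores whichever of these two calls does not
       match the matrix type actually created, so it is safe to call both. */
    MatSeqAIJSetPreallocation(A, 81, NULL);              /* 1 rank          */
    MatMPIAIJSetPreallocation(A, 81, NULL, 40, NULL);    /* diag / off-diag */
    /* ... MatSetValues() in the element loop, then
       MatAssemblyBegin()/MatAssemblyEnd() ... */

With single scalar bounds like this you will overestimate a bit; that only 
costs some memory, and -info reports the excess as "unneeded" storage. The NULL 
arguments can instead be per-row arrays (d_nnz/o_nnz) computed from the mesh if 
you want exact counts. Generating each element's entries on the rank that owns 
its rows also cuts the stash traffic; a rough sketch of such an element loop is 
at the end of this message, after the quoted thread.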



> On Dec 12, 2022, at 10:20 PM, 김성익 <[email protected]> wrote:
> 
> Following your comments, 
> I checked by using '-info'.
> 
> As you suspected, most elements are being computed on the wrong MPI rank.
> Also, there are a lot of stashed entries.
> 
> 
> 
> Should I divide the domain at the problem-definition stage,
> or is a proper preallocation sufficient?
> 
> 
> 
> [0] <sys> PetscCommDuplicate(): Duplicating a communicator 139687279637472 
> 94370404729840 max tags = 2147483647
> 
> [1] <sys> PetscCommDuplicate(): Duplicating a communicator 139620736898016 
> 94891084133376 max tags = 2147483647
> 
> [0] <mat> MatSetUp(): Warning not preallocating matrix storage
> 
> [1] <sys> PetscCommDuplicate(): Duplicating a communicator 139620736897504 
> 94891083133744 max tags = 2147483647
> 
> [0] <sys> PetscCommDuplicate(): Duplicating a communicator 139687279636960 
> 94370403730224 max tags = 2147483647
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736898016 94891084133376
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279637472 94370404729840
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736898016 94891084133376
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279637472 94370404729840
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736898016 94891084133376
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279637472 94370404729840
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279637472 94370404729840
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736898016 94891084133376
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736898016 94891084133376
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279637472 94370404729840
> 
>  TIME0 : 0.000000
> 
>  TIME0 : 0.000000
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 8 mallocs.
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 
> mallocs.
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 
> mallocs.
> 
> [0] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 460416 entries, uses 5 mallocs.
> 
> [1] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 461184 entries, uses 5 mallocs.
> 
> [0] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13892 X 13892; storage space: 
> 180684 unneeded,987406 used
> 
> [0] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 
> 73242
> 
> [0] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 81
> 
> [0] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows 
> 0)/(num_localrows 13892) < 0.6. Do not use CompressedRow routines.
> 
> [0] <mat> MatSeqAIJCheckInode(): Found 4631 nodes of 13892. Limit used: 5. 
> Using Inode routines
> 
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13891 X 13891; storage space: 
> 180715 unneeded,987325 used
> 
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 
> 73239
> 
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 81
> 
> [1] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows 
> 0)/(num_localrows 13891) < 0.6. Do not use CompressedRow routines.
> 
> [1] <mat> MatSeqAIJCheckInode(): Found 4631 nodes of 13891. Limit used: 5. 
> Using Inode routines
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [0] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13892 X 1390; storage space: 
> 72491 unneeded,34049 used
> 
> [0] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 
> 2472
> 
> [0] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 40
> 
> [0] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows 
> 12501)/(num_localrows 13892) > 0.6. Use CompressedRow routines.
> 
> Assemble Time : 174.079366sec
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13891 X 1391; storage space: 
> 72441 unneeded,34049 used
> 
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 
> 2469
> 
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 41
> 
> [1] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows 
> 12501)/(num_localrows 13891) > 0.6. Use CompressedRow routines.
> 
> Assemble Time : 174.141234sec
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 13891 entries, uses 8 mallocs.
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 
> mallocs.
> 
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Matrix size: 13891 X 13891; storage space: 
> 0 unneeded,987325 used
> 
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Number of mallocs during MatSetValues() is 
> 0
> 
> [1] <mat> MatAssemblyEnd_SeqAIJ(): Maximum nonzeros in any row is 81
> 
> [1] <mat> MatCheckCompressedRow(): Found the ratio (num_zerorows 
> 0)/(num_localrows 13891) < 0.6. Do not use CompressedRow routines.
> 
> [0] <pc> PCSetUp(): Setting up PC for first time
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator 
> is unchanged
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator 
> is unchanged
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator 
> is unchanged
> 
> Solving Time : 5.085394sec
> 
> [0] <ksp> KSPConvergedDefault(): Linear solver has converged. Residual norm 
> 1.258030470407e-17 is less than relative tolerance 1.000000000000e-05 times 
> initial right hand side norm 2.579617304779e-03 at iteration 1
> 
> Solving Time : 5.089733sec
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 
> mallocs.
> 
> [0] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 460416 entries, uses 0 mallocs.
> 
> [1] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 461184 entries, uses 0 mallocs.
> 
> Assemble Time : 5.242508sec
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> Assemble Time : 5.240863sec
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 13891 entries, uses 0 mallocs.
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 
> mallocs.
> 
>  
>      TIME : 1.000000,     TIME_STEP : 1.000000,      ITER : 2,     RESIDUAL : 
> 2.761615e-03
> 
>  
>      TIME : 1.000000,     TIME_STEP : 1.000000,      ITER : 2,     RESIDUAL : 
> 2.761615e-03
> 
> [0] <pc> PCSetUp(): Setting up PC with same nonzero pattern
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator 
> is unchanged
> 
> [0] <pc> PCSetUp(): Leaving PC with identical preconditioner since operator 
> is unchanged
> 
> [0] <ksp> KSPConvergedDefault(): Linear solver has converged. Residual norm 
> 1.539725065974e-19 is less than relative tolerance 1.000000000000e-05 times 
> initial right hand side norm 8.015104666105e-06 at iteration 1
> 
> Solving Time : 4.662785sec
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> Solving Time : 4.664515sec
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 
> mallocs.
> 
> [1] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 461184 entries, uses 0 mallocs.
> 
> [0] <mat> MatAssemblyBegin_MPIAIJ(): Stash has 460416 entries, uses 0 mallocs.
> 
> Assemble Time : 5.238257sec
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> [1] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139620736897504 94891083133744
> 
> Assemble Time : 5.236535sec
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
> [0] <sys> PetscCommDuplicate(): Using internal PETSc communicator 
> 139687279636960 94370403730224
> 
>  
>      TIME : 1.000000,     TIME_STEP : 1.000000,      ITER : 3,     RESIDUAL : 
> 3.705062e-08
> 
>  TIME0 : 1.000000
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 13891 entries, uses 0 mallocs.
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 
> mallocs.
> 
>  
>      TIME : 1.000000,     TIME_STEP : 1.000000,      ITER : 3,     RESIDUAL : 
> 3.705062e-08
> 
>  TIME0 : 1.000000
> 
> [1] <sys> PetscFinalize(): PetscFinalize() called
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Stash has 661 entries, uses 5 mallocs.
> 
> [0] <vec> VecAssemblyBegin_MPI_BTS(): Block-Stash has 0 entries, uses 0 
> mallocs.
> 
> [0] <sys> PetscFinalize(): PetscFinalize() called
> 
> 
> On Tue, Dec 13, 2022 at 12:50 AM, Barry Smith <[email protected] 
> <mailto:[email protected]>> wrote:
>> 
>>    The problem is possibly due to most elements being computed on the "wrong" 
>> MPI rank and thus requiring almost all the matrix entries to be "stashed" 
>> when computed and then sent off to the owning MPI rank.  Please send ALL the 
>> output of a parallel run with -info so we can see how much communication is 
>> done in the matrix assembly.
>> 
>>   Barry
>> 
>> 
>> > On Dec 12, 2022, at 6:16 AM, 김성익 <[email protected] 
>> > <mailto:[email protected]>> wrote:
>> > 
>> > Hello,
>> > 
>> > 
>> > I need some keywords or examples for parallelizing the matrix assembly 
>> > process.
>> > 
>> > My current state is as follows.
>> > - Finite element analysis code for structural mechanics.
>> > - Problem size: 3D solid hexahedral elements (number of elements: 125,000), 
>> > number of degrees of freedom: 397,953
>> > - Matrix type: seqaij; matrix preallocation set using 
>> > MatSeqAIJSetPreallocation
>> > - Matrix assembly time using 1 core: 120 sec
>> >    for (int i=0; i<125000; i++) {
>> >     ~~ element matrix calculation}
>> >    matassemblybegin
>> >    matassemblyend
>> > - Matrix assembly time using 8 cores: 70,234 sec
>> >   int start, end;
>> >   VecGetOwnershipRange( element_vec, &start, &end);
>> >   for (int i=start; i<end; i++){
>> >    ~~ element matrix calculation
>> >    matassemblybegin
>> >    matassemblyend
>> > 
>> > 
>> > As you can see, the parallel case takes far more time than the 
>> > sequential case.
>> > How can I speed this up?
>> > Can I get some keywords or examples for parallelizing matrix assembly in 
>> > finite element analysis?
>> > 
>> > Thanks,
>> > Hyung Kim
>> > 
>> 
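
For completeness, here is a rough owner-computes sketch of the element loop 
mentioned above (again not your code: elem_start, elem_end, get_element_dofs() 
and compute_element_matrix() are placeholders for whatever your mesh partition 
and element routines provide; the point is only that each rank generates the 
entries for the rows it owns, so almost nothing has to be stashed):

    PetscInt    idx[24];        /* 8-node hexahedron, 3 DOFs per node */
    PetscScalar ke[24 * 24];    /* dense element stiffness matrix     */

    for (PetscInt e = elem_start; e < elem_end; e++) {  /* this rank's elements */
      get_element_dofs(e, idx);             /* global row/column indices */
      compute_element_matrix(e, ke);        /* your element routine      */
      MatSetValues(A, 24, idx, 24, idx, ke, ADD_VALUES);
    }
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

Elements on partition boundaries will still contribute some off-rank rows; that 
is what the stash is for. It just should not be carrying essentially the whole 
matrix, as the 460416 and 461184 stashed entries above show it is now.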
