How many MPI processes did you use? Please try with just one to get a baseline.
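
   If it is useful, here is a tiny petsc4py check (generic, not taken from your code) that prints how many ranks a run actually has:

    from petsc4py import PETSc

    # Prints once (from rank 0) the number of MPI ranks in this run.
    PETSc.Sys.Print(f"MPI ranks: {PETSc.COMM_WORLD.getSize()}")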
   From the log you sent:

MatSetValues      1298 1.0 1.2313e-01 1.1 1.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 32  0  0  0   1 32  0  0  0   669
MatGetRow         1298 1.0 1.6896e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAXPY              1 1.0 1.7225e-01 1.0 1.18e+07 1.0 8.4e+01 1.3e+03 6.0e+00  1 32  5 14  0   1 32  5 14  0   478

   This is concerning: there are 1298 calls each to MatSetValues and MatGetRow. Are you calling them yourself? If not, that means MatAXPY is calling them (when SAME_NONZERO_PATTERN is not used, MatAXPY falls back on these routines).

> I'm still somewhat new to parallel computing, so I'm not sure what you mean
> by binding. Does this lock in certain processes to certain cores?

   Yes, and it is usually the right thing to do for numerical computing. I included a link to some discussion of it on petsc.org.

> On Nov 11, 2023, at 6:38 PM, Donald Rex Planalp <[email protected]> wrote:
> 
> Hello again,
> 
> I've run the simulation with profiling again. In this setup I only ran the
> methods necessary to construct the matrices, and then at the end I added
> H_mix and H_ang using MatAXPY. Below are the results.
> 
> ------------------------------------------------------------------------------------------------------------------------
> Event             Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total
>                      Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
> BuildTwoSided      350 1.0 6.9605e-01 1.9 0.00e+00 0.0 3.9e+02 5.6e+00 3.5e+02  3  0 24  0 22   3  0 24  0 22     0
> BuildTwoSidedF     236 1.0 6.9569e-01 1.9 0.00e+00 0.0 2.3e+02 1.1e+01 2.4e+02  3  0 14  0 15   3  0 14  0 15     0
> SFSetGraph         114 1.0 2.0555e-04 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SFSetUp            116 1.0 1.2123e-03 1.1 0.00e+00 0.0 6.2e+02 8.9e+02 1.1e+02  0  0 38 71  7   0  0 38 71  7     0
> SFPack             312 1.0 6.2350e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SFUnpack           312 1.0 2.2470e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecView             26 1.0 4.0207e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecCopy             13 1.0 7.6100e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyBegin    14 1.0 2.9690e-04 2.1 0.00e+00 0.0 2.3e+02 1.1e+01 1.4e+01  0  0 14  0  1   0  0 14  0  1     0
> VecAssemblyEnd      14 1.0 2.9400e-05 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecScatterBegin    312 1.0 6.0288e-04 1.6 0.00e+00 0.0 7.3e+02 3.1e+02 1.0e+02  0  0 45 29  6   0  0 45 29  7     0
> VecScatterEnd      312 1.0 9.1886e-04 3.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSetRandom         2 1.0 3.6500e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMult            104 1.0 2.5355e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
> MatSolve           104 1.0 2.4431e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 1.1e+02  0  0 49 29  7   0  0 49 29  7    36
> MatLUFactorSym       2 1.0 1.2584e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLUFactorNum       2 1.0 7.7804e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatScale             2 1.0 4.5068e-02 1.0 2.36e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0 65  0  0  0   0 65  0  0  0  3657
> MatAssemblyBegin   169 1.0 4.2296e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+02  2  0  0  0  7   2  0  0  0  7     0
> MatAssemblyEnd     169 1.0 3.6979e+00 1.0 0.00e+00 0.0 5.9e+02 9.4e+02 7.8e+02 22  0 36 71 48  22  0 36 71 49     0
> MatSetValues      1298 1.0 1.2313e-01 1.1 1.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 32  0  0  0   1 32  0  0  0   669
> MatGetRow         1298 1.0 1.6896e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAXPY              1 1.0 1.7225e-01 1.0 1.18e+07 1.0 8.4e+01 1.3e+03 6.0e+00  1 32  5 14  0   1 32  5 14  0   478
> PCSetUp              2 1.0 2.0848e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  1     0
> PCApply            104 1.0 2.4459e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 1.1e+02  0  0 49 29  7   0  0 49 29  7    36
> KSPSetUp             2 1.0 1.8400e-06 5.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve           104 1.0 2.5008e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 2.2e+02  0  0 49 29 14   0  0 49 29 14    35
> EPSSetUp             2 1.0 2.4027e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+01  0  0  0  0  1   0  0  0  0  1     0
> EPSSolve             2 1.0 3.0789e-02 1.0 1.09e+06 1.1 8.0e+02 2.8e+02 4.5e+02  0  3 49 29 28   0  3 49 29 28   242
> STSetUp              2 1.0 2.1294e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  1     0
> STComputeOperatr     2 1.0 2.1100e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> STApply            104 1.0 2.5290e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
> STMatSolve         104 1.0 2.5081e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 2.2e+02  0  0 49 29 14   0  0 49 29 14    35
> BVCopy              20 1.0 2.8770e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> BVMultVec          206 1.0 1.8736e-04 1.5 3.13e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0 11442
> BVMultInPlace       11 1.0 9.2730e-05 1.2 1.49e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 11026
> BVDotVec           206 1.0 1.0389e-03 1.2 3.35e+05 1.1 0.0e+00 0.0e+00 2.1e+02  0  1  0  0 13   0  1  0  0 13  2205
> BVOrthogonalizeV   106 1.0 1.2649e-03 1.1 6.84e+05 1.1 0.0e+00 0.0e+00 2.1e+02  0  2  0  0 13   0  2  0  0 13  3706
> BVScale            106 1.0 9.0329e-05 1.9 5.51e+03 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   418
> BVSetRandom          2 1.0 1.4080e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> BVMatMultVec       104 1.0 2.5468e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
> DSSolve              9 1.0 1.2452e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> DSVectors           24 1.0 1.5345e-04 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> DSOther             27 1.0 1.6735e-04 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> ------------------------------------------------------------------------------------------------------------------------
> 
> I believe the only matrix addition I did is the one shown there, taking about
> 0.17 seconds. Doing this twice per time step, in addition to the scaling, ends
> up taking about 0.4 seconds per step of the full calculation, so the time step
> is far too slow.
> 
> However, it seems that the number of flops is on the order of 10^7. Would this
> imply that I am perhaps memory-bandwidth bound on my home PC?
> 
> I'm still somewhat new to parallel computing, so I'm not sure what you mean
> by binding. Does this lock in certain processes to certain cores?
> 
> Thank you for your time and assistance
> 
> On Sat, Nov 11, 2023 at 2:54 PM Barry Smith <[email protected]> wrote:
>> 
>>    Here is the code for BAIJ matrices with SAME_NONZERO_PATTERN (the other
>> formats are similar):
>> 
>> PetscErrorCode MatAXPY_SeqBAIJ(Mat Y, PetscScalar a, Mat X, MatStructure str)
>> {
>>   Mat_SeqBAIJ *x = (Mat_SeqBAIJ *)X->data, *y = (Mat_SeqBAIJ *)Y->data;
>>   PetscInt     bs = Y->rmap->bs, bs2 = bs * bs;
>>   PetscBLASInt one = 1;
>>   ....
>>   if (str == SAME_NONZERO_PATTERN) {
>>     PetscScalar  alpha = a;
>>     PetscBLASInt bnz;
>>     PetscCall(PetscBLASIntCast(x->nz * bs2, &bnz));
>>     PetscCallBLAS("BLASaxpy", BLASaxpy_(&bnz, &alpha, x->a, &one, y->a, &one));
>>     PetscCall(PetscObjectStateIncrease((PetscObject)Y));
>> 
>>    It directly adds the nonzero values from the two matrices together (the
>> nonzero structure plays no role) and it uses the BLAS, so it should perform
>> as "fast as possible" given the hardware. Did you configure with
>> --with-debugging=0? Are you using binding with mpiexec to ensure MPI is
>> using the best combination of cores? See
>> https://petsc.org/release/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup
>> 
>>    Can you run with -log_view and send the output so we can see the
>> performance directly?
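
   For reference, this is roughly what that fast path looks like from petsc4py. This is only an illustrative, self-contained sketch with tiny sequential matrices and made-up sizes and values, not your problem:

    from petsc4py import PETSc

    n = 4
    A = PETSc.Mat().createAIJ([n, n], nnz=2, comm=PETSc.COMM_SELF)
    B = PETSc.Mat().createAIJ([n, n], nnz=2, comm=PETSc.COMM_SELF)
    for i in range(n):
        for M, v in ((A, 2.0), (B, 1.0)):
            M.setValue(i, i, v)            # same diagonal location in both
            M.setValue(i, (i + 1) % n, v)  # same off-diagonal location in both
    A.assemble()
    B.assemble()

    # A and B were assembled with identical nonzero patterns, so declaring
    # SAME_NONZERO_PATTERN lets MatAXPY take the BLAS path shown above,
    # with no MatGetRow/MatSetValues traffic.
    A.axpy(0.5, B, structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)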
>>> On Nov 11, 2023, at 4:25 PM, Donald Rex Planalp <[email protected]> wrote:
>>> 
>>> Hello, and thank you for the quick reply.
>>> 
>>> For some context, all of these matrices are produced from a Kronecker
>>> product, computed in parallel, between an "angular" matrix and a "radial"
>>> matrix. In this case S_total and H_atom use the same angular matrix, which
>>> is the identity, while their radial matrices have different sparsity patterns.
>>> 
>>> Meanwhile, H_mix and H_ang are constructed from two different angular
>>> matrices; however, both are nonzero only along the two off-diagonals, and
>>> their radial matrices have the same sparse structure.
>>> 
>>> So S_total and H_atom share one block-sparse structure, while H_mix and
>>> H_ang share another. However, even just adding H_mix and H_ang for the
>>> simplest case takes around half a second, which is still too slow, even
>>> when using SAME_NONZERO_PATTERN.
>>> 
>>> This confuses me the most, because another researcher wrote a similar code
>>> a few years back using finite differences, so his matrices are on the order
>>> of 250k x 250k and larger, yet his matrix additions are far faster.
>>> 
>>> If you would like I can show more code snippets, but I'm not sure what you
>>> would want to see.
>>> 
>>> Thank you for your time
>>> 
>>> On Sat, Nov 11, 2023 at 11:22 AM Barry Smith <[email protected]> wrote:
>>>> 
>>>>    DIFFERENT_NONZERO_PATTERN will always be slow because it needs to
>>>> determine the nonzero pattern of the result for each computation.
>>>> 
>>>>    SAME_NONZERO_PATTERN can obviously run fast but may not be practical
>>>> for your problem.
>>>> 
>>>>    SUBSET_NONZERO_PATTERN (which means that in A += b*B we know B has NO
>>>> nonzero locations that A does not have) does not need to reallocate
>>>> anything in A, so it can be reasonably fast (and we can do more
>>>> optimizations on the current code to improve it for you).
>>>> 
>>>>    So, can you construct the S nonzero structure initially to have the
>>>> union of the nonzero structures of all the matrices that accumulate into
>>>> it? This is the same nonzero structure that it "ends up with" in the
>>>> current code. With this nonzero structure, SUBSET_NONZERO_PATTERN can be
>>>> used. Depending on how sparse all the matrices that get accumulated into S
>>>> are, you could perhaps go further and make sure they all have the same
>>>> nonzero structure (yes, it uses more memory and does extra operations, but
>>>> the actual computation will be so fast it may be worth it).
>>>> 
>>>>    Barry
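
   A sketch of what that accumulation could look like from petsc4py, assuming S (or a work copy of it) really was assembled up front with entries (explicit zeros are fine) at every nonzero location of H_mix, H_ang and H_atom, so that the subset condition holds. The helper name and its arguments below are hypothetical; only the axpy calls matter:

    from petsc4py import PETSc

    SUBSET = PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN

    # Hypothetical helper: accumulate the field terms into a work matrix that
    # already contains the union nonzero pattern.
    def accumulate(S_work, H_mix, H_ang, H_atom, pulse_i, sign):
        # S_work already has an entry wherever the H matrices do, so these
        # additions never change its nonzero structure or reallocate.
        S_work.axpy(sign * pulse_i, H_mix, SUBSET)
        S_work.axpy(sign * pulse_i, H_ang, SUBSET)
        S_work.axpy(sign * 1.0, H_atom, SUBSET)
        return S_work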
>>>>> On Nov 10, 2023, at 9:53 PM, Donald Rex Planalp <[email protected]> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> I am trying to use petsc4py to conduct a quantum mechanics simulation.
>>>>> I've been able to construct all of the relevant matrices, however I am
>>>>> hitting a gigantic bottleneck.
>>>>> 
>>>>> For the simplest problem I am running I have a few matrices, each about
>>>>> 5000x5000. In order to begin time propagation I need to add these
>>>>> matrices together. However, on 6 cores of my local machine it is taking
>>>>> approximately 1-2 seconds per addition. Since I need to do this for each
>>>>> time step in my simulation it is prohibitively slow, since there could be
>>>>> upwards of 10K time steps.
>>>>> 
>>>>> Below is the relevant code:
>>>>> 
>>>>> structure = PETSc.Mat.Structure.DIFFERENT_NONZERO_PATTERN
>>>>> if test2:
>>>>>     def makeLeft(S, MIX, ANG, ATOM, i):
>>>>>         S.axpy(-Field.pulse[i], MIX, structure)
>>>>>         S.axpy(-Field.pulse[i], ANG, structure)
>>>>>         S.axpy(-1, ATOM, structure)
>>>>>         return S
>>>>> 
>>>>>     def makeRight(S, MIX, ANG, ATOM, i):
>>>>>         S.axpy(Field.pulse[i], MIX, structure)
>>>>>         S.axpy(Field.pulse[i], ANG, structure)
>>>>>         S.axpy(1, ATOM, structure)
>>>>>         return S
>>>>> 
>>>>>     H_mix = Int.H_mix
>>>>>     H_mix.scale(1j * dt / 2)
>>>>> 
>>>>>     H_ang = Int.H_ang
>>>>>     H_ang.scale(1j * dt / 2)
>>>>> 
>>>>>     H_atom = Int.H_atom
>>>>>     H_atom.scale(1j * dt / 2)
>>>>> 
>>>>>     S = Int.S_total
>>>>> 
>>>>>     psi_initial = psi.psi_initial.copy()
>>>>>     ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
>>>>> 
>>>>>     for i, t in enumerate(box.t):
>>>>>         print(i, L)
>>>>> 
>>>>>         O_L = makeLeft(S, H_mix, H_ang, H_atom, i)
>>>>>         O_R = makeRight(S, H_mix, H_ang, H_atom, i)
>>>>> 
>>>>>         if i == 0:
>>>>>             known = O_R.getVecRight()
>>>>>             sol = O_L.getVecRight()
>>>>> 
>>>>>         O_R.mult(psi_initial, known)
>>>>> 
>>>>>         ksp.setOperators(O_L)
>>>>>         ksp.solve(known, sol)
>>>>> 
>>>>>         psi_initial.copy(sol)
>>>>> 
>>>>> I need to clean it up a bit, but the main point is that I need to add the
>>>>> matrices many times for a single time step. I can't preallocate memory
>>>>> very well since some of the matrices aren't the most sparse either. It
>>>>> seems that if I cannot speed up the addition it will be difficult to
>>>>> continue, so I was wondering if you had any insights.
>>>>> 
>>>>> Thank you for your time
>> 
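
   Putting the suggestions in this thread together, here is one possible restructuring of the inner loop (untested; it reuses the names S, H_mix, H_ang, H_atom, Field and box from the snippet above, and it assumes S was preallocated with the union nonzero pattern, so its duplicates contain every location the H matrices use):

    from petsc4py import PETSc

    SAME = PETSc.Mat.Structure.SAME_NONZERO_PATTERN
    SUBSET = PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN

    # Allocated once, outside the time loop; both share S's (union) pattern.
    O_L = S.duplicate(copy=True)
    O_R = S.duplicate(copy=True)

    for i, t in enumerate(box.t):
        S.copy(O_L, structure=SAME)               # O_L <- S, no reallocation
        O_L.axpy(-Field.pulse[i], H_mix, SUBSET)
        O_L.axpy(-Field.pulse[i], H_ang, SUBSET)
        O_L.axpy(-1.0, H_atom, SUBSET)

        S.copy(O_R, structure=SAME)               # O_R <- S
        O_R.axpy(Field.pulse[i], H_mix, SUBSET)
        O_R.axpy(Field.pulse[i], H_ang, SUBSET)
        O_R.axpy(1.0, H_atom, SUBSET)

        # ... then form the right-hand side with O_R, set O_L on the KSP,
        # and solve exactly as in the loop above.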
