How many MPI processes did you use? Please try with just one to get a baseline.
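
   If it is useful, here is a tiny petsc4py check (generic, not taken from your code) that prints how many ranks a run actually has:

    from petsc4py import PETSc

    # Prints once (from rank 0) the number of MPI ranks in this run.
    PETSc.Sys.Print(f"MPI ranks: {PETSc.COMM_WORLD.getSize()}")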
   From the log you sent:

MatSetValues      1298 1.0 1.2313e-01 1.1 1.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 32  0  0  0   1 32  0  0  0   669
MatGetRow         1298 1.0 1.6896e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAXPY              1 1.0 1.7225e-01 1.0 1.18e+07 1.0 8.4e+01 1.3e+03 6.0e+00  1 32  5 14  0   1 32  5 14  0   478

   This is concerning: there are 1298 calls each to MatSetValues and MatGetRow. Are you calling them yourself? If not, that means MatAXPY is calling them (when SAME_NONZERO_PATTERN is not used, MatAXPY falls back on these routines).

> I'm still somewhat new to parallel computing, so I'm not sure what you mean
> by binding. Does this lock in certain processes to certain cores?

   Yes, and it is usually the right thing to do for numerical computing. I included a link to some discussion of it on petsc.org.

> On Nov 11, 2023, at 6:38 PM, Donald Rex Planalp <[email protected]> wrote:
> 
> Hello again,
> 
> I've run the simulation with profiling again. In this setup I only ran the
> methods necessary to construct the matrices, and then at the end I added
> H_mix and H_ang using MatAXPY. Below are the results.
> 
> ------------------------------------------------------------------------------------------------------------------------
> Event             Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total
>                      Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> ------------------------------------------------------------------------------------------------------------------------
> 
> --- Event Stage 0: Main Stage
> 
> BuildTwoSided      350 1.0 6.9605e-01 1.9 0.00e+00 0.0 3.9e+02 5.6e+00 3.5e+02  3  0 24  0 22   3  0 24  0 22     0
> BuildTwoSidedF     236 1.0 6.9569e-01 1.9 0.00e+00 0.0 2.3e+02 1.1e+01 2.4e+02  3  0 14  0 15   3  0 14  0 15     0
> SFSetGraph         114 1.0 2.0555e-04 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SFSetUp            116 1.0 1.2123e-03 1.1 0.00e+00 0.0 6.2e+02 8.9e+02 1.1e+02  0  0 38 71  7   0  0 38 71  7     0
> SFPack             312 1.0 6.2350e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> SFUnpack           312 1.0 2.2470e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecView             26 1.0 4.0207e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecCopy             13 1.0 7.6100e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecAssemblyBegin    14 1.0 2.9690e-04 2.1 0.00e+00 0.0 2.3e+02 1.1e+01 1.4e+01  0  0 14  0  1   0  0 14  0  1     0
> VecAssemblyEnd      14 1.0 2.9400e-05 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecScatterBegin    312 1.0 6.0288e-04 1.6 0.00e+00 0.0 7.3e+02 3.1e+02 1.0e+02  0  0 45 29  6   0  0 45 29  7     0
> VecScatterEnd      312 1.0 9.1886e-04 3.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> VecSetRandom         2 1.0 3.6500e-06 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatMult            104 1.0 2.5355e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
> MatSolve           104 1.0 2.4431e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 1.1e+02  0  0 49 29  7   0  0 49 29  7    36
> MatLUFactorSym       2 1.0 1.2584e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatLUFactorNum       2 1.0 7.7804e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatScale             2 1.0 4.5068e-02 1.0 2.36e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0 65  0  0  0   0 65  0  0  0  3657
> MatAssemblyBegin   169 1.0 4.2296e-01 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 1.1e+02  2  0  0  0  7   2  0  0  0  7     0
> MatAssemblyEnd     169 1.0 3.6979e+00 1.0 0.00e+00 0.0 5.9e+02 9.4e+02 7.8e+02 22  0 36 71 48  22  0 36 71 49     0
> MatSetValues      1298 1.0 1.2313e-01 1.1 1.18e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1 32  0  0  0   1 32  0  0  0   669
> MatGetRow         1298 1.0 1.6896e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatAXPY              1 1.0 1.7225e-01 1.0 1.18e+07 1.0 8.4e+01 1.3e+03 6.0e+00  1 32  5 14  0   1 32  5 14  0   478
> PCSetUp              2 1.0 2.0848e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  1     0
> PCApply            104 1.0 2.4459e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 1.1e+02  0  0 49 29  7   0  0 49 29  7    36
> KSPSetUp             2 1.0 1.8400e-06 5.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> KSPSolve           104 1.0 2.5008e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 2.2e+02  0  0 49 29 14   0  0 49 29 14    35
> EPSSetUp             2 1.0 2.4027e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.8e+01  0  0  0  0  1   0  0  0  0  1     0
> EPSSolve             2 1.0 3.0789e-02 1.0 1.09e+06 1.1 8.0e+02 2.8e+02 4.5e+02  0  3 49 29 28   0  3 49 29 28   242
> STSetUp              2 1.0 2.1294e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 8.0e+00  0  0  0  0  0   0  0  0  0  1     0
> STComputeOperatr     2 1.0 2.1100e-06 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> STApply            104 1.0 2.5290e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
> STMatSolve         104 1.0 2.5081e-02 1.0 1.31e+05 1.2 8.0e+02 2.8e+02 2.2e+02  0  0 49 29 14   0  0 49 29 14    35
> BVCopy              20 1.0 2.8770e-05 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> BVMultVec          206 1.0 1.8736e-04 1.5 3.13e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0 11442
> BVMultInPlace       11 1.0 9.2730e-05 1.2 1.49e+05 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0 11026
> BVDotVec           206 1.0 1.0389e-03 1.2 3.35e+05 1.1 0.0e+00 0.0e+00 2.1e+02  0  1  0  0 13   0  1  0  0 13  2205
> BVOrthogonalizeV   106 1.0 1.2649e-03 1.1 6.84e+05 1.1 0.0e+00 0.0e+00 2.1e+02  0  2  0  0 13   0  2  0  0 13  3706
> BVScale            106 1.0 9.0329e-05 1.9 5.51e+03 1.1 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   418
> BVSetRandom          2 1.0 1.4080e-05 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> BVMatMultVec       104 1.0 2.5468e-02 1.0 2.65e+05 1.1 8.0e+02 2.8e+02 2.2e+02  0  1 49 29 14   0  1 49 29 14    70
> DSSolve              9 1.0 1.2452e-03 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> DSVectors           24 1.0 1.5345e-04 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> DSOther             27 1.0 1.6735e-04 1.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> ------------------------------------------------------------------------------------------------------------------------
> 
> I believe the only matrix addition I did is the one shown there, taking about
> 0.17 seconds. Doing this twice per time step, in addition to the scaling, ends
> up taking about 0.4 seconds per step of the full calculation, so the time step
> is far too slow.
> 
> However, it seems that the number of flops is on the order of 10^7. Would this
> imply that I am perhaps memory-bandwidth bound on my home PC?
> 
> I'm still somewhat new to parallel computing, so I'm not sure what you mean
> by binding. Does this lock in certain processes to certain cores?
> 
> Thank you for your time and assistance
> 
> On Sat, Nov 11, 2023 at 2:54 PM Barry Smith <[email protected]> wrote:
>> 
>>    Here is the code for BAIJ matrices with SAME_NONZERO_PATTERN (the other
>> formats are similar):
>> 
>> PetscErrorCode MatAXPY_SeqBAIJ(Mat Y, PetscScalar a, Mat X, MatStructure str)
>> {
>>   Mat_SeqBAIJ *x = (Mat_SeqBAIJ *)X->data, *y = (Mat_SeqBAIJ *)Y->data;
>>   PetscInt     bs = Y->rmap->bs, bs2 = bs * bs;
>>   PetscBLASInt one = 1;
>>   ....
>>   if (str == SAME_NONZERO_PATTERN) {
>>     PetscScalar  alpha = a;
>>     PetscBLASInt bnz;
>>     PetscCall(PetscBLASIntCast(x->nz * bs2, &bnz));
>>     PetscCallBLAS("BLASaxpy", BLASaxpy_(&bnz, &alpha, x->a, &one, y->a, &one));
>>     PetscCall(PetscObjectStateIncrease((PetscObject)Y));
>> 
>>    It directly adds the nonzero values from the two matrices together (the
>> nonzero structure plays no role) and it uses the BLAS, so it should perform
>> as "fast as possible" given the hardware. Did you configure with
>> --with-debugging=0? Are you using binding with mpiexec to ensure MPI is
>> using the best combination of cores? See
>> https://petsc.org/release/faq/#what-kind-of-parallel-computers-or-clusters-are-needed-to-use-petsc-or-why-do-i-get-little-speedup
>> 
>>    Can you run with -log_view and send the output so we can see the
>> performance directly?
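
   For reference, this is roughly what that fast path looks like from petsc4py. This is only an illustrative, self-contained sketch with tiny sequential matrices and made-up sizes and values, not your problem:

    from petsc4py import PETSc

    n = 4
    A = PETSc.Mat().createAIJ([n, n], nnz=2, comm=PETSc.COMM_SELF)
    B = PETSc.Mat().createAIJ([n, n], nnz=2, comm=PETSc.COMM_SELF)
    for i in range(n):
        for M, v in ((A, 2.0), (B, 1.0)):
            M.setValue(i, i, v)            # same diagonal location in both
            M.setValue(i, (i + 1) % n, v)  # same off-diagonal location in both
    A.assemble()
    B.assemble()

    # A and B were assembled with identical nonzero patterns, so declaring
    # SAME_NONZERO_PATTERN lets MatAXPY take the BLAS path shown above,
    # with no MatGetRow/MatSetValues traffic.
    A.axpy(0.5, B, structure=PETSc.Mat.Structure.SAME_NONZERO_PATTERN)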
>>> On Nov 11, 2023, at 4:25 PM, Donald Rex Planalp <[email protected]> wrote:
>>> 
>>> Hello, and thank you for the quick reply.
>>> 
>>> For some context, all of these matrices are produced from a Kronecker
>>> product, computed in parallel, between an "angular" matrix and a "radial"
>>> matrix. In this case S_total and H_atom use the same angular matrix, which
>>> is the identity, while their radial matrices have different sparsity patterns.
>>> 
>>> Meanwhile, H_mix and H_ang are constructed from two different angular
>>> matrices; however, both are nonzero only along the two off-diagonals, and
>>> their radial matrices have the same sparse structure.
>>> 
>>> So S_total and H_atom share one block-sparse structure, while H_mix and
>>> H_ang share another. However, even just adding H_mix and H_ang for the
>>> simplest case takes around half a second, which is still too slow, even
>>> when using SAME_NONZERO_PATTERN.
>>> 
>>> This confuses me the most, because another researcher wrote a similar code
>>> a few years back using finite differences, so his matrices are on the order
>>> of 250k x 250k and larger, yet his matrix additions are far faster.
>>> 
>>> If you would like I can show more code snippets, but I'm not sure what you
>>> would want to see.
>>> 
>>> Thank you for your time
>>> 
>>> On Sat, Nov 11, 2023 at 11:22 AM Barry Smith <[email protected]> wrote:
>>>> 
>>>>    DIFFERENT_NONZERO_PATTERN will always be slow because it needs to
>>>> determine the nonzero pattern of the result for each computation.
>>>> 
>>>>    SAME_NONZERO_PATTERN can obviously run fast but may not be practical
>>>> for your problem.
>>>> 
>>>>    SUBSET_NONZERO_PATTERN (which means that in A += b*B we know B has NO
>>>> nonzero locations that A does not have) does not need to reallocate
>>>> anything in A, so it can be reasonably fast (and we can do more
>>>> optimizations on the current code to improve it for you).
>>>> 
>>>>    So, can you construct the S nonzero structure initially to have the
>>>> union of the nonzero structures of all the matrices that accumulate into
>>>> it? This is the same nonzero structure that it "ends up with" in the
>>>> current code. With this nonzero structure, SUBSET_NONZERO_PATTERN can be
>>>> used. Depending on how sparse all the matrices that get accumulated into S
>>>> are, you could perhaps go further and make sure they all have the same
>>>> nonzero structure (yes, it uses more memory and does extra operations, but
>>>> the actual computation will be so fast it may be worth it).
>>>> 
>>>>    Barry
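
   A sketch of what that accumulation could look like from petsc4py, assuming S (or a work copy of it) really was assembled up front with entries (explicit zeros are fine) at every nonzero location of H_mix, H_ang and H_atom, so that the subset condition holds. The helper name and its arguments below are hypothetical; only the axpy calls matter:

    from petsc4py import PETSc

    SUBSET = PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN

    # Hypothetical helper: accumulate the field terms into a work matrix that
    # already contains the union nonzero pattern.
    def accumulate(S_work, H_mix, H_ang, H_atom, pulse_i, sign):
        # S_work already has an entry wherever the H matrices do, so these
        # additions never change its nonzero structure or reallocate.
        S_work.axpy(sign * pulse_i, H_mix, SUBSET)
        S_work.axpy(sign * pulse_i, H_ang, SUBSET)
        S_work.axpy(sign * 1.0, H_atom, SUBSET)
        return S_work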
>>>>> On Nov 10, 2023, at 9:53 PM, Donald Rex Planalp <[email protected]> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> I am trying to use petsc4py to conduct a quantum mechanics simulation.
>>>>> I've been able to construct all of the relevant matrices, however I am
>>>>> hitting a gigantic bottleneck.
>>>>> 
>>>>> For the simplest problem I am running I have a few matrices, each about
>>>>> 5000x5000. In order to begin time propagation I need to add these
>>>>> matrices together. However, on 6 cores of my local machine it is taking
>>>>> approximately 1-2 seconds per addition. Since I need to do this for each
>>>>> time step in my simulation it is prohibitively slow, since there could be
>>>>> upwards of 10K time steps.
>>>>> 
>>>>> Below is the relevant code:
>>>>> 
>>>>> structure = PETSc.Mat.Structure.DIFFERENT_NONZERO_PATTERN
>>>>> if test2:
>>>>>     def makeLeft(S, MIX, ANG, ATOM, i):
>>>>>         S.axpy(-Field.pulse[i], MIX, structure)
>>>>>         S.axpy(-Field.pulse[i], ANG, structure)
>>>>>         S.axpy(-1, ATOM, structure)
>>>>>         return S
>>>>> 
>>>>>     def makeRight(S, MIX, ANG, ATOM, i):
>>>>>         S.axpy(Field.pulse[i], MIX, structure)
>>>>>         S.axpy(Field.pulse[i], ANG, structure)
>>>>>         S.axpy(1, ATOM, structure)
>>>>>         return S
>>>>> 
>>>>>     H_mix = Int.H_mix
>>>>>     H_mix.scale(1j * dt / 2)
>>>>> 
>>>>>     H_ang = Int.H_ang
>>>>>     H_ang.scale(1j * dt / 2)
>>>>> 
>>>>>     H_atom = Int.H_atom
>>>>>     H_atom.scale(1j * dt / 2)
>>>>> 
>>>>>     S = Int.S_total
>>>>> 
>>>>>     psi_initial = psi.psi_initial.copy()
>>>>>     ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
>>>>> 
>>>>>     for i, t in enumerate(box.t):
>>>>>         print(i, L)
>>>>> 
>>>>>         O_L = makeLeft(S, H_mix, H_ang, H_atom, i)
>>>>>         O_R = makeRight(S, H_mix, H_ang, H_atom, i)
>>>>> 
>>>>>         if i == 0:
>>>>>             known = O_R.getVecRight()
>>>>>             sol = O_L.getVecRight()
>>>>> 
>>>>>         O_R.mult(psi_initial, known)
>>>>> 
>>>>>         ksp.setOperators(O_L)
>>>>>         ksp.solve(known, sol)
>>>>> 
>>>>>         psi_initial.copy(sol)
>>>>> 
>>>>> I need to clean it up a bit, but the main point is that I need to add the
>>>>> matrices many times for a single time step. I can't preallocate memory
>>>>> very well since some of the matrices aren't the most sparse either. It
>>>>> seems that if I cannot speed up the addition it will be difficult to
>>>>> continue, so I was wondering if you had any insights.
>>>>> 
>>>>> Thank you for your time
>> 
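
   Putting the suggestions in this thread together, here is one possible restructuring of the inner loop (untested; it reuses the names S, H_mix, H_ang, H_atom, Field and box from the snippet above, and it assumes S was preallocated with the union nonzero pattern, so its duplicates contain every location the H matrices use):

    from petsc4py import PETSc

    SAME = PETSc.Mat.Structure.SAME_NONZERO_PATTERN
    SUBSET = PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN

    # Allocated once, outside the time loop; both share S's (union) pattern.
    O_L = S.duplicate(copy=True)
    O_R = S.duplicate(copy=True)

    for i, t in enumerate(box.t):
        S.copy(O_L, structure=SAME)               # O_L <- S, no reallocation
        O_L.axpy(-Field.pulse[i], H_mix, SUBSET)
        O_L.axpy(-Field.pulse[i], H_ang, SUBSET)
        O_L.axpy(-1.0, H_atom, SUBSET)

        S.copy(O_R, structure=SAME)               # O_R <- S
        O_R.axpy(Field.pulse[i], H_mix, SUBSET)
        O_R.axpy(Field.pulse[i], H_ang, SUBSET)
        O_R.axpy(1.0, H_atom, SUBSET)

        # ... then form the right-hand side with O_R, set O_L on the KSP,
        # and solve exactly as in the loop above.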
