Thank you very much ☺!

Runfeng
Barry Smith <[email protected]> wrote on Mon, Jul 3, 2023 at 22:52:

> On Jul 3, 2023, at 10:11 AM, Runfeng Jin <[email protected]> wrote:
>
> Hi,
>
>> We use a hash table to store the nonzeros on the fly, and then convert to
>> packed storage on assembly.

There is "extra memory" since the matrix entries are first stored in a hash
table and then converted into the regular CSR format, so for a short while
both copies are in memory.

We use the amazing khash package, include/petsc/private/khash/khash.h; our
code is scattered around a bit depending on the matrix format that will be
formed:

  cd src/mat
  git grep "_Hash("

  impls/aij/mpi/mpiaij.c:/* defines MatSetValues_MPI_Hash(), MatAssemblyBegin_MPI_Hash(), and MatAssemblyEnd_MPI_Hash() */
  impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:/* defines MatSetValues_MPICUSPARSE_Hash() */
  impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:  PetscCall(MatSetUp_MPI_Hash(A));
  impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetValues_MPI_Hash(Mat A, PetscInt m, const PetscInt *rows, PetscInt n, const PetscInt *cols, const PetscScalar *values, InsertMode addv)
  impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatAssemblyBegin_MPI_Hash(Mat A, PETSC_UNUSED MatAssemblyType type)
  impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatAssemblyEnd_MPI_Hash(Mat A, MatAssemblyType type)
  impls/aij/mpi/mpihashmat.h:    PetscCall(MatSetValues_MPI_Hash(A, 1, row + i, ncols, col + i, val + i, A->insertmode));
  impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatDestroy_MPI_Hash(Mat A)
  impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatZeroEntries_MPI_Hash(PETSC_UNUSED Mat A)
  impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetRandom_MPI_Hash(Mat A, PETSC_UNUSED PetscRandom r)
  impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetUp_MPI_Hash(Mat A)
  impls/aij/seq/aij.c:/* defines MatSetValues_Seq_Hash(), MatAssemblyEnd_Seq_Hash(), MatSetUp_Seq_Hash() */
  impls/aij/seq/seqhashmat.h:static PetscErrorCode MatAssemblyEnd_Seq_Hash(Mat A, MatAssemblyType type)
  impls/aij/seq/seqhashmat.h:  A->preallocated = PETSC_FALSE; /* this was set to true for the MatSetValues_Hash() to work */
  impls/aij/seq/seqhashmat.h:static PetscErrorCode MatDestroy_Seq_Hash(Mat A)
  impls/aij/seq/seqhashmat.h:static PetscErrorCode MatZeroEntries_Seq_Hash(Mat A)
  impls/aij/seq/seqhashmat.h:static PetscErrorCode MatSetRandom_Seq_Hash(Mat A, PetscRandom r)
  impls/aij/seq/seqhashmat.h:static PetscErrorCode MatSetUp_Seq_Hash(Mat A)
  impls/baij/mpi/mpibaij.c:/* defines MatSetValues_MPI_Hash(), MatAssemblyBegin_MPI_Hash(), and MatAssemblyEnd_MPI_Hash() */
  impls/baij/seq/baij.c:/* defines MatSetValues_Seq_Hash(), MatAssemblyEnd_Seq_Hash(), MatSetUp_Seq_Hash() */
  impls/sbaij/mpi/mpisbaij.c:/* defines MatSetValues_MPI_Hash(), MatAssemblyBegin_MPI_Hash(), MatAssemblyEnd_MPI_Hash(), MatSetUp_MPI_Hash() */
  impls/sbaij/seq/sbaij.c:/* defines MatSetValues_Seq_Hash(), MatAssemblyEnd_Seq_Hash(), MatSetUp_Seq_Hash() */

Thanks for the numbers; it is good to see the performance is so similar to
that obtained when providing preallocation information.

> Could you tell me which file implements this function?
>
> Runfeng
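A small conceptual sketch of the scheme described above: accumulate the
entries in a hash table keyed by (row, col) during insertion, then convert
to CSR only at assembly time. This is not PETSc's actual implementation
(which also handles the parallel stash and several storage formats); it
only uses the klib khash.h header, the same package PETSc vendors, and
every name in it is illustrative.

----------------
#include <stdio.h>
#include <stdlib.h>
#include "khash.h"   /* klib khash; PETSc vendors it as petsc/private/khash/khash.h */

KHASH_MAP_INIT_INT64(ij, double)                     /* (row,col) -> value */
#define IJ_KEY(i, j) (((khint64_t)(i) << 32) | (khint32_t)(j))

int main(void)
{
  const int    n      = 4;                  /* number of rows */
  int          rows[] = {0, 2, 0, 3, 2, 2};
  int          cols[] = {0, 1, 3, 3, 1, 0};
  double       vals[] = {1., 2., 3., 4., 5., 6.};
  khash_t(ij) *h      = kh_init(ij);
  int          ret;

  /* insertion phase: entries arrive in arbitrary order and may repeat */
  for (int e = 0; e < 6; e++) {
    khiter_t it = kh_put(ij, h, IJ_KEY(rows[e], cols[e]), &ret);
    if (ret) kh_val(h, it) = 0.0;           /* key was not present yet */
    kh_val(h, it) += vals[e];               /* ADD_VALUES-style accumulation */
  }

  /* "assembly" phase: convert the hash table to CSR; both copies briefly coexist */
  int       nnz    = (int)kh_size(h);
  int      *rowptr = calloc((size_t)n + 1, sizeof(*rowptr));
  int      *colidx = malloc((size_t)nnz * sizeof(*colidx));
  double   *a      = malloc((size_t)nnz * sizeof(*a));
  int      *next   = malloc((size_t)n * sizeof(*next));
  khint64_t key;
  double    val;

  kh_foreach(h, key, val, { (void)val; rowptr[(key >> 32) + 1]++; }); /* count per row */
  for (int i = 0; i < n; i++) rowptr[i + 1] += rowptr[i];             /* prefix sum     */
  for (int i = 0; i < n; i++) next[i] = rowptr[i];
  kh_foreach(h, key, val, {                                           /* scatter        */
    int i = (int)(key >> 32);
    int p = next[i]++;
    colidx[p] = (int)(key & 0xffffffffu);   /* columns left unsorted within each row */
    a[p]      = val;
  });
  kh_destroy(ij, h);                        /* the "extra memory" is released here */

  for (int i = 0; i < n; i++)
    for (int p = rowptr[i]; p < rowptr[i + 1]; p++)
      printf("A(%d,%d) = %g\n", i, colidx[p], a[p]);

  free(rowptr); free(colidx); free(a); free(next);
  return 0;
}
----------------

The point is the one made above: the hash table and the CSR arrays coexist
for a short while, after which the hash memory is released.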
Runfeng Jin <[email protected]> wrote on Mon, Jul 3, 2023 at 22:05:

Thank you for all your help!

Runfeng


Matthew Knepley <[email protected]> wrote on Mon, Jul 3, 2023 at 22:03:

> On Mon, Jul 3, 2023 at 9:56 AM Runfeng Jin <[email protected]> wrote:
>
> Hi, impressive performance!
>     I use the newest version of PETSc (release branch), and it almost
> eliminates all assembly and stash time at large processor counts (assembly
> time 64-4s / 128-2s / 256-0.2s, stash time all below 2s). For zero
> programming cost, this is really incredible.
>     The older code has a regular arrangement of the number of nonzero
> elements across rows, so I can make a good rough preallocation. And from
> the data, dedicatedly arranging the data and roughly estimating the maximum
> number of nonzero elements per row can give better performance than the new
> version without preallocation. However, in practice I will use the newer
> version without preallocation, for: 1) less programming effort and nearly
> the same good performance; 2) good memory usage (I see no unneeded memory
> after assembly); 3) dedicated preallocation is usually not easy and causes
> extra time cost.
>     Maybe it would be better to leave some room for the user to give a
> slight hint for the preallocation and thus obtain better performance, but I
> have no idea how such a hint should be given.
>     And I am very curious about how PETSc achieves this. How can it know
> nothing in advance yet achieve such good performance with no wasted memory?
> Could you give an explanation of this?

We use a hash table to store the nonzeros on the fly, and then convert to
packed storage on assembly.

  Thanks,

     Matt

> assemble time:
>   version\processors    4       8       16      32     64     128    256
>   old                   14677s  4694s   1124s   572s   38s    8s     2s
>   new                   50s     28s     15s     7.8s   4s     2s     0.4s
>   older                 27s     24s     19s     12s    14s    -      -
>
> stash time (max among all processors):
>   version\processors    4       8       16      32     64     128    256
>   old                   3145s   2554s   673s    329s   201s   142s   138s
>   new                   2s      1s      ~0s     ~0s    ~0s    ~0s    ~0s
>   older                 10s     73s     18s     5s     1s     -      -
>
> old:   my poor preallocation
> new:   the newest version of PETSc, with no preallocation at all
> older: the best-preallocation version of my code
>
> Runfeng


Barry Smith <[email protected]> wrote on Mon, Jul 3, 2023 at 12:19:

The main branch of PETSc now supports filling sparse matrices without
providing any preallocation information.

You can give it a try. Use your current fastest code but just remove ALL the
preallocation calls. I would be interested in what kind of performance you
get compared to your best current performance.

  Barry
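To make that concrete, here is a minimal sketch of an assembly loop with no
preallocation calls at all. It assumes a PETSc version that has the
hash-based assembly path; the matrix type, size, and tridiagonal values are
purely illustrative and are not the code from this thread.

----------------
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  PetscInt    N = 1000, rstart, rend;
  PetscInt    cols[3];
  PetscScalar v[3];

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N));
  PetscCall(MatSetFromOptions(A));   /* e.g. -mat_type aij (the default)      */
  PetscCall(MatSetUp(A));            /* note: no *SetPreallocation() anywhere */

  /* insert a toy tridiagonal matrix; entries go into the internal hash table */
  PetscCall(MatGetOwnershipRange(A, &rstart, &rend));
  for (PetscInt i = rstart; i < rend; i++) {
    PetscInt nc = 0;
    if (i > 0)     { cols[nc] = i - 1; v[nc++] = -1.0; }
    cols[nc] = i; v[nc++] = 2.0;
    if (i < N - 1) { cols[nc] = i + 1; v[nc++] = -1.0; }
    PetscCall(MatSetValues(A, 1, &i, nc, cols, v, INSERT_VALUES));
  }
  /* the hash table is converted to the packed (CSR-like) storage here */
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}
----------------

Off-process entries are still stashed and communicated during
MatAssemblyBegin/End, exactly as with a preallocated matrix.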
On Jul 2, 2023, at 11:24 PM, Runfeng Jin <[email protected]> wrote:

Hi! Good advice!
    I set values with the MatSetValues() API, which sets one part of a row
at a time (I use a kind of tiling technique, so I cannot get all the values
of a row at once).
    I tested the number of mallocs in these three cases. The number of
mallocs decreases as the number of processors increases, and all of them are
very large (the matrix is 283234 x 283234, as can be seen below). This
should be due to the poor preallocation: I use a rough preallocation in
which every processor counts the number of nonzero elements in its first 10
rows and uses the largest count to preallocate memory for all local rows. It
seems that this does not work well.

  number_of_processors    number_of_max_mallocs_among_all_processors
  64                      20000
  128                     17000
  256                     11000

    I changed my way of preallocating: I evenly sample 100 rows in every
local matrix and use the largest count to preallocate memory for all local
rows. Now the assembly time is reduced to a very small value.

  number_of_processors    number_of_max_mallocs_among_all_processors
  64                      3000
  128                     700
  256                     500

  Event  Count      Time (sec)         Flop                                     --- Global ---     --- Stage ----    Total
         Max Ratio  Max         Ratio  Max       Ratio  Mess     AvgLen  Reduct  %T %F %M %L %R     %T %F %M %L %R   Mflop/s
  64     1   1.0    3.8999e+01  1.0    0.00e+00  0.0    7.1e+03  2.9e+05 1.1e+01 15  0  1  8  3     15  0  1  8  3   0
  128    1   1.0    8.5714e+00  1.0    0.00e+00  0.0    2.6e+04  8.1e+04 1.1e+01  5  0  1  4  3      5  0  1  4  3   0
  256    1   1.0    2.5512e+00  1.0    0.00e+00  0.0    1.0e+05  2.3e+04 1.1e+01  2  0  1  3  3      2  0  1  3  3   0

    So could the reason why the assembly time is smaller with an increasing
number of processors be that more processors divide the malloc work, so the
total time is reduced?
    If so, I still have some questions:
    1. If the preallocation is not accurate, will the performance of the
assembly be affected? I mean, when processors receive via MPI the elements
that should be stored in their local part, will the new mallocs happen at
that point?
    2. I cannot give an accurate preallocation because of its large cost, so
is there any better way to preallocate in my situation?


Barry Smith <[email protected]> wrote on Sun, Jul 2, 2023 at 00:16:

I see no reason not to trust the times below; they seem reasonable. You get
more than 2 times speedup from 64 to 128 ranks and then about 1.38 from 128
to 256.

The total amount of data moved (number of messages times average length)
goes from 7.0e+03 * 2.8e+05 = 1.9600e+09 to 2.1060e+09 to 2.3000e+09, a
pretty moderate increase; but note that each time you double the number of
ranks you also substantially increase the network hardware available to move
the data, so one would hope for a good speedup.

Also, the load balance is very good, near 1. Often with assembly we see
things very out of balance, and it is difficult to get good speedup when the
balance is really off.

It looks like over 90% of the entire run time is coming from setting and
assembling the values? Also, the setting-values time dominates the assembly
time more with more ranks. Are you setting a single value at a time or a
collection of them? How big are the vectors?

Run all three cases with -info :vec to see some information about how many
mallocs were needed to hold the stashed vector entries.
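As an aside, the per-rank malloc counts discussed above can also be queried
programmatically. A small sketch, assuming a recent PETSc release; the
helper name is illustrative and not part of PETSc:

----------------
#include <petscmat.h>

/* Report the maximum number of mallocs any rank performed while values were
 * being inserted into A; call it after MatAssemblyEnd().  MAT_LOCAL instead
 * of MAT_GLOBAL_MAX would report only the calling rank. */
static PetscErrorCode ReportAssemblyMallocs(Mat A)
{
  MatInfo info;

  PetscFunctionBeginUser;
  PetscCall(MatGetInfo(A, MAT_GLOBAL_MAX, &info));
  PetscCall(PetscPrintf(PetscObjectComm((PetscObject)A),
                        "max mallocs among all processors: %g, unneeded preallocated nonzeros: %g\n",
                        info.mallocs, info.nz_unneeded));
  PetscFunctionReturn(PETSC_SUCCESS);
}
----------------

Running with -info during assembly prints similar information, including the
number of mallocs performed during MatSetValues().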
> On Jun 30, 2023, at 10:25 PM, Runfeng Jin <[email protected]> wrote:
>
> Hi,
>     Thanks for your reply. I tried to use PetscLogEvent(), and the result
> shows the same conclusion.
>     What I have done is:
> ----------------
> PetscLogEvent  Mat_assemble_event, Mat_setvalue_event, Mat_setAsse_event;
> PetscClassId   classid;
> PetscLogDouble user_event_flops;
>
> PetscClassIdRegister("Test assemble and set value", &classid);
> PetscLogEventRegister("Test only assemble", classid, &Mat_assemble_event);
> PetscLogEventRegister("Test only set values", classid, &Mat_setvalue_event);
> PetscLogEventRegister("Test both assemble and set values", classid, &Mat_setAsse_event);
>
> PetscLogEventBegin(Mat_setAsse_event, 0, 0, 0, 0);
> PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
> /* ... compute elements and call MatSetValues(); no assembly calls here ... */
> PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);
>
> PetscLogEventBegin(Mat_assemble_event, 0, 0, 0, 0);
> MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
> MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
> PetscLogEventEnd(Mat_assemble_event, 0, 0, 0, 0);
> PetscLogEventEnd(Mat_setAsse_event, 0, 0, 0, 0);
> ----------------
>     And the output is as follows. By the way, does PETSc record all of the
> time between PetscLogEventBegin and PetscLogEventEnd, or just the time
> spent in PETSc API calls?

It is all of the time.

> ----------------
>   Event    Count      Time (sec)         Flop                                     --- Global ---     --- Stage ----    Total
>            Max Ratio  Max         Ratio  Max       Ratio  Mess     AvgLen  Reduct  %T %F %M %L %R     %T %F %M %L %R   Mflop/s
>   64new    1   1.0    2.3775e+02  1.0    0.00e+00  0.0    6.2e+03  2.3e+04 9.0e+00 52  0  1  1  2     52  0  1  1  2   0
>   128new   1   1.0    6.9945e+01  1.0    0.00e+00  0.0    2.5e+04  1.1e+04 9.0e+00 30  0  1  1  2     30  0  1  1  2   0
>   256new   1   1.0    1.7445e+01  1.0    0.00e+00  0.0    9.9e+04  5.2e+03 9.0e+00 10  0  1  1  2     10  0  1  1  2   0
>
>   64:
>   only assemble   1  1.0  2.6596e+02  1.0  0.00e+00  0.0  7.0e+03  2.8e+05  1.1e+01  55  0  1  8  3   55  0  1  8  3   0
>   only setvalues  1  1.0  1.9987e+02  1.0  0.00e+00  0.0  0.0e+00  0.0e+00  0.0e+00  41  0  0  0  0   41  0  0  0  0   0
>   Test both       1  1.0  4.6580e+02  1.0  0.00e+00  0.0  7.0e+03  2.8e+05  1.5e+01  96  0  1  8  4   96  0  1  8  4   0
>
>   128:
>   only assemble   1  1.0  6.9718e+01  1.0  0.00e+00  0.0  2.6e+04  8.1e+04  1.1e+01  30  0  1  4  3   30  0  1  4  3   0
>   only setvalues  1  1.0  1.4438e+02  1.1  0.00e+00  0.0  0.0e+00  0.0e+00  0.0e+00  60  0  0  0  0   60  0  0  0  0   0
>   Test both       1  1.0  2.1417e+02  1.0  0.00e+00  0.0  2.6e+04  8.1e+04  1.5e+01  91  0  1  4  4   91  0  1  4  4   0
>
>   256:
>   only assemble   1  1.0  1.7482e+01  1.0  0.00e+00  0.0  1.0e+05  2.3e+04  1.1e+01  10  0  1  3  3   10  0  1  3  3   0
>   only setvalues  1  1.0  1.3717e+02  1.1  0.00e+00  0.0  0.0e+00  0.0e+00  0.0e+00  78  0  0  0  0   78  0  0  0  0   0
>   Test both       1  1.0  1.5475e+02  1.0  0.00e+00  0.0  1.0e+05  2.3e+04  1.5e+01  91  0  1  3  4   91  0  1  3  4   0
> ----------------
>
> Runfeng


Barry Smith <[email protected]> wrote on Fri, Jun 30, 2023 at 23:35:

You cannot look just at the VecAssemblyEnd() time; that will very likely
give the wrong impression of the total time it takes to put the values in.

You need to register a new event, put the PetscLogEventBegin() just before
you start generating the vector entries and calling VecSetValues(), and put
the PetscLogEventEnd() just after the VecAssemblyEnd(); this is the only way
to get an accurate accounting of the time.

  Barry
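A minimal sketch of that measurement pattern (the function name, event name,
and the entry generation below are illustrative, not taken from this thread):

----------------
#include <petscvec.h>

/* One user-registered event spanning both the generation/insertion of the
 * entries and the assembly, so the stash communication is included. */
static PetscErrorCode FillAndAssembleTimed(Vec x)
{
  PetscLogEvent fill_event;
  PetscInt      rstart, rend;

  PetscFunctionBeginUser;
  PetscCall(PetscLogEventRegister("VecFill+Assembly", VEC_CLASSID, &fill_event));
  PetscCall(PetscLogEventBegin(fill_event, 0, 0, 0, 0));
  PetscCall(VecGetOwnershipRange(x, &rstart, &rend));
  for (PetscInt i = rstart; i < rend; i++) {
    PetscScalar v = (PetscScalar)i;   /* stand-in for the real entry generation;
                                         off-process rows would go to the stash */
    PetscCall(VecSetValues(x, 1, &i, &v, INSERT_VALUES));
  }
  PetscCall(VecAssemblyBegin(x));
  PetscCall(VecAssemblyEnd(x));
  PetscCall(PetscLogEventEnd(fill_event, 0, 0, 0, 0)); /* ends only after VecAssemblyEnd() */
  PetscFunctionReturn(PETSC_SUCCESS);
}
----------------

With -log_view, the "VecFill+Assembly" row then accounts for the entire
fill-plus-assembly time, not just the VecAssemblyEnd() portion.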
> On Jun 30, 2023, at 11:21 AM, Runfeng Jin <[email protected]> wrote:
>
> Hello!
>
> When I use PETSc to build an SBAIJ matrix, I find a strange thing: when I
> increase the number of processors, the assembly time becomes smaller. All
> of these runs use exactly the same matrix. The assembly time mainly arises
> from message passing, because I use a dynamic workload, so it is random
> which elements are computed by which processor.
>
> Intuitively, with more processors it should be more likely that a processor
> computes elements that are stored on other processors. But from the output
> of -log_view, it seems that with more processors each processor computes
> more of the elements stored locally (I infer this from the smaller total
> amount of passed messages with more processors).
>
> What could cause this to happen? Thank you!
>
> The following is the output of -log_view for 64/128/256 processors. Every
> row is the time profile of VecAssemblyEnd.
>
> ------------------------------------------------------------------------------------------------------------------------
>   processors  Count      Time (sec)         Flop                                     --- Global ---     --- Stage ----    Total
>               Max Ratio  Max         Ratio  Max       Ratio  Mess     AvgLen  Reduct  %T %F %M %L %R     %T %F %M %L %R   Mflop/s
>   64          1   1.0    2.3775e+02  1.0    0.00e+00  0.0    6.2e+03  2.3e+04 9.0e+00 52  0  1  1  2     52  0  1  1  2   0
>   128         1   1.0    6.9945e+01  1.0    0.00e+00  0.0    2.5e+04  1.1e+04 9.0e+00 30  0  1  1  2     30  0  1  1  2   0
>   256         1   1.0    1.7445e+01  1.0    0.00e+00  0.0    9.9e+04  5.2e+03 9.0e+00 10  0  1  1  2     10  0  1  1  2   0
>
> Runfeng Jin


-- 
What most experimenters take for granted before they begin their experiments
is infinitely more interesting than any results to which their experiments
lead.
-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/
