Hi,

> We use a hash table to store the nonzeros on the fly, and then convert to
> packed storage on assembly.

Could you tell me which file implements this?

Runfeng

Runfeng Jin <[email protected]> wrote on Mon, Jul 3, 2023 at 22:05:

> Thank you for all your help!
>
> Runfeng
>
> Matthew Knepley <[email protected]> wrote on Mon, Jul 3, 2023 at 22:03:
>
>> On Mon, Jul 3, 2023 at 9:56 AM Runfeng Jin <[email protected]> wrote:
>>
>>> Hi, impressive performance!
>>> I used the newest version of PETSc (release branch), and it almost
>>> eliminates all assembly and stash time at large processor counts (assembly
>>> time 64-4s/128-2s/256-0.2s, stash time all below 2s). For zero programming
>>> cost, it is really incredible.
>>> The older code has a regular pattern in the number of nonzero
>>> elements across rows, so I can do a good rough preallocation. And
>>> from the data, carefully arranging the data and roughly estimating the
>>> maximum number of nonzero elements per row can perform better than the
>>> new version without preallocation. However, in practice I will use the
>>> newer version without preallocation because: 1) it takes less programming
>>> effort and gives nearly the same good performance; 2) memory usage is good
>>> (I see no unneeded memory after assembly); 3) careful preallocation is
>>> usually not easy and costs extra time.
>>> Maybe it would be better to leave some room for the user to give a
>>> slight hint for the preallocation and thus get better performance,
>>> but I have no idea how to provide such a hint.
>>> And I am very curious about how PETSc achieves this. How can it
>>> know nothing in advance but achieve such good performance, with no wasted
>>> memory? Could you explain this?
>>>
>>
>> We use a hash table to store the nonzeros on the fly, and then convert to
>> packed storage on assembly.
>>
>> Thanks,
>>
>> Matt
>>
>>
>>> Assembly time:
>>> version\processors       4       8      16      32      64     128     256
>>> old                 14677s   4694s   1124s    572s     38s      8s      2s
>>> new                    50s     28s     15s    7.8s      4s      2s    0.4s
>>> older                  27s     24s     19s     12s     14s       -       -
>>>
>>> Stash time (max among all processors):
>>> version\processors       4       8      16      32      64     128     256
>>> old                  3145s   2554s    673s    329s    201s    142s    138s
>>> new                     2s      1s     ~0s     ~0s     ~0s     ~0s     ~0s
>>> older                  10s     73s     18s      5s      1s       -       -
>>>
>>> old: my poor preallocation
>>> new: newest version of PETSc, with no preallocation
>>> older: the best-preallocated version of my code
>>>
>>>
>>> Runfeng
>>>
>>> Barry Smith <[email protected]> wrote on Mon, Jul 3, 2023 at 12:19:
>>>
>>>>
>>>> The main branch of PETSc now supports filling sparse matrices
>>>> without providing any preallocation information.
>>>>
>>>> You can give it a try. Use your current fastest code but just remove
>>>> ALL the preallocation calls. I would be interested in what kind of
>>>> performance you get compared to your best current performance.
>>>>
>>>> Barry
>>>>
>>>>
>>>>
>>>> On Jul 2, 2023, at 11:24 PM, Runfeng Jin <[email protected]> wrote:
>>>>
>>>> Hi! Good advice!
>>>> I set values with the MatSetValues() API, which sets one part of a row
>>>> at a time (I use a kind of tiling technique, so I cannot get all values
>>>> of a row at once).
>>>> I tested the number of mallocs in these three cases. The number of
>>>> mallocs decreases as the number of processors increases, and all of them
>>>> are very large (the matrix is 283234*283234, as can be seen in the
>>>> following). This should be due to the inadequate preallocation. I used a
>>>> rough preallocation in which every processor counts the number of nonzero
>>>> elements for the first 10 rows and uses the largest count to preallocate
>>>> memory for all local rows. It seems that this does not work well.
>>>>
>>>> number_of_processors   number_of_max_mallocs_among_all_processors
>>>> 64                     20000
>>>> 128                    17000
>>>> 256                    11000
>>>>
>>>> I changed my way of preallocating.
>>>> I now take 100 evenly spaced rows in every local matrix and use the
>>>> largest count to preallocate memory for all local rows. Now the assembly
>>>> time is reduced to a very small value.
>>>>
>>>> number_of_processors   number_of_max_mallocs_among_all_processors
>>>> 64                     3000
>>>> 128                    700
>>>> 256                    500
>>>>
>>>> Event            Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total
>>>>                Max Ratio     Max    Ratio    Max   Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>> 64               1 1.0 3.8999e+01 1.0 0.00e+00 0.0 7.1e+03 2.9e+05 1.1e+01  15  0  1  8  3  15  0  1  8  3  0
>>>> 128              1 1.0 8.5714e+00 1.0 0.00e+00 0.0 2.6e+04 8.1e+04 1.1e+01   5  0  1  4  3   5  0  1  4  3  0
>>>> 256              1 1.0 2.5512e+00 1.0 0.00e+00 0.0 1.0e+05 2.3e+04 1.1e+01   2  0  1  3  3   2  0  1  3  3  0
>>>>
>>>> So the reason why the assembly time gets smaller as the number of
>>>> processors increases may be that more processors share the malloc work,
>>>> so that the total time is reduced?
>>>> If so, I still have some questions:
>>>> 1. If the preallocation is not accurate, will the performance of the
>>>> assembly be affected? I mean, when processors receive via MPI the elements
>>>> that should be stored locally, will new mallocs happen at that point?
>>>> 2. I cannot give an accurate preallocation because of its large cost, so
>>>> is there a better way to preallocate in my situation?
>>>>
>>>>
>>>>
>>>> Barry Smith <[email protected]> wrote on Sun, Jul 2, 2023 at 00:16:
>>>>
>>>>>
>>>>> I see no reason not to trust the times below; they seem reasonable.
>>>>> You get more than a 2 times speedup from 64 to 128 and then about 1.38
>>>>> from 128 to 256.
>>>>>
>>>>> The total amount of data moved (number of messages moved times
>>>>> average length) goes from 7.0e+03 * 2.8e+05 = 1.9600e+09 to 2.1060e+09
>>>>> to 2.3000e+09.
>>>>> A pretty moderate increase in the amount of data, but note that
>>>>> each time you double the number of ranks, you also substantially increase
>>>>> the network hardware available to move data, so one would hope for a
>>>>> good speedup.
>>>>>
>>>>> Also, the load balance is very good, near 1. Often with assembly,
>>>>> we see it very out of balance, and it is difficult to get good speedup
>>>>> when the balance is really off.
>>>>>
>>>>> It looks like over 90% of the entire run time is coming from
>>>>> setting and assembling the values? Also, the time for setting values
>>>>> dominates the assembly time more as the number of ranks grows. Are you
>>>>> setting a single value at a time or a collection of them? How big are
>>>>> the vectors?
>>>>>
>>>>> Run all three cases with -info :vec to see some information about
>>>>> how many mallocs were made to hold the stashed vector entries.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Jun 30, 2023, at 10:25 PM, Runfeng Jin <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>> Thanks for your reply. I tried using PetscLogEvent(), and the
>>>>> result shows the same conclusion.
>>>>> What I have done is:
>>>>> ----------------
>>>>> PetscLogEvent Mat_assemble_event, Mat_setvalue_event,
>>>>>               Mat_setAsse_event;
>>>>> PetscClassId classid;
>>>>> PetscLogDouble user_event_flops;
>>>>> PetscClassIdRegister("Test assemble and set value", &classid);
>>>>> PetscLogEventRegister("Test only assemble", classid,
>>>>>                       &Mat_assemble_event);
>>>>> PetscLogEventRegister("Test only set values", classid,
>>>>>                       &Mat_setvalue_event);
>>>>> PetscLogEventRegister("Test both assemble and set values",
>>>>>                       classid, &Mat_setAsse_event);
>>>>> PetscLogEventBegin(Mat_setAsse_event, 0, 0, 0, 0);
>>>>> PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
>>>>> /* ...compute elements and call MatSetValues() here... */
>>>>> /* no assembly calls in this section */
>>>>> PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);
>>>>>
>>>>> PetscLogEventBegin(Mat_assemble_event, 0, 0, 0, 0);
>>>>> MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>>>> MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>>>>> PetscLogEventEnd(Mat_assemble_event, 0, 0, 0, 0);
>>>>> PetscLogEventEnd(Mat_setAsse_event, 0, 0, 0, 0);
>>>>> ----------------
>>>>>
>>>>> The output is as follows. By the way, does PETSc record all the time
>>>>> between PetscLogEventBegin and PetscLogEventEnd, or just the time spent
>>>>> in PETSc API calls?
>>>>>
>>>>>
>>>>> It is all of the time.
>>>>>
>>>>> ----------------
>>>>> Event            Count      Time (sec)       Flop                             --- Global ---  --- Stage ----  Total
>>>>>                Max Ratio    *Max*    Ratio    Max   Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>> 64new            1 1.0 *2.3775e+02* 1.0 0.00e+00 0.0 6.2e+03 2.3e+04 9.0e+00  52  0  1  1  2  52  0  1  1  2  0
>>>>> 128new           1 1.0 *6.9945e+01* 1.0 0.00e+00 0.0 2.5e+04 1.1e+04 9.0e+00  30  0  1  1  2  30  0  1  1  2  0
>>>>> 256new           1 1.0 *1.7445e+01* 1.0 0.00e+00 0.0 9.9e+04 5.2e+03 9.0e+00  10  0  1  1  2  10  0  1  1  2  0
>>>>>
>>>>> 64:
>>>>> only assemble    1 1.0 *2.6596e+02* 1.0 0.00e+00 0.0 7.0e+03 2.8e+05 1.1e+01  55  0  1  8  3  55  0  1  8  3  0
>>>>> only setvalues   1 1.0 *1.9987e+02* 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  41  0  0  0  0  41  0  0  0  0  0
>>>>> Test both        1 1.0 *4.6580e+02* 1.0 0.00e+00 0.0 7.0e+03 2.8e+05 1.5e+01  96  0  1  8  4  96  0  1  8  4  0
>>>>>
>>>>> 128:
>>>>> only assemble    1 1.0 *6.9718e+01* 1.0 0.00e+00 0.0 2.6e+04 8.1e+04 1.1e+01  30  0  1  4  3  30  0  1  4  3  0
>>>>> only setvalues   1 1.0 *1.4438e+02* 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  60  0  0  0  0  60  0  0  0  0  0
>>>>> Test both        1 1.0 *2.1417e+02* 1.0 0.00e+00 0.0 2.6e+04 8.1e+04 1.5e+01  91  0  1  4  4  91  0  1  4  4  0
>>>>>
>>>>> 256:
>>>>> only assemble    1 1.0 *1.7482e+01* 1.0 0.00e+00 0.0 1.0e+05 2.3e+04 1.1e+01  10  0  1  3  3  10  0  1  3  3  0
>>>>> only setvalues   1 1.0 *1.3717e+02* 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  78  0  0  0  0  78  0  0  0  0  0
>>>>> Test both        1 1.0 *1.5475e+02* 1.0 0.00e+00 0.0 1.0e+05 2.3e+04 1.5e+01  91  0  1  3  4  91  0  1  3  4  0
>>>>>
>>>>>
>>>>>
>>>>> Runfeng
>>>>>
>>>>> Barry Smith <[email protected]> wrote on Fri, Jun 30, 2023 at 23:35:
>>>>>
>>>>>>
>>>>>> You cannot look just at the VecAssemblyEnd() time; that will very
>>>>>> likely give the wrong impression of the total time it takes to put the
>>>>>> values in.
>>>>>>
>>>>>> You need to register a new Event and put a PetscLogEventBegin() just
>>>>>> before you start generating the vector entries and calling
>>>>>> VecSetValues(), and put the PetscLogEventEnd() just after the
>>>>>> VecAssemblyEnd(). This is the only way to get an accurate accounting of
>>>>>> the time.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>
>>>>>> > On Jun 30, 2023, at 11:21 AM, Runfeng Jin <[email protected]> wrote:
>>>>>> >
>>>>>> > Hello!
>>>>>> >
>>>>>> > When I use PETSc to build an sbaij matrix, I find a strange thing.
>>>>>> > When I increase the number of processors, the assembly time becomes
>>>>>> > smaller. All runs use exactly the same matrix. The assembly time
>>>>>> > mainly arises from message passing, because I use a dynamic workload
>>>>>> > in which it is random which elements are computed by which processor.
>>>>>> > Intuitively, with more processors it should be more likely that a
>>>>>> > processor computes elements stored on other processors. But from the
>>>>>> > output of log_view, it seems that with more processors, the processors
>>>>>> > compute more elements that are stored locally (I infer this from the
>>>>>> > fact that, with more processors, the total amount of passed messages
>>>>>> > is smaller).
>>>>>> >
>>>>>> > What could cause this? Thank you!
>>>>>> >
>>>>>> >
>>>>>> > Following is the output of log_view for 64\128\256 processors.
>>>>>> > Every row is the time profile of VecAssemblyEnd.
>>>>>> >
>>>>>> >
>>>>>> > ------------------------------------------------------------------------------------------------------------------------
>>>>>> > processors       Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total
>>>>>> >                Max Ratio     Max    Ratio    Max   Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>>> > 64               1 1.0 2.3775e+02 1.0 0.00e+00 0.0 6.2e+03 2.3e+04 9.0e+00  52  0  1  1  2  52  0  1  1  2  0
>>>>>> > 128              1 1.0 6.9945e+01 1.0 0.00e+00 0.0 2.5e+04 1.1e+04 9.0e+00  30  0  1  1  2  30  0  1  1  2  0
>>>>>> > 256              1 1.0 1.7445e+01 1.0 0.00e+00 0.0 9.9e+04 5.2e+03 9.0e+00  10  0  1  1  2  10  0  1  1  2  0
>>>>>> >
>>>>>> > Runfeng Jin
>>>>>>
>>>>>>
>>>>>
>>>>
>>
>> --
>> What most experimenters take for granted before they begin their
>> experiments is infinitely more interesting than any results to which their
>> experiments lead.
>> -- Norbert Wiener
>>
>> https://www.cse.buffalo.edu/~knepley/
>> <http://www.cse.buffalo.edu/~knepley/>
>>
>
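
For readers following the "hash table on the fly, then packed storage on
assembly" remark in this thread, here is a minimal sketch of that general idea
in plain C. It is only an illustration under stated assumptions and is not
PETSc's implementation: the names HashMat, Csr, hm_init, hm_add and hm_to_csr
are hypothetical, the table uses simple linear probing, and the packed format
is assumed to be CSR with sorted column indices. Entries are accumulated by
(row, col) key without any preallocation; at "assembly" the exact number of
nonzeros per row is known, so the packed arrays are allocated once with no
wasted space.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy illustration only; all names here are hypothetical, not PETSc API. */
typedef struct { int row, col, used; double val; } Slot;
typedef struct { Slot *slots; size_t cap, count; } HashMat; /* cap: power of two */
typedef struct { int nrows; int *row_ptr, *col_idx; double *vals; } Csr;

static size_t hash_key(int row, int col, size_t cap)
{
  unsigned long long h = ((unsigned long long)(unsigned)row << 32) | (unsigned)col;
  h *= 0x9E3779B97F4A7C15ULL;            /* multiplicative hashing */
  return (size_t)(h >> 32) & (cap - 1);
}

static void hm_init(HashMat *m, size_t cap)
{
  assert(cap >= 2 && (cap & (cap - 1)) == 0);
  m->slots = calloc(cap, sizeof(Slot));
  assert(m->slots != NULL);
  m->cap = cap;
  m->count = 0;
}

static void hm_add(HashMat *m, int row, int col, double val);

/* Double the table and reinsert every stored entry. */
static void hm_grow(HashMat *m)
{
  HashMat bigger;
  hm_init(&bigger, m->cap * 2);
  for (size_t i = 0; i < m->cap; i++)
    if (m->slots[i].used)
      hm_add(&bigger, m->slots[i].row, m->slots[i].col, m->slots[i].val);
  free(m->slots);
  *m = bigger;
}

/* Add val to entry (row, col), creating the entry if it is new. */
static void hm_add(HashMat *m, int row, int col, double val)
{
  if (2 * (m->count + 1) > m->cap) hm_grow(m);   /* keep load factor <= 1/2 */
  size_t i = hash_key(row, col, m->cap);
  while (m->slots[i].used && (m->slots[i].row != row || m->slots[i].col != col))
    i = (i + 1) & (m->cap - 1);                  /* linear probing */
  if (!m->slots[i].used) {
    m->slots[i].row = row;
    m->slots[i].col = col;
    m->slots[i].val = 0.0;
    m->slots[i].used = 1;
    m->count++;
  }
  m->slots[i].val += val;
}

/* "Assembly": convert the hash table to CSR with sorted columns per row. */
static Csr hm_to_csr(const HashMat *m, int nrows)
{
  Csr a;
  size_t nnz = m->count ? m->count : 1;          /* avoid zero-size malloc */
  int *fill = calloc((size_t)nrows + 1, sizeof(int));
  a.nrows = nrows;
  a.row_ptr = calloc((size_t)nrows + 1, sizeof(int));
  a.col_idx = malloc(nnz * sizeof(int));
  a.vals = malloc(nnz * sizeof(double));
  assert(fill && a.row_ptr && a.col_idx && a.vals);
  /* Pass 1: count entries per row, then prefix-sum into row offsets. */
  for (size_t i = 0; i < m->cap; i++)
    if (m->slots[i].used) {
      assert(m->slots[i].row >= 0 && m->slots[i].row < nrows);
      a.row_ptr[m->slots[i].row + 1]++;
    }
  for (int r = 0; r < nrows; r++) a.row_ptr[r + 1] += a.row_ptr[r];
  /* Pass 2: place each entry in its row, keeping columns sorted. */
  for (size_t i = 0; i < m->cap; i++) {
    const Slot *s = &m->slots[i];
    if (!s->used) continue;
    int start = a.row_ptr[s->row];
    int k = start + fill[s->row]++;
    while (k > start && a.col_idx[k - 1] > s->col) {
      a.col_idx[k] = a.col_idx[k - 1];
      a.vals[k] = a.vals[k - 1];
      k--;
    }
    a.col_idx[k] = s->col;
    a.vals[k] = s->val;
  }
  free(fill);
  return a;
}
```

A caller would run hm_init() once, call hm_add() for each computed entry in
any order (repeated (row, col) pairs are summed), and call hm_to_csr() at the
end. The sketch omits freeing the arrays and any parallel stashing of off-rank
entries, which a real implementation has to handle.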
