Thank you for all your help!

Runfeng
On Mon, Jul 3, 2023 at 22:03, Matthew Knepley <[email protected]> wrote:

> On Mon, Jul 3, 2023 at 9:56 AM Runfeng Jin <[email protected]> wrote:
>
>> Hi, impressive performance!
>>     I use the newest version of petsc (release branch), and it almost
>> removes all of the assembly and stash time on large processor counts
>> (assembly time 64-4s/128-2s/256-0.2s, stash time all below 2s). For zero
>> programming cost, that is really incredible.
>>     The older code has a regular arrangement of the number of nonzero
>> elements across rows, so I can make a good rough preallocation. And from
>> the data, dedicatedly arranging the data and roughly acquiring the maximum
>> number of nonzero elements per row can give better performance than the
>> new version without preallocation. However, in practice I will use the
>> newer version without preallocation, because: 1) it takes less programming
>> effort for nearly the same good performance, 2) it has good memory usage
>> (I see no unneeded memory after assembly), 3) dedicated preallocation is
>> usually not easy and causes extra time cost.
>>     Maybe it would be better to leave some room for the user to give a
>> slight hint for the preallocation and thus get better performance, but I
>> have no idea how such a hint should be given.
>>     And I am very curious about how petsc achieves this. How can it know
>> nothing in advance and still achieve such good performance with no wasted
>> memory? Could you give an explanation of this?
>
> We use a hash table to store the nonzeros on the fly, and then convert to
> packed storage on assembly.
>
>   Thanks,
>
>      Matt
>
>> assemble time:
>> version\processors    4        8       16      32     64    128   256
>> old                   14677s   4694s   1124s   572s   38s   8s    2s
>> new                   50s      28s     15s     7.8s   4s    2s    0.4s
>> older                 27s      24s     19s     12s    14s   -     -
>>
>> stash time (max among all processors):
>> version\processors    4        8       16      32     64    128   256
>> old                   3145s    2554s   673s    329s   201s  142s  138s
>> new                   2s       1s      ~0s     ~0s    ~0s   ~0s   ~0s
>> older                 10s      73s     18s     5s     1s    -     -
>>
>> old:   my poor preallocation
>> new:   newest version of petsc that does no preallocation
>> older: the best preallocation version of my code
>>
>> Runfeng
>>
>> On Mon, Jul 3, 2023 at 12:19, Barry Smith <[email protected]> wrote:
>>
>>>   The main branch of PETSc now supports filling sparse matrices without
>>> providing any preallocation information.
>>>
>>>   You can give it a try. Use your current fastest code but just remove
>>> ALL the preallocation calls. I would be interested in what kind of
>>> performance you get compared to your best current performance.
>>>
>>>   Barry
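A minimal sketch of that no-preallocation workflow (for illustration only; the fill loop and values are placeholders, not code from this thread, and a recent PETSc with the hash-table assembly described above is assumed; the global size is the 283234 mentioned later in the thread):

----------------
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  PetscInt    Istart, Iend, i, col;
  PetscScalar v = 1.0;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 283234, 283234));
  PetscCall(MatSetFromOptions(A));
  PetscCall(MatSetUp(A));
  /* Note: no MatMPIAIJSetPreallocation()/MatXAIJSetPreallocation() calls here;
     with recent PETSc the inserted values are stashed in a hash table and
     packed into the final storage during assembly. */
  PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
  for (i = Istart; i < Iend; i++) { /* placeholder fill: one diagonal entry per row */
    col = i;
    PetscCall(MatSetValues(A, 1, &i, 1, &col, &v, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}
----------------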
>>> On Jul 2, 2023, at 11:24 PM, Runfeng Jin <[email protected]> wrote:
>>>
>>> Hi! Good advice!
>>>     I set values with the MatSetValues() API, which sets one part of a
>>> row at a time (I use a kind of tiling technique, so I cannot get all the
>>> values of a row at once).
>>>     I tested the number of mallocs in these three cases. The number of
>>> mallocs decreases as the number of processors increases, yet all of them
>>> are very large (the matrix is 283234*283234, as can be seen below). This
>>> should be due to the poor preallocation. I use a rough preallocation:
>>> every processor counts the number of nonzero elements of its first 10
>>> rows and uses the largest count to preallocate memory for all local rows.
>>> It seems that this does not work well.
>>>
>>> number_of_processors   number_of_max_mallocs_among_all_processors
>>> 64                     20000
>>> 128                    17000
>>> 256                    11000
>>>
>>>     I changed the way I preallocate: I take 100 evenly spaced rows in
>>> every local matrix and use the largest count to preallocate memory for
>>> all local rows. Now the assembly time is reduced to a very small value.
>>>
>>> number_of_processors   number_of_max_mallocs_among_all_processors
>>> 64                     3000
>>> 128                    700
>>> 256                    500
>>>
>>> Event  Count      Time (sec)      Flop                                --- Global ---   --- Stage ----   Total
>>>        Max Ratio  Max       Ratio  Max  Ratio  Mess     AvgLen   Reduct   %T %F %M %L %R  %T %F %M %L %R  Mflop/s
>>> 64     1   1.0  3.8999e+01  1.0  0.00e+00 0.0  7.1e+03  2.9e+05  1.1e+01  15  0  1  8  3  15  0  1  8  3      0
>>> 128    1   1.0  8.5714e+00  1.0  0.00e+00 0.0  2.6e+04  8.1e+04  1.1e+01   5  0  1  4  3   5  0  1  4  3      0
>>> 256    1   1.0  2.5512e+00  1.0  0.00e+00 0.0  1.0e+05  2.3e+04  1.1e+01   2  0  1  3  3   2  0  1  3  3      0
>>>
>>>     So could the reason "why the assembly time is smaller with an
>>> increasing number of processors" be that more processors divide the
>>> malloc work, so the total time is reduced?
>>>     If so, I still have some questions:
>>>     1. If the preallocation is not accurate, will the performance of the
>>> assembly be affected? I mean, when processors receive, via MPI, the
>>> elements that should be stored locally, will the new mallocs happen at
>>> that point?
>>>     2. I cannot give an accurate preallocation because of its large
>>> cost, so is there any better way to preallocate for my situation?
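For reference, a rough sketch of that kind of sampling-based preallocation (illustrative only; it assumes an MPIAIJ matrix A whose local rows are [Istart, Iend), and count_nnz_of_row() stands in for an application-specific counting routine, not a PETSc call):

----------------
/* Sample ~100 evenly spaced local rows, take the largest nonzero count,
   and use it as a uniform per-row upper bound for preallocation. */
PetscInt nlocal = Iend - Istart, samples = 100, nz_max = 1;
for (PetscInt s = 0; s < samples && nlocal > 0; s++) {
  PetscInt row = Istart + (s * nlocal) / samples; /* evenly spaced sample row */
  PetscInt nz  = count_nnz_of_row(row);           /* hypothetical user routine */
  if (nz > nz_max) nz_max = nz;
}
/* The same bound is used for the diagonal and off-diagonal blocks; the NULL
   arguments mean no per-row arrays are supplied. */
PetscCall(MatMPIAIJSetPreallocation(A, nz_max, NULL, nz_max, NULL));
PetscCall(MatSeqAIJSetPreallocation(A, nz_max, NULL)); /* covers a 1-rank run */
----------------

If the sampled rows underestimate the true maximum, MatSetValues() still works; rows that overflow their preallocated space simply trigger the extra mallocs that the counts above are measuring.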
>>> On Sun, Jul 2, 2023 at 00:16, Barry Smith <[email protected]> wrote:
>>>
>>>>   I see no reason not to trust the times below; they seem reasonable.
>>>> You get more than a 2x speedup from 64 to 128 and then about 1.38x from
>>>> 128 to 256.
>>>>
>>>>   The total amount of data moved (number of messages times average
>>>> length) goes from 7.0e+03 * 2.8e+05 = 1.9600e+09 to 2.1060e+09 to
>>>> 2.3000e+09. A pretty moderate increase in data, but note that each time
>>>> you double the number of ranks, you also substantially increase the
>>>> network hardware available to move data, so one would hope for a good
>>>> speedup.
>>>>
>>>>   Also, the load balance is very good, near 1. Often with assembly we
>>>> see it very out of balance, and it is difficult to get good speedup when
>>>> the balance is really off.
>>>>
>>>>   It looks like over 90% of the entire run time is coming from setting
>>>> and assembling the values? Also, the setting-values time dominates the
>>>> assembly time more with more ranks. Are you setting a single value at a
>>>> time or a collection of them? How big are the vectors?
>>>>
>>>>   Run all three cases with -info :vec to see some information about how
>>>> many mallocs were needed to hold the stashed vector entries.
>>>>
>>>> On Jun 30, 2023, at 10:25 PM, Runfeng Jin <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>     Thanks for your reply. I tried using PetscLogEvent(), and the result
>>>> shows the same conclusion.
>>>>     What I have done is:
>>>> ----------------
>>>> PetscLogEvent  Mat_assemble_event, Mat_setvalue_event, Mat_setAsse_event;
>>>> PetscClassId   classid;
>>>> PetscLogDouble user_event_flops;
>>>> PetscClassIdRegister("Test assemble and set value", &classid);
>>>> PetscLogEventRegister("Test only assemble", classid, &Mat_assemble_event);
>>>> PetscLogEventRegister("Test only set values", classid, &Mat_setvalue_event);
>>>> PetscLogEventRegister("Test both assemble and set values", classid, &Mat_setAsse_event);
>>>> PetscLogEventBegin(Mat_setAsse_event, 0, 0, 0, 0);
>>>> PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
>>>> /* ...compute elements and use MatSetValues; no call to assembly here... */
>>>> PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);
>>>>
>>>> PetscLogEventBegin(Mat_assemble_event, 0, 0, 0, 0);
>>>> MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>>> MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>>>> PetscLogEventEnd(Mat_assemble_event, 0, 0, 0, 0);
>>>> PetscLogEventEnd(Mat_setAsse_event, 0, 0, 0, 0);
>>>> ----------------
>>>>
>>>>     And the output is as follows. By the way, does petsc record all of
>>>> the time between PetscLogEventBegin and PetscLogEventEnd, or just the
>>>> time spent in the petsc API?
>>>>
>>>>   It is all of the time.
>>>>
>>>> ----------------
>>>> Event            Count      Time (sec)      Flop                                --- Global ---   --- Stage ----   Total
>>>>                  Max Ratio  Max        Ratio  Max  Ratio  Mess     AvgLen   Reduct   %T %F %M %L %R  %T %F %M %L %R  Mflop/s
>>>> 64new            1   1.0  *2.3775e+02* 1.0  0.00e+00 0.0  6.2e+03  2.3e+04  9.0e+00  52  0  1  1  2  52  0  1  1  2      0
>>>> 128new           1   1.0  *6.9945e+01* 1.0  0.00e+00 0.0  2.5e+04  1.1e+04  9.0e+00  30  0  1  1  2  30  0  1  1  2      0
>>>> 256new           1   1.0  *1.7445e+01* 1.0  0.00e+00 0.0  9.9e+04  5.2e+03  9.0e+00  10  0  1  1  2  10  0  1  1  2      0
>>>>
>>>> 64:
>>>> only assemble    1   1.0  *2.6596e+02* 1.0  0.00e+00 0.0  7.0e+03  2.8e+05  1.1e+01  55  0  1  8  3  55  0  1  8  3      0
>>>> only setvalues   1   1.0  *1.9987e+02* 1.0  0.00e+00 0.0  0.0e+00  0.0e+00  0.0e+00  41  0  0  0  0  41  0  0  0  0      0
>>>> Test both        1   1.0  *4.6580e+02* 1.0  0.00e+00 0.0  7.0e+03  2.8e+05  1.5e+01  96  0  1  8  4  96  0  1  8  4      0
>>>>
>>>> 128:
>>>> only assemble    1   1.0  *6.9718e+01* 1.0  0.00e+00 0.0  2.6e+04  8.1e+04  1.1e+01  30  0  1  4  3  30  0  1  4  3      0
>>>> only setvalues   1   1.0  *1.4438e+02* 1.1  0.00e+00 0.0  0.0e+00  0.0e+00  0.0e+00  60  0  0  0  0  60  0  0  0  0      0
>>>> Test both        1   1.0  *2.1417e+02* 1.0  0.00e+00 0.0  2.6e+04  8.1e+04  1.5e+01  91  0  1  4  4  91  0  1  4  4      0
>>>>
>>>> 256:
>>>> only assemble    1   1.0  *1.7482e+01* 1.0  0.00e+00 0.0  1.0e+05  2.3e+04  1.1e+01  10  0  1  3  3  10  0  1  3  3      0
>>>> only setvalues   1   1.0  *1.3717e+02* 1.1  0.00e+00 0.0  0.0e+00  0.0e+00  0.0e+00  78  0  0  0  0  78  0  0  0  0      0
>>>> Test both        1   1.0  *1.5475e+02* 1.0  0.00e+00 0.0  1.0e+05  2.3e+04  1.5e+01  91  0  1  3  4  91  0  1  3  4      0
>>>>
>>>> Runfeng
>>>>
>>>> On Fri, Jun 30, 2023 at 23:35, Barry Smith <[email protected]> wrote:
>>>>
>>>>>   You cannot look just at the VecAssemblyEnd() time; that will very
>>>>> likely give the wrong impression of the total time it takes to put the
>>>>> values in.
>>>>>
>>>>>   You need to register a new event, put the PetscLogEventBegin() just
>>>>> before you start generating the vector entries and calling
>>>>> VecSetValues(), and put the PetscLogEventEnd() just after the
>>>>> VecAssemblyEnd(); this is the only way to get an accurate accounting of
>>>>> the time.
>>>>>
>>>>>   Barry
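A sketch of that measurement pattern for the vector case (illustrative only; the vector v, the event name, and some_value() are placeholders, not code from this thread):

----------------
PetscLogEvent vec_fill_event;
PetscInt      rstart, rend;

PetscCall(PetscLogEventRegister("VecFillAndAssemble", VEC_CLASSID, &vec_fill_event));
PetscCall(VecGetOwnershipRange(v, &rstart, &rend));

PetscCall(PetscLogEventBegin(vec_fill_event, 0, 0, 0, 0));
for (PetscInt i = rstart; i < rend; i++) {  /* placeholder entry generation */
  PetscScalar val = some_value(i);          /* hypothetical user routine */
  PetscCall(VecSetValues(v, 1, &i, &val, INSERT_VALUES));
}
PetscCall(VecAssemblyBegin(v));
PetscCall(VecAssemblyEnd(v));
PetscCall(PetscLogEventEnd(vec_fill_event, 0, 0, 0, 0)); /* event spans generation + assembly */
----------------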
>>>>> > On Jun 30, 2023, at 11:21 AM, Runfeng Jin <[email protected]> wrote:
>>>>> >
>>>>> > Hello!
>>>>> >
>>>>> > When I use PETSc to build an sbaij matrix, I find a strange thing: when
>>>>> > I increase the number of processors, the assembly time becomes smaller,
>>>>> > even though it is exactly the same matrix in every case. The assembly
>>>>> > time mainly arises from message passing, because I use a dynamic
>>>>> > workload, so it is random which elements are computed by which
>>>>> > processor.
>>>>> > Intuitively, with more processors it should be more likely that a
>>>>> > processor computes elements that are stored on other processors. But
>>>>> > from the output of log_view, it seems that with more processors the
>>>>> > processors compute more of the elements stored locally (inferred from
>>>>> > the smaller total amount of passed messages with more processors).
>>>>> >
>>>>> > What could cause this to happen? Thank you!
>>>>> >
>>>>> > Following is the output of log_view for 64/128/256 processors. Every
>>>>> > row is the time profile of VecAssemblyEnd.
>>>>> >
>>>>> > ------------------------------------------------------------------------------------------------------------------------
>>>>> > processors  Count      Time (sec)      Flop                                --- Global ---   --- Stage ----   Total
>>>>> >             Max Ratio  Max        Ratio  Max  Ratio  Mess     AvgLen   Reduct   %T %F %M %L %R  %T %F %M %L %R  Mflop/s
>>>>> > 64          1   1.0  2.3775e+02  1.0  0.00e+00 0.0  6.2e+03  2.3e+04  9.0e+00  52  0  1  1  2  52  0  1  1  2      0
>>>>> > 128         1   1.0  6.9945e+01  1.0  0.00e+00 0.0  2.5e+04  1.1e+04  9.0e+00  30  0  1  1  2  30  0  1  1  2      0
>>>>> > 256         1   1.0  1.7445e+01  1.0  0.00e+00 0.0  9.9e+04  5.2e+03  9.0e+00  10  0  1  1  2  10  0  1  1  2      0
>>>>> >
>>>>> > Runfeng Jin
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
