The main branch of PETSc now supports filling sparse matrices without providing any preallocation information.
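For reference, a minimal sketch of such a no-preallocation fill (the diagonal-only loop and the values below are illustrative, not taken from the thread; only the global size 283234 is from the discussion):

----------------
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  PetscInt    i, Istart, Iend, N = 283234; /* global size mentioned later in the thread */
  PetscScalar v = 1.0;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N));
  PetscCall(MatSetFromOptions(A));
  /* Note: no MatXAIJSetPreallocation()/MatSeqAIJSetPreallocation() calls at all */
  PetscCall(MatSetUp(A));
  PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
  for (i = Istart; i < Iend; i++) PetscCall(MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES));
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}
----------------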
You can give it a try. Use your current fastest code but just remove ALL the preallocation calls. I would be interested in what kind of performance you get compared to your best current performance.

  Barry

> On Jul 2, 2023, at 11:24 PM, Runfeng Jin <[email protected]> wrote:
>
> Hi! Good advice!
> I set values with the MatSetValues() API, which sets one part of a row at a time (I use a kind of tiling technique, so I cannot get all the values of a row at once).
> I tested the number of mallocs in these three cases. The number of mallocs decreases as the number of processors increases, and all of them are very large (the matrix is 283234 x 283234, as can be seen below). This should be due to the poor preallocation. I use a rough preallocation: every processor counts the number of nonzero elements in its first 10 rows and uses the largest count to preallocate memory for all of its local rows. It seems that this does not work well.
>
> number_of_processors   number_of_max_mallocs_among_all_processors
> 64                     20000
> 128                    17000
> 256                    11000
>
> I changed my way of preallocating: I evenly sample 100 rows in every local matrix and take the largest count to preallocate memory for all local rows. Now the assembly time is reduced to a very small value.
>
> number_of_processors   number_of_max_mallocs_among_all_processors
> 64                     3000
> 128                    700
> 256                    500
>
> Event    Count      Time (sec)       Flop                               --- Global ---    --- Stage ----   Total
>        Max Ratio   Max        Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
> 64       1   1.0   3.8999e+01 1.0    0.00e+00 0.0    7.1e+03 2.9e+05 1.1e+01 15  0  1  8  3  15  0  1  8  3        0
> 128      1   1.0   8.5714e+00 1.0    0.00e+00 0.0    2.6e+04 8.1e+04 1.1e+01  5  0  1  4  3   5  0  1  4  3        0
> 256      1   1.0   2.5512e+00 1.0    0.00e+00 0.0    1.0e+05 2.3e+04 1.1e+01  2  0  1  3  3   2  0  1  3  3        0
>
> So the reason "why the assembly time is smaller with an increasing number of processors" may be that more processors divide the malloc work, so the total time is reduced?
> If so, I still have some questions:
> 1. If the preallocation is not accurate, will the performance of the assembly be affected? I mean, when processors receive, via MPI, the elements that should be stored locally, will the new mallocs happen at that point?
> 2. I cannot give an accurate preallocation because of its large cost, so is there any better way to preallocate in my situation?
>
>
> Barry Smith <[email protected]> wrote on Sun, Jul 2, 2023, at 00:16:
>>
>> I see no reason not to trust the times below; they seem reasonable. You get more than a 2x speedup from 64 to 128 ranks and then about 1.38x from 128 to 256.
>>
>> The total amount of data moved (number of messages times average length) goes from 7.0e+03 * 2.8e+05 = 1.9600e+09 to 2.1060e+09 to 2.3000e+09. That is a pretty moderate increase in data, but note that each time you double the number of ranks, you also substantially increase the network hardware available to move the data, so one would hope for a good speedup.
>>
>> Also, the load balance is very good, near 1. Often with assembly we see it very out of balance, and it is difficult to get a good speedup when the balance is really off.
>>
>> It looks like over 90% of the entire run time is coming from setting and assembling the values? Also, the set-values time dominates the assembly time more as the number of ranks grows. Are you setting a single value at a time or a collection of them? How big are the vectors?
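To illustrate the distinction being asked about here, a hypothetical sketch; the helper name InsertRowSegment and the ncols/cols/vals arguments are made up for illustration and are not from the thread:

----------------
#include <petscmat.h>

/* Insert one row segment with a single batched MatSetValues() call.
   The commented-out alternative sets one entry per call and pays the
   call and search overhead once per entry instead of once per segment. */
static PetscErrorCode InsertRowSegment(Mat A, PetscInt row, PetscInt ncols, const PetscInt cols[], const PetscScalar vals[])
{
  PetscFunctionBeginUser;
  PetscCall(MatSetValues(A, 1, &row, ncols, cols, vals, INSERT_VALUES));
  /* versus, one value at a time:
     for (PetscInt j = 0; j < ncols; j++)
       PetscCall(MatSetValues(A, 1, &row, 1, &cols[j], &vals[j], INSERT_VALUES));
  */
  PetscFunctionReturn(PETSC_SUCCESS);
}
----------------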
>>
>> Run all three cases with -info :vec to see some information about how many mallocs were needed to hold the stashed vector entries.
>>
>>
>>> On Jun 30, 2023, at 10:25 PM, Runfeng Jin <[email protected]> wrote:
>>>
>>> Hi,
>>> Thanks for your reply. I tried using PetscLogEvent(), and the result shows the same conclusion.
>>> What I have done is:
>>> ----------------
>>> PetscLogEvent Mat_assemble_event, Mat_setvalue_event, Mat_setAsse_event;
>>> PetscClassId classid;
>>> PetscLogDouble user_event_flops;
>>> PetscClassIdRegister("Test assemble and set value", &classid);
>>> PetscLogEventRegister("Test only assemble", classid, &Mat_assemble_event);
>>> PetscLogEventRegister("Test only set values", classid, &Mat_setvalue_event);
>>> PetscLogEventRegister("Test both assemble and set values", classid, &Mat_setAsse_event);
>>> PetscLogEventBegin(Mat_setAsse_event, 0, 0, 0, 0);
>>> PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
>>> /* ... compute elements and call MatSetValues(); no call to assembly ... */
>>> PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);
>>>
>>> PetscLogEventBegin(Mat_assemble_event, 0, 0, 0, 0);
>>> MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>> MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>>> PetscLogEventEnd(Mat_assemble_event, 0, 0, 0, 0);
>>> PetscLogEventEnd(Mat_setAsse_event, 0, 0, 0, 0);
>>> ----------------
>>>
>>> And the output is as follows. By the way, does PETSc record all of the time between PetscLogEventBegin and PetscLogEventEnd, or just the time spent in PETSc API calls?
>>
>> It is all of the time.
>>
>>> ----------------
>>> Event           Count      Time (sec)       Flop                               --- Global ---    --- Stage ----   Total
>>>               Max Ratio   Max        Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
>>> 64new           1   1.0   2.3775e+02 1.0    0.00e+00 0.0    6.2e+03 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2        0
>>> 128new          1   1.0   6.9945e+01 1.0    0.00e+00 0.0    2.5e+04 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2        0
>>> 256new          1   1.0   1.7445e+01 1.0    0.00e+00 0.0    9.9e+04 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2        0
>>>
>>> 64:
>>> only assemble   1   1.0   2.6596e+02 1.0    0.00e+00 0.0    7.0e+03 2.8e+05 1.1e+01 55  0  1  8  3  55  0  1  8  3        0
>>> only setvalues  1   1.0   1.9987e+02 1.0    0.00e+00 0.0    0.0e+00 0.0e+00 0.0e+00 41  0  0  0  0  41  0  0  0  0        0
>>> Test both       1   1.0   4.6580e+02 1.0    0.00e+00 0.0    7.0e+03 2.8e+05 1.5e+01 96  0  1  8  4  96  0  1  8  4        0
>>>
>>> 128:
>>> only assemble   1   1.0   6.9718e+01 1.0    0.00e+00 0.0    2.6e+04 8.1e+04 1.1e+01 30  0  1  4  3  30  0  1  4  3        0
>>> only setvalues  1   1.0   1.4438e+02 1.1    0.00e+00 0.0    0.0e+00 0.0e+00 0.0e+00 60  0  0  0  0  60  0  0  0  0        0
>>> Test both       1   1.0   2.1417e+02 1.0    0.00e+00 0.0    2.6e+04 8.1e+04 1.5e+01 91  0  1  4  4  91  0  1  4  4        0
>>>
>>> 256:
>>> only assemble   1   1.0   1.7482e+01 1.0    0.00e+00 0.0    1.0e+05 2.3e+04 1.1e+01 10  0  1  3  3  10  0  1  3  3        0
>>> only setvalues  1   1.0   1.3717e+02 1.1    0.00e+00 0.0    0.0e+00 0.0e+00 0.0e+00 78  0  0  0  0  78  0  0  0  0        0
>>> Test both       1   1.0   1.5475e+02 1.0    0.00e+00 0.0    1.0e+05 2.3e+04 1.5e+01 91  0  1  3  4  91  0  1  3  4        0
>>>
>>>
>>> Runfeng
>>>
>>> Barry Smith <[email protected]> wrote on Fri, Jun 30, 2023, at 23:35:
>>>>
>>>> You cannot look just at the VecAssemblyEnd() time; that will very likely give the wrong impression of the total time it takes to put the values in.
>>>>
>>>> You need to register a new event, put the PetscLogEventBegin() just before you start generating the vector entries and calling VecSetValues(), and put the PetscLogEventEnd() just after the VecAssemblyEnd(); this is the only way to get an accurate accounting of the time.
>>>>
>>>> Barry
>>>>
>>>>
>>>> > On Jun 30, 2023, at 11:21 AM, Runfeng Jin <[email protected]> wrote:
>>>> >
>>>> > Hello!
>>>> >
>>>> > When I use PETSc to build an sbaij matrix, I find a strange thing. When I increase the number of processors, the assembly time becomes smaller. All of these runs use exactly the same matrix. The assembly time mainly comes from message passing, because I use a dynamic workload, so it is random which elements are computed by which processor.
>>>> > Intuitively, with more processors it should be more likely that a processor computes elements that are stored on other processors. But from the output of log_view, it seems that with more processors, the processors compute more of the elements stored locally (inferred from the fact that, with more processors, the total amount of passed messages is smaller).
>>>> >
>>>> > What could cause this to happen? Thank you!
>>>> >
>>>> >
>>>> > Following is the output of log_view for 64/128/256 processors. Every row is the time profile of VecAssemblyEnd.
>>>> >
>>>> > ------------------------------------------------------------------------------------------------------------------------
>>>> > processors      Count      Time (sec)       Flop                               --- Global ---    --- Stage ----   Total
>>>> >              Max Ratio   Max        Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
>>>> > 64              1   1.0   2.3775e+02 1.0    0.00e+00 0.0    6.2e+03 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2        0
>>>> > 128             1   1.0   6.9945e+01 1.0    0.00e+00 0.0    2.5e+04 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2        0
>>>> > 256             1   1.0   1.7445e+01 1.0    0.00e+00 0.0    9.9e+04 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2        0
>>>> >
>>>> > Runfeng Jin
>>>>
>>
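A minimal sketch of the event-wrapped timing Barry recommends above, applied to a vector; the event name "FillAndAssemble" and the fill loop are illustrative, not taken from the thread:

----------------
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec           x;
  PetscLogEvent FILL_AND_ASSEMBLE; /* illustrative event name */
  PetscInt      i, rstart, rend;
  PetscScalar   one = 1.0;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscLogEventRegister("FillAndAssemble", VEC_CLASSID, &FILL_AND_ASSEMBLE));
  PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
  PetscCall(VecSetSizes(x, PETSC_DECIDE, 283234));
  PetscCall(VecSetFromOptions(x));
  PetscCall(VecGetOwnershipRange(x, &rstart, &rend));

  /* One event spans both the VecSetValues() loop and the assembly,
     so -log_view reports the full cost of putting the values in. */
  PetscCall(PetscLogEventBegin(FILL_AND_ASSEMBLE, 0, 0, 0, 0));
  for (i = rstart; i < rend; i++) PetscCall(VecSetValues(x, 1, &i, &one, INSERT_VALUES));
  PetscCall(VecAssemblyBegin(x));
  PetscCall(VecAssemblyEnd(x));
  PetscCall(PetscLogEventEnd(FILL_AND_ASSEMBLE, 0, 0, 0, 0));

  PetscCall(VecDestroy(&x));
  PetscCall(PetscFinalize());
  return 0;
}
----------------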
