The main branch of PETSc now supports filling sparse matrices without providing any preallocation information.
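For reference, a minimal sketch of such a no-preallocation fill (the diagonal-only loop and the values below are illustrative, not taken from the thread; only the global size 283234 is from the discussion):

----------------
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  PetscInt    i, Istart, Iend, N = 283234; /* global size mentioned later in the thread */
  PetscScalar v = 1.0;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N));
  PetscCall(MatSetFromOptions(A));
  /* Note: no MatXAIJSetPreallocation()/MatSeqAIJSetPreallocation() calls at all */
  PetscCall(MatSetUp(A));
  PetscCall(MatGetOwnershipRange(A, &Istart, &Iend));
  for (i = Istart; i < Iend; i++) PetscCall(MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES));
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}
----------------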
You can give it a try. Use your current fastest code but just remove ALL the preallocation calls. I would be interested in what kind of performance you get compared to your best current performance.

  Barry

> On Jul 2, 2023, at 11:24 PM, Runfeng Jin <[email protected]> wrote:
>
> Hi! Good advice!
> I set values with the MatSetValues() API, which sets one part of a row at a time (I use a kind of tiling technique, so I cannot get all the values of a row at once).
> I tested the number of mallocs in these three cases. The number of mallocs decreases as the number of processors increases, and all of them are very large (the matrix is 283234 x 283234, as can be seen below). This should be due to the poor preallocation. I use a rough preallocation: every processor counts the number of nonzero elements in its first 10 rows and uses the largest count to preallocate memory for all of its local rows. It seems that this does not work well.
>
> number_of_processors   number_of_max_mallocs_among_all_processors
> 64                     20000
> 128                    17000
> 256                    11000
>
> I changed my way of preallocating: I evenly sample 100 rows in every local matrix and take the largest count to preallocate memory for all local rows. Now the assembly time is reduced to a very small value.
>
> number_of_processors   number_of_max_mallocs_among_all_processors
> 64                     3000
> 128                    700
> 256                    500
>
> Event    Count      Time (sec)       Flop                               --- Global ---    --- Stage ----   Total
>        Max Ratio   Max        Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
> 64       1   1.0   3.8999e+01 1.0    0.00e+00 0.0    7.1e+03 2.9e+05 1.1e+01 15  0  1  8  3  15  0  1  8  3        0
> 128      1   1.0   8.5714e+00 1.0    0.00e+00 0.0    2.6e+04 8.1e+04 1.1e+01  5  0  1  4  3   5  0  1  4  3        0
> 256      1   1.0   2.5512e+00 1.0    0.00e+00 0.0    1.0e+05 2.3e+04 1.1e+01  2  0  1  3  3   2  0  1  3  3        0
>
> So the reason "why the assembly time is smaller with an increasing number of processors" may be that more processors divide the malloc work, so the total time is reduced?
> If so, I still have some questions:
> 1. If the preallocation is not accurate, will the performance of the assembly be affected? I mean, when processors receive, via MPI, the elements that should be stored locally, will the new mallocs happen at that point?
> 2. I cannot give an accurate preallocation because of its large cost, so is there any better way to preallocate in my situation?
>
>
> Barry Smith <[email protected]> wrote on Sun, Jul 2, 2023, at 00:16:
>>
>> I see no reason not to trust the times below; they seem reasonable. You get more than a 2x speedup from 64 to 128 ranks and then about 1.38x from 128 to 256.
>>
>> The total amount of data moved (number of messages times average length) goes from 7.0e+03 * 2.8e+05 = 1.9600e+09 to 2.1060e+09 to 2.3000e+09. That is a pretty moderate increase in data, but note that each time you double the number of ranks, you also substantially increase the network hardware available to move the data, so one would hope for a good speedup.
>>
>> Also, the load balance is very good, near 1. Often with assembly we see it very out of balance, and it is difficult to get a good speedup when the balance is really off.
>>
>> It looks like over 90% of the entire run time is coming from setting and assembling the values? Also, the set-values time dominates the assembly time more as the number of ranks grows. Are you setting a single value at a time or a collection of them? How big are the vectors?
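To illustrate the distinction being asked about here, a hypothetical sketch; the helper name InsertRowSegment and the ncols/cols/vals arguments are made up for illustration and are not from the thread:

----------------
#include <petscmat.h>

/* Insert one row segment with a single batched MatSetValues() call.
   The commented-out alternative sets one entry per call and pays the
   call and search overhead once per entry instead of once per segment. */
static PetscErrorCode InsertRowSegment(Mat A, PetscInt row, PetscInt ncols, const PetscInt cols[], const PetscScalar vals[])
{
  PetscFunctionBeginUser;
  PetscCall(MatSetValues(A, 1, &row, ncols, cols, vals, INSERT_VALUES));
  /* versus, one value at a time:
     for (PetscInt j = 0; j < ncols; j++)
       PetscCall(MatSetValues(A, 1, &row, 1, &cols[j], &vals[j], INSERT_VALUES));
  */
  PetscFunctionReturn(PETSC_SUCCESS);
}
----------------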
>>
>> Run all three cases with -info :vec to see some information about how many mallocs were needed to hold the stashed vector entries.
>>
>>
>>> On Jun 30, 2023, at 10:25 PM, Runfeng Jin <[email protected]> wrote:
>>>
>>> Hi,
>>> Thanks for your reply. I tried using PetscLogEvent(), and the result shows the same conclusion.
>>> What I have done is:
>>> ----------------
>>> PetscLogEvent Mat_assemble_event, Mat_setvalue_event, Mat_setAsse_event;
>>> PetscClassId classid;
>>> PetscLogDouble user_event_flops;
>>> PetscClassIdRegister("Test assemble and set value", &classid);
>>> PetscLogEventRegister("Test only assemble", classid, &Mat_assemble_event);
>>> PetscLogEventRegister("Test only set values", classid, &Mat_setvalue_event);
>>> PetscLogEventRegister("Test both assemble and set values", classid, &Mat_setAsse_event);
>>> PetscLogEventBegin(Mat_setAsse_event, 0, 0, 0, 0);
>>> PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
>>> /* ... compute elements and call MatSetValues(); no call to assembly ... */
>>> PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);
>>>
>>> PetscLogEventBegin(Mat_assemble_event, 0, 0, 0, 0);
>>> MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>> MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>>> PetscLogEventEnd(Mat_assemble_event, 0, 0, 0, 0);
>>> PetscLogEventEnd(Mat_setAsse_event, 0, 0, 0, 0);
>>> ----------------
>>>
>>> And the output is as follows. By the way, does PETSc record all of the time between PetscLogEventBegin and PetscLogEventEnd, or just the time spent in PETSc API calls?
>>
>> It is all of the time.
>>
>>> ----------------
>>> Event           Count      Time (sec)       Flop                               --- Global ---    --- Stage ----   Total
>>>               Max Ratio   Max        Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
>>> 64new           1   1.0   2.3775e+02 1.0    0.00e+00 0.0    6.2e+03 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2        0
>>> 128new          1   1.0   6.9945e+01 1.0    0.00e+00 0.0    2.5e+04 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2        0
>>> 256new          1   1.0   1.7445e+01 1.0    0.00e+00 0.0    9.9e+04 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2        0
>>>
>>> 64:
>>> only assemble   1   1.0   2.6596e+02 1.0    0.00e+00 0.0    7.0e+03 2.8e+05 1.1e+01 55  0  1  8  3  55  0  1  8  3        0
>>> only setvalues  1   1.0   1.9987e+02 1.0    0.00e+00 0.0    0.0e+00 0.0e+00 0.0e+00 41  0  0  0  0  41  0  0  0  0        0
>>> Test both       1   1.0   4.6580e+02 1.0    0.00e+00 0.0    7.0e+03 2.8e+05 1.5e+01 96  0  1  8  4  96  0  1  8  4        0
>>>
>>> 128:
>>> only assemble   1   1.0   6.9718e+01 1.0    0.00e+00 0.0    2.6e+04 8.1e+04 1.1e+01 30  0  1  4  3  30  0  1  4  3        0
>>> only setvalues  1   1.0   1.4438e+02 1.1    0.00e+00 0.0    0.0e+00 0.0e+00 0.0e+00 60  0  0  0  0  60  0  0  0  0        0
>>> Test both       1   1.0   2.1417e+02 1.0    0.00e+00 0.0    2.6e+04 8.1e+04 1.5e+01 91  0  1  4  4  91  0  1  4  4        0
>>>
>>> 256:
>>> only assemble   1   1.0   1.7482e+01 1.0    0.00e+00 0.0    1.0e+05 2.3e+04 1.1e+01 10  0  1  3  3  10  0  1  3  3        0
>>> only setvalues  1   1.0   1.3717e+02 1.1    0.00e+00 0.0    0.0e+00 0.0e+00 0.0e+00 78  0  0  0  0  78  0  0  0  0        0
>>> Test both       1   1.0   1.5475e+02 1.0    0.00e+00 0.0    1.0e+05 2.3e+04 1.5e+01 91  0  1  3  4  91  0  1  3  4        0
>>>
>>>
>>> Runfeng
>>>
>>> Barry Smith <[email protected]> wrote on Fri, Jun 30, 2023, at 23:35:
>>>>
>>>> You cannot look just at the VecAssemblyEnd() time; that will very likely give the wrong impression of the total time it takes to put the values in.
>>>>
>>>> You need to register a new event, put the PetscLogEventBegin() just before you start generating the vector entries and calling VecSetValues(), and put the PetscLogEventEnd() just after the VecAssemblyEnd(); this is the only way to get an accurate accounting of the time.
>>>>
>>>> Barry
>>>>
>>>>
>>>> > On Jun 30, 2023, at 11:21 AM, Runfeng Jin <[email protected]> wrote:
>>>> >
>>>> > Hello!
>>>> >
>>>> > When I use PETSc to build an sbaij matrix, I find a strange thing. When I increase the number of processors, the assembly time becomes smaller. All of these runs use exactly the same matrix. The assembly time mainly comes from message passing, because I use a dynamic workload, so it is random which elements are computed by which processor.
>>>> > Intuitively, with more processors it should be more likely that a processor computes elements that are stored on other processors. But from the output of log_view, it seems that with more processors, the processors compute more of the elements stored locally (inferred from the fact that, with more processors, the total amount of passed messages is smaller).
>>>> >
>>>> > What could cause this to happen? Thank you!
>>>> >
>>>> >
>>>> > Following is the output of log_view for 64/128/256 processors. Every row is the time profile of VecAssemblyEnd.
>>>> >
>>>> > ------------------------------------------------------------------------------------------------------------------------
>>>> > processors      Count      Time (sec)       Flop                               --- Global ---    --- Stage ----   Total
>>>> >              Max Ratio   Max        Ratio  Max      Ratio  Mess    AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s
>>>> > 64              1   1.0   2.3775e+02 1.0    0.00e+00 0.0    6.2e+03 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2        0
>>>> > 128             1   1.0   6.9945e+01 1.0    0.00e+00 0.0    2.5e+04 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2        0
>>>> > 256             1   1.0   1.7445e+01 1.0    0.00e+00 0.0    9.9e+04 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2        0
>>>> >
>>>> > Runfeng Jin
>>>>
>>
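A minimal sketch of the event-wrapped timing Barry recommends above, applied to a vector; the event name "FillAndAssemble" and the fill loop are illustrative, not taken from the thread:

----------------
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec           x;
  PetscLogEvent FILL_AND_ASSEMBLE; /* illustrative event name */
  PetscInt      i, rstart, rend;
  PetscScalar   one = 1.0;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscLogEventRegister("FillAndAssemble", VEC_CLASSID, &FILL_AND_ASSEMBLE));
  PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
  PetscCall(VecSetSizes(x, PETSC_DECIDE, 283234));
  PetscCall(VecSetFromOptions(x));
  PetscCall(VecGetOwnershipRange(x, &rstart, &rend));

  /* One event spans both the VecSetValues() loop and the assembly,
     so -log_view reports the full cost of putting the values in. */
  PetscCall(PetscLogEventBegin(FILL_AND_ASSEMBLE, 0, 0, 0, 0));
  for (i = rstart; i < rend; i++) PetscCall(VecSetValues(x, 1, &i, &one, INSERT_VALUES));
  PetscCall(VecAssemblyBegin(x));
  PetscCall(VecAssemblyEnd(x));
  PetscCall(PetscLogEventEnd(FILL_AND_ASSEMBLE, 0, 0, 0, 0));

  PetscCall(VecDestroy(&x));
  PetscCall(PetscFinalize());
  return 0;
}
----------------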
