I see no reason not to trust the times below; they seem reasonable. You get
more than a 2x speedup from 64 to 128 ranks and then about 1.38x from 128 to
256 (taking the "Test both" times: 4.66e+02 / 2.14e+02 ≈ 2.2 and
2.14e+02 / 1.55e+02 ≈ 1.38).

   The total amount of data moved (number of messages times average length)
goes from 7.0e+03 * 2.8e+05 = 1.9600e+09 to 2.6e+04 * 8.1e+04 = 2.1060e+09 to
1.0e+05 * 2.3e+04 = 2.3000e+09. That is a pretty moderate increase in data, but
note that each time you double the number of ranks you also substantially
increase the network hardware available to move the data, so one would hope
for a good speedup.

   Also, the load balance is very good, near 1. Often with assembly we see a
badly out-of-balance workload, and it is difficult to get good speedup when
the balance is really off.

   It looks like over 90% of the entire run time is coming from setting and
assembling the values? Also, the set-values time dominates the assembly time
more and more as the number of ranks grows. Are you setting a single value at
a time or a collection of them per call? How big are the vectors?
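
If you are currently inserting one entry per call, batching them can make a
big difference. Here is a minimal sketch only (the function name and the
simple diagonal-plus-superdiagonal stencil are made up for illustration, not
your matrix) of passing one whole row per MatSetValues() call:
----------------
#include <petscmat.h>

/* Sketch: insert each row's entries with a single MatSetValues() call
   instead of one call per entry. Only the diagonal and the superdiagonal
   are set, so this is also legal for an SBAIJ matrix, which stores only
   the upper triangle. */
static PetscErrorCode InsertRowsBatched(Mat A, PetscInt rstart, PetscInt rend, PetscInt N)
{
  PetscFunctionBeginUser;
  for (PetscInt row = rstart; row < rend; row++) {
    PetscInt    cols[2], ncols = 0;
    PetscScalar vals[2];
    cols[ncols] = row;              /* diagonal entry      */
    vals[ncols++] = 2.0;
    if (row < N - 1) {
      cols[ncols] = row + 1;        /* superdiagonal entry */
      vals[ncols++] = -1.0;
    }
    PetscCall(MatSetValues(A, 1, &row, ncols, cols, vals, ADD_VALUES));
  }
  PetscFunctionReturn(PETSC_SUCCESS);
}
----------------
The same idea applies to VecSetValues(): pass arrays of indices and values in
one call rather than looping over single entries.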

   Run all three cases with -info :vec to see some information about how many
mallocs were needed to hold the stashed vector entries.
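For example, something along these lines (the MPI launcher and executable name
are just placeholders for whatever you actually run):

    mpiexec -n 64 ./your_app -info :vec

and likewise for the 128- and 256-rank runs, then compare the reported stash
sizes and malloc counts across the three cases.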




> On Jun 30, 2023, at 10:25 PM, Runfeng Jin <[email protected]> wrote:
> 
> 
> 
> Hi, 
>     Thanks for your reply. I tried using PetscLogEvent(), and the result
> shows the same conclusion.
>     What I have done is:
> ----------------
>     PetscLogEvent Mat_assemble_event, Mat_setvalue_event, Mat_setAsse_event;
>     PetscClassId classid;
>     PetscLogDouble user_event_flops;
>     PetscClassIdRegister("Test assemble and set value", &classid);
>     PetscLogEventRegister("Test only assemble", classid, &Mat_assemble_event);
>     PetscLogEventRegister("Test only set values", classid, &Mat_setvalue_event);
>     PetscLogEventRegister("Test both assemble and set values", classid, &Mat_setAsse_event);
>     PetscLogEventBegin(Mat_setAsse_event, 0, 0, 0, 0);
>     PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
>     ...compute elements and use MatSetValues. No call for assembly
>     PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);
> 
>     PetscLogEventBegin(Mat_assemble_event, 0, 0, 0, 0);
>     MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>     MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>     PetscLogEventEnd(Mat_assemble_event, 0, 0, 0, 0);
>     PetscLogEventEnd(Mat_setAsse_event, 0, 0, 0, 0);
> ----------------
> 
>     And the output is as follows. By the way, does PETSc record all of the
> time between PetscLogEventBegin and PetscLogEventEnd, or just the time spent
> in PETSc API calls?

   It is all of the time between the PetscLogEventBegin() and
PetscLogEventEnd() calls, not just the time spent inside PETSc routines.
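
So in a sketch like the following (ComputeMyElements() is just a made-up
placeholder for your own element computation), the compute time is charged to
the event as well, not only the PETSc calls:
----------------
    PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
    ComputeMyElements();                  /* placeholder: your own work is  */
                                          /* counted in the event time too  */
    /* ... your MatSetValues() calls ... */
    PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);
----------------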

> ----------------
> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
> 64new               1 1.0 2.3775e+02 1.0 0.00e+00 0.0 6.2e+03 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2     0
> 128new              1 1.0 6.9945e+01 1.0 0.00e+00 0.0 2.5e+04 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2     0
> 256new              1 1.0 1.7445e+01 1.0 0.00e+00 0.0 9.9e+04 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2     0
> 
> 64:
> only assemble       1 1.0 2.6596e+02 1.0 0.00e+00 0.0 7.0e+03 2.8e+05 1.1e+01 55  0  1  8  3  55  0  1  8  3     0
> only setvalues      1 1.0 1.9987e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 41  0  0  0  0  41  0  0  0  0     0
> Test both           1 1.0 4.6580e+02 1.0 0.00e+00 0.0 7.0e+03 2.8e+05 1.5e+01 96  0  1  8  4  96  0  1  8  4     0
> 
> 128:
> only assemble       1 1.0 6.9718e+01 1.0 0.00e+00 0.0 2.6e+04 8.1e+04 1.1e+01 30  0  1  4  3  30  0  1  4  3     0
> only setvalues      1 1.0 1.4438e+02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 60  0  0  0  0  60  0  0  0  0     0
> Test both           1 1.0 2.1417e+02 1.0 0.00e+00 0.0 2.6e+04 8.1e+04 1.5e+01 91  0  1  4  4  91  0  1  4  4     0
> 
> 256:
> only assemble       1 1.0 1.7482e+01 1.0 0.00e+00 0.0 1.0e+05 2.3e+04 1.1e+01 10  0  1  3  3  10  0  1  3  3     0
> only setvalues      1 1.0 1.3717e+02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78  0  0  0  0  78  0  0  0  0     0
> Test both           1 1.0 1.5475e+02 1.0 0.00e+00 0.0 1.0e+05 2.3e+04 1.5e+01 91  0  1  3  4  91  0  1  3  4     0
> 
> 
> 
> Runfeng
> 
> Barry Smith <[email protected]> wrote on Friday, June 30, 2023 at 23:35:
>> 
>>    You cannot look just at the VecAssemblyEnd() time; that will very likely
>> give the wrong impression of the total time it takes to put the values in.
>> 
>>    You need to register a new event, put a PetscLogEventBegin() just before
>> you start generating the vector entries and calling VecSetValues(), and put
>> the PetscLogEventEnd() just after the VecAssemblyEnd(); this is the only way
>> to get an accurate accounting of the time.
>> 
>>   Barry
>> 
>> 
>> > On Jun 30, 2023, at 11:21 AM, Runfeng Jin <[email protected]> wrote:
>> > 
>> > Hello!
>> > 
>> > When I use PETSc to build an SBAIJ matrix, I find a strange thing: when I
>> > increase the number of processors, the assembly time becomes smaller. All of
>> > these runs use exactly the same matrix. The assembly time mainly arises from
>> > message passing, because I use a dynamic workload, so it is random which
>> > elements are computed by which processor.
>> > Instinctively, with more processors it should be more likely that a processor
>> > computes elements that are stored on other processors. But from the output of
>> > log_view, it seems that with more processors each processor computes more of
>> > the elements stored locally (inferred from the observation that, with more
>> > processors, a smaller total amount of messages is passed).
>> > 
>> > What could cause this happened? Thank you!
>> > 
>> > 
>> > Following is the output of log_view for 64/128/256 processors. Every row
>> > is the time profile of VecAssemblyEnd.
>> > 
>> > ------------------------------------------------------------------------------------------------------------------------
>> > processors          Count     Time (sec)      Flop               --- Global ---  --- Stage ----  Total
>> >                     Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>> > 64                  1   1.0   2.3775e+02 1.0   0.00e+00 0.0   6.2e+03 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2     0
>> > 128                 1   1.0   6.9945e+01 1.0   0.00e+00 0.0   2.5e+04 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2     0
>> > 256                 1   1.0   1.7445e+01 1.0   0.00e+00 0.0   9.9e+04 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2     0
>> > 
>> > Runfeng Jin
>> 
