> On Jul 3, 2023, at 10:11 AM, Runfeng Jin <[email protected]> wrote:
> 
> Hi, 
>> We use a hash table to store the nonzeros on the fly, and then convert to 
>> packed storage on assembly.

   There is "extra memory" since the matrix entries are first stored in a hash 
and then converted into the regular CSR format, so for a short while, both 
copies are in memory. 

    We use the amazing khash package (include/petsc/private/khash/khash.h); our 
code is scattered around a bit depending on the matrix format being formed. 

cd src/mat
git grep "_Hash("
impls/aij/mpi/mpiaij.c:/* defines MatSetValues_MPI_Hash(), MatAssemblyBegin_MPI_Hash(), and MatAssemblyEnd_MPI_Hash() */
impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:/* defines MatSetValues_MPICUSPARSE_Hash() */
impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:  PetscCall(MatSetUp_MPI_Hash(A));
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetValues_MPI_Hash(Mat A, PetscInt m, const PetscInt *rows, PetscInt n, const PetscInt *cols, const PetscScalar *values, InsertMode addv)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatAssemblyBegin_MPI_Hash(Mat A, PETSC_UNUSED MatAssemblyType type)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatAssemblyEnd_MPI_Hash(Mat A, MatAssemblyType type)
impls/aij/mpi/mpihashmat.h:        PetscCall(MatSetValues_MPI_Hash(A, 1, row + i, ncols, col + i, val + i, A->insertmode));
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatDestroy_MPI_Hash(Mat A)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatZeroEntries_MPI_Hash(PETSC_UNUSED Mat A)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetRandom_MPI_Hash(Mat A, PETSC_UNUSED PetscRandom r)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetUp_MPI_Hash(Mat A)
impls/aij/seq/aij.c:/* defines MatSetValues_Seq_Hash(), MatAssemblyEnd_Seq_Hash(), MatSetUp_Seq_Hash() */
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatAssemblyEnd_Seq_Hash(Mat A, MatAssemblyType type)
impls/aij/seq/seqhashmat.h:  A->preallocated = PETSC_FALSE; /* this was set to true for the MatSetValues_Hash() to work */
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatDestroy_Seq_Hash(Mat A)
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatZeroEntries_Seq_Hash(Mat A)
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatSetRandom_Seq_Hash(Mat A, PetscRandom r)
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatSetUp_Seq_Hash(Mat A)
impls/baij/mpi/mpibaij.c:/* defines MatSetValues_MPI_Hash(), MatAssemblyBegin_MPI_Hash(), and MatAssemblyEnd_MPI_Hash() */
impls/baij/seq/baij.c:/* defines MatSetValues_Seq_Hash(), MatAssemblyEnd_Seq_Hash(), MatSetUp_Seq_Hash() */
impls/sbaij/mpi/mpisbaij.c:/* defines MatSetValues_MPI_Hash(), MatAssemblyBegin_MPI_Hash(), MatAssemblyEnd_MPI_Hash(), MatSetUp_MPI_Hash() */
impls/sbaij/seq/sbaij.c:/* defines MatSetValues_Seq_Hash(), MatAssemblyEnd_Seq_Hash(), MatSetUp_Seq_Hash() */
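
    For intuition, here is a minimal standalone sketch of the hash-then-pack idea, 
using the khash API that header vendors. This is not PETSc's actual code: 
pack_ij(), hash_set_value(), and hash_to_csr() are made-up names, and PETSc's real 
paths do more (MPI stashing, block formats, row sorting).

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include "khash.h"

/* map a (row, col) pair, packed into one 64-bit key, to a matrix entry */
KHASH_MAP_INIT_INT64(coo, double)

static int64_t pack_ij(int32_t i, int32_t j)
{
  return ((int64_t)i << 32) | (uint32_t)j;
}

/* insert-or-add one entry, mimicking MatSetValues() with ADD_VALUES */
static void hash_set_value(khash_t(coo) *h, int i, int j, double v)
{
  int      absent;
  khiter_t it = kh_put(coo, h, pack_ij(i, j), &absent);
  if (absent) kh_value(h, it) = v; /* first time this (i, j) is seen */
  else kh_value(h, it) += v;       /* accumulate a repeated entry */
}

/* on "assembly": count entries per row, prefix-sum, scatter into packed CSR;
   the hash and the CSR arrays briefly coexist: the "extra memory" above */
static void hash_to_csr(khash_t(coo) *h, int m, int **rowptr, int **colidx, double **val)
{
  int64_t key;
  double  v;
  int    *ptr = calloc(m + 1, sizeof(int));

  kh_foreach(h, key, v, { ptr[(key >> 32) + 1]++; (void)v; });
  for (int i = 0; i < m; i++) ptr[i + 1] += ptr[i];

  int    *col  = malloc(ptr[m] * sizeof(int));
  double *a    = malloc(ptr[m] * sizeof(double));
  int    *next = calloc(m, sizeof(int));
  kh_foreach(h, key, v, {
    int i = (int)(key >> 32);
    int p = ptr[i] + next[i]++;
    col[p] = (int)(key & 0xffffffff); /* PETSc additionally sorts each row */
    a[p]   = v;
  });
  free(next);
  *rowptr = ptr; *colidx = col; *val = a;
  kh_destroy(coo, h); /* only now is the duplicate copy freed */
}

int main(void)
{
  khash_t(coo) *h = kh_init(coo);
  int *rowptr, *colidx;
  double *val;

  hash_set_value(h, 0, 2, 1.0);
  hash_set_value(h, 1, 0, 2.0);
  hash_set_value(h, 0, 2, 0.5); /* combines with the first (0, 2) entry */
  hash_to_csr(h, 2, &rowptr, &colidx, &val);
  printf("nnz = %d, a[0] = %g\n", rowptr[2], val[0]); /* nnz = 2, a[0] = 1.5 */
  free(rowptr); free(colidx); free(val);
  return 0;
}

Because the hash combines duplicate (i, j) insertions in O(1) and the exact 
per-row counts are known at conversion time, the packed arrays can be sized 
exactly, which is why no memory is wasted afterwards.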

Thanks for the numbers; it is good to see the performance is so similar to that 
obtained when providing preallocation information.

> 
> Could you tell me which file implements this function? 
> 
> Runfeng
> 
> 
> 
> Runfeng Jin <[email protected] <mailto:[email protected]>> 于2023年7月3日周一 
> 22:05写道:
>> Thank you for all your help!
>> 
>> Runfeng
>> 
>> Matthew Knepley <[email protected] <mailto:[email protected]>> 于2023年7月3日周一 
>> 22:03写道:
>>> On Mon, Jul 3, 2023 at 9:56 AM Runfeng Jin <[email protected]> wrote:
>>>> Hi, impressive performance!
>>>>   I used the newest version of PETSc (release branch), and at large processor 
>>>> counts it almost eliminates all assembly and stash time (assembly time: 4s at 
>>>> 64 processors, 2s at 128, 0.2s at 256; stash time below 2s everywhere). For 
>>>> zero programming cost, that is really incredible.
>>>>   The older code has a regular arrangement of the number of nonzero elements 
>>>> across rows, so I can make a good rough preallocation. And the data show that 
>>>> carefully arranging the data and roughly estimating the maximum number of 
>>>> nonzero elements per row still gives better performance than the new version 
>>>> without preallocation. However, in practice I will use the newer version 
>>>> without preallocation, because: 1) it takes less programming effort for 
>>>> nearly the same performance; 2) memory usage is good (I see no unneeded 
>>>> memory after assembly); 3) dedicated preallocation is usually not easy and 
>>>> adds extra time cost.
>>>>    Maybe it would be better to leave some room for the user to give the 
>>>> preallocation a slight hint and thus get better performance, but I have no 
>>>> idea what such a hint should look like.
>>>>    I am also very curious about how PETSc achieves this. How can it know 
>>>> nothing in advance yet achieve such good performance with no wasted memory? 
>>>> Could you explain this?
>>> 
>>> We use a hash table to store the nonzeros on the fly, and then convert to 
>>> packed storage on assembly.
>>> 
>>>   Thanks,
>>> 
>>>      Matt
>>>  
>>>> assemble time:
>>>> version\processors      4        8       16      32      64     128     256
>>>>   old                14677s    4694s   1124s    572s    38s     8s      2s
>>>>   new                   50s      28s     15s    7.8s     4s     2s     0.4s
>>>>   older                 27s      24s     19s     12s    14s     -       -
>>>> stash time (max among all processors):
>>>> version\processors      4        8       16      32      64     128     256
>>>>   old                 3145s    2554s    673s    329s   201s   142s    138s
>>>>   new                    2s       1s     ~0s     ~0s    ~0s    ~0s     ~0s
>>>>   older                 10s      73s     18s      5s     1s     -       -
>>>> old: my poor preallocation
>>>> new: newest version of PETSc, with no preallocation
>>>> older: the best-preallocation version of my code. 
>>>> 
>>>> 
>>>> Runfeng
>>>> 
>>>> Barry Smith <[email protected] <mailto:[email protected]>> 于2023年7月3日周一 
>>>> 12:19写道:
>>>>> 
>>>>>    The main branch of PETSc now supports filling sparse matrices without 
>>>>> providing any preallocation information.
>>>>> 
>>>>>    You can give it a try. Use your current fastest code but just remove 
>>>>> ALL the preallocation calls. I would be interested in what kind of 
>>>>> performance you get compared to your best current performance.
>>>>> 
>>>>>   Barry
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Jul 2, 2023, at 11:24 PM, Runfeng Jin <[email protected]> wrote:
>>>>>> 
>>>>>> Hi! Good advice!
>>>>>>     I set values with the MatSetValues() API, which sets one part of a row 
>>>>>> at a time (I use a kind of tiling technique, so I cannot get all values of 
>>>>>> a row at once).
>>>>>>     I tested the number of mallocs in these three cases. The number of 
>>>>>> mallocs decreases as the number of processors increases, but all of them 
>>>>>> are very large (the matrix is 283234 x 283234, as can be seen below). This 
>>>>>> should be due to poor preallocation: I use a rough preallocation in which 
>>>>>> every processor counts the number of nonzero elements in its first 10 rows 
>>>>>> and uses the largest count to preallocate memory for all local rows. That 
>>>>>> does not seem to work well. 
>>>>>> number_of_processors    number_of_max_mallocs_among_all_processors
>>>>>> 64                      20000
>>>>>> 128                     17000
>>>>>> 256                     11000
>>>>>>     I then changed my preallocation strategy: I evenly sample 100 rows in 
>>>>>> each local matrix and use the largest count to preallocate memory for all 
>>>>>> local rows (a sketch of this follows the table below). Now the assembly 
>>>>>> time is reduced to a very small value.
>>>>>> number_of_processors    number_of_max_mallocs_among_all_processors
>>>>>> 64                      3000
>>>>>> 128                     700
>>>>>> 256                     500
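>>>>>> In code, this sampling-based preallocation looks roughly like the sketch 
>>>>>> below. This is only an illustration: count_row_nnz() stands for my 
>>>>>> application-specific counting of a row's nonzeros, and I show the AIJ call; 
>>>>>> for my SBAIJ matrix the analogous MatMPISBAIJSetPreallocation() applies.
>>>>>> 
>>>>>> #include <petscmat.h>
>>>>>> 
>>>>>> /* hypothetical application routine: nonzeros this code will put in a row */
>>>>>> extern PetscInt count_row_nnz(PetscInt row);
>>>>>> 
>>>>>> static PetscErrorCode PreallocateBySampling(Mat A)
>>>>>> {
>>>>>>   PetscInt rstart, rend, stride, max_nnz = 0;
>>>>>> 
>>>>>>   PetscFunctionBeginUser;
>>>>>>   PetscCall(MatGetOwnershipRange(A, &rstart, &rend));
>>>>>>   stride = PetscMax((rend - rstart) / 100, 1); /* visit ~100 local rows */
>>>>>>   for (PetscInt i = rstart; i < rend; i += stride) max_nnz = PetscMax(max_nnz, count_row_nnz(i));
>>>>>>   /* over-allocate every local row with the sampled maximum, in both the
>>>>>>      diagonal and off-diagonal blocks, so MatSetValues() needs no mallocs */
>>>>>>   PetscCall(MatMPIAIJSetPreallocation(A, max_nnz, NULL, max_nnz, NULL));
>>>>>>   PetscFunctionReturn(PETSC_SUCCESS);
>>>>>> }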
>>>>>> Event               Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total
>>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>>> 64                  1 1.0 3.8999e+01 1.0 0.00e+00 0.0 7.1e+03 2.9e+05 1.1e+01 15  0  1  8  3  15  0  1  8  3     0
>>>>>> 128                 1 1.0 8.5714e+00 1.0 0.00e+00 0.0 2.6e+04 8.1e+04 1.1e+01  5  0  1  4  3   5  0  1  4  3     0
>>>>>> 256                 1 1.0 2.5512e+00 1.0 0.00e+00 0.0 1.0e+05 2.3e+04 1.1e+01  2  0  1  3  3   2  0  1  3  3     0
>>>>>> So could the reason assembly time shrinks with an increasing number of 
>>>>>> processors be that more processors divide the malloc work, reducing the 
>>>>>> total time?
>>>>>>  If so, I still have some questions:
>>>>>>     1. If the preallocation is not accurate, will the performance of the 
>>>>>> assembly be affected? I mean, when processors receive via MPI the elements 
>>>>>> that should be stored locally, will new mallocs happen at that point?
>>>>>>     2. I cannot give an accurate preallocation because of its large cost, 
>>>>>> so is there a better way to preallocate for my situation?
>>>>>> 
>>>>>> 
>>>>>> Barry Smith <[email protected] <mailto:[email protected]>> 于2023年7月2日周日 
>>>>>> 00:16写道:
>>>>>>> 
>>>>>>>    I see no reason not to trust the times below; they seem reasonable. 
>>>>>>> You get more than a 2x speedup from 64 to 128 and then about 1.38x from 
>>>>>>> 128 to 256. 
>>>>>>> 
>>>>>>>    The total amount of data moved (number of messages times average 
>>>>>>> length) goes from 7.0e+03 * 2.8e+05 = 1.9600e+09 to 2.1060e+09 to 
>>>>>>> 2.3000e+09. That is a pretty moderate increase in data, and note that each 
>>>>>>> time you double the number of ranks, you also substantially increase the 
>>>>>>> network hardware available to move the data, so one would hope for a good 
>>>>>>> speedup.
>>>>>>> 
>>>>>>>    Also, the load balance is very good, near 1. Often with assembly we 
>>>>>>> see very out-of-balance times, and it is difficult to get good speedup 
>>>>>>> when the balance is really off.
>>>>>>> 
>>>>>>>    It looks like over 90% of the entire run time is coming from setting 
>>>>>>> and assembling the values? Also, the set-values time dominates the 
>>>>>>> assembly time more as the number of ranks grows. Are you setting a single 
>>>>>>> value at a time or a collection of them? How big are the vectors?
>>>>>>> 
>>>>>>>    Run all three cases with -info :vec to see some information about 
>>>>>>> how many mallocs were needed to hold the stashed vector entries.
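>>>>>>> 
>>>>>>>    For example (the executable name is a placeholder):
>>>>>>> 
>>>>>>>        mpiexec -n 64 ./your_app -info :vec
>>>>>>> 
>>>>>>>    The stash-related lines in the output report how many entries were 
>>>>>>> stashed and how many mallocs were needed to hold them.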
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jun 30, 2023, at 10:25 PM, Runfeng Jin <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi, 
>>>>>>>>     Thanks for your reply. I tried PetscLogEvent(), and the result shows 
>>>>>>>> the same conclusion.
>>>>>>>>     What I did is:
>>>>>>>> ----------------
>>>>>>>>     PetscLogEvent  Mat_assemble_event, Mat_setvalue_event, Mat_setAsse_event;
>>>>>>>>     PetscClassId   classid;
>>>>>>>>     PetscLogDouble user_event_flops;
>>>>>>>>     PetscClassIdRegister("Test assemble and set value", &classid);
>>>>>>>>     PetscLogEventRegister("Test only assemble", classid, &Mat_assemble_event);
>>>>>>>>     PetscLogEventRegister("Test only set values", classid, &Mat_setvalue_event);
>>>>>>>>     PetscLogEventRegister("Test both assemble and set values", classid, &Mat_setAsse_event);
>>>>>>>>     PetscLogEventBegin(Mat_setAsse_event, 0, 0, 0, 0);
>>>>>>>>     PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
>>>>>>>>     ...compute elements and use MatSetValues; no call for assembly...
>>>>>>>>     PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);
>>>>>>>> 
>>>>>>>>     PetscLogEventBegin(Mat_assemble_event, 0, 0, 0, 0);
>>>>>>>>     MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>>>>>>>     MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>>>>>>>>     PetscLogEventEnd(Mat_assemble_event, 0, 0, 0, 0);
>>>>>>>>     PetscLogEventEnd(Mat_setAsse_event, 0, 0, 0, 0);
>>>>>>>> ----------------
>>>>>>>> 
>>>>>>>>     And the output is as follows. By the way, does PETSc record all time 
>>>>>>>> between PetscLogEventBegin() and PetscLogEventEnd(), or just the time 
>>>>>>>> spent inside PETSc APIs?
>>>>>>> 
>>>>>>>    It is all of the time. 
>>>>>>> 
>>>>>>>> ----------------
>>>>>>>> Event               Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total
>>>>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>>>>> 64new               1 1.0 2.3775e+02 1.0 0.00e+00 0.0 6.2e+03 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2     0
>>>>>>>> 128new              1 1.0 6.9945e+01 1.0 0.00e+00 0.0 2.5e+04 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2     0
>>>>>>>> 256new              1 1.0 1.7445e+01 1.0 0.00e+00 0.0 9.9e+04 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2     0
>>>>>>>> 
>>>>>>>> 64:
>>>>>>>> only assemble       1 1.0 2.6596e+02 1.0 0.00e+00 0.0 7.0e+03 2.8e+05 1.1e+01 55  0  1  8  3  55  0  1  8  3     0
>>>>>>>> only setvalues      1 1.0 1.9987e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 41  0  0  0  0  41  0  0  0  0     0
>>>>>>>> Test both           1 1.0 4.6580e+02 1.0 0.00e+00 0.0 7.0e+03 2.8e+05 1.5e+01 96  0  1  8  4  96  0  1  8  4     0
>>>>>>>> 
>>>>>>>> 128:
>>>>>>>> only assemble       1 1.0 6.9718e+01 1.0 0.00e+00 0.0 2.6e+04 8.1e+04 1.1e+01 30  0  1  4  3  30  0  1  4  3     0
>>>>>>>> only setvalues      1 1.0 1.4438e+02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 60  0  0  0  0  60  0  0  0  0     0
>>>>>>>> Test both           1 1.0 2.1417e+02 1.0 0.00e+00 0.0 2.6e+04 8.1e+04 1.5e+01 91  0  1  4  4  91  0  1  4  4     0
>>>>>>>> 
>>>>>>>> 256:
>>>>>>>> only assemble       1 1.0 1.7482e+01 1.0 0.00e+00 0.0 1.0e+05 2.3e+04 1.1e+01 10  0  1  3  3  10  0  1  3  3     0
>>>>>>>> only setvalues      1 1.0 1.3717e+02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 78  0  0  0  0  78  0  0  0  0     0
>>>>>>>> Test both           1 1.0 1.5475e+02 1.0 0.00e+00 0.0 1.0e+05 2.3e+04 1.5e+01 91  0  1  3  4  91  0  1  3  4     0 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Runfeng
>>>>>>>> 
>>>>>>>> Barry Smith <[email protected] <mailto:[email protected]>> 于2023年6月30日周五 
>>>>>>>> 23:35写道:
>>>>>>>>> 
>>>>>>>>>    You cannot look just at the VecAssemblyEnd() time; that will very 
>>>>>>>>> likely give the wrong impression of the total time it takes to put 
>>>>>>>>> the values in.
>>>>>>>>> 
>>>>>>>>>    You need to register a new event, put a PetscLogEventBegin() just 
>>>>>>>>> before you start generating the vector entries and calling 
>>>>>>>>> VecSetValues(), and put the PetscLogEventEnd() just after the 
>>>>>>>>> VecAssemblyEnd(); this is the only way to get an accurate accounting 
>>>>>>>>> of the time.
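>>>>>>>>> 
>>>>>>>>>    A minimal sketch of that pattern (the event name is arbitrary, and 
>>>>>>>>> v is the vector being assembled):
>>>>>>>>> 
>>>>>>>>>    PetscLogEvent SET_AND_ASSEMBLE;
>>>>>>>>>    PetscCall(PetscLogEventRegister("SetAndAssemble", VEC_CLASSID, &SET_AND_ASSEMBLE));
>>>>>>>>>    PetscCall(PetscLogEventBegin(SET_AND_ASSEMBLE, 0, 0, 0, 0));
>>>>>>>>>    /* ... generate entries and call VecSetValues() ... */
>>>>>>>>>    PetscCall(VecAssemblyBegin(v));
>>>>>>>>>    PetscCall(VecAssemblyEnd(v));
>>>>>>>>>    PetscCall(PetscLogEventEnd(SET_AND_ASSEMBLE, 0, 0, 0, 0));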
>>>>>>>>> 
>>>>>>>>>   Barry
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> > On Jun 30, 2023, at 11:21 AM, Runfeng Jin <[email protected]> wrote:
>>>>>>>>> > 
>>>>>>>>> > Hello!
>>>>>>>>> > 
>>>>>>>>> > When I use PETSc to build an SBAIJ matrix, I noticed a strange thing: 
>>>>>>>>> > when I increase the number of processors, the assembly time becomes 
>>>>>>>>> > smaller, even though every run builds exactly the same matrix. The 
>>>>>>>>> > assembly time mainly arises from message passing, because I use a 
>>>>>>>>> > dynamic workload, so it is random which elements are computed by which 
>>>>>>>>> > processor.
>>>>>>>>> > Instinctively, with more processors it should be more likely that a 
>>>>>>>>> > processor computes elements stored on other processors. But from the 
>>>>>>>>> > output of -log_view, it seems that with more processors, each 
>>>>>>>>> > processor computes more elements stored locally (I infer this because, 
>>>>>>>>> > with more processors, the total amount of passed messages is smaller).
>>>>>>>>> > 
>>>>>>>>> > What could cause this? Thank you!
>>>>>>>>> > 
>>>>>>>>> > 
>>>>>>>>> >  Following is the output of -log_view for 64/128/256 processors. 
>>>>>>>>> > Every row is the time profile of VecAssemblyEnd().
>>>>>>>>> > 
>>>>>>>>> > ------------------------------------------------------------------------------------------------------------------------
>>>>>>>>> > processors          Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total
>>>>>>>>> >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
>>>>>>>>> > 64                  1 1.0 2.3775e+02 1.0 0.00e+00 0.0 6.2e+03 2.3e+04 9.0e+00 52  0  1  1  2  52  0  1  1  2     0
>>>>>>>>> > 128                 1 1.0 6.9945e+01 1.0 0.00e+00 0.0 2.5e+04 1.1e+04 9.0e+00 30  0  1  1  2  30  0  1  1  2     0
>>>>>>>>> > 256                 1 1.0 1.7445e+01 1.0 0.00e+00 0.0 9.9e+04 5.2e+03 9.0e+00 10  0  1  1  2  10  0  1  1  2     0
>>>>>>>>> > 
>>>>>>>>> > Runfeng Jin
>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>>> 
>>> -- 
>>> What most experimenters take for granted before they begin their 
>>> experiments is infinitely more interesting than any results to which their 
>>> experiments lead.
>>> -- Norbert Wiener
>>> 
>>> https://www.cse.buffalo.edu/~knepley/
