> On Jul 3, 2023, at 10:11 AM, Runfeng Jin <[email protected]> wrote:
>
> Hi,
>> We use a hash table to store the nonzeros on the fly, and then convert to
>> packed storage on assembly.
There is "extra memory" since the matrix entries are first stored in a hash
and then converted into the regular CSR format, so for a short while, both
copies are in memory.
We use the amazing khash package, include/petsc/private/khash/khash.h, our
code is scattered around a bit depending on the matrix format we will be
forming.
cd src/mat
git grep "_Hash("
impls/aij/mpi/mpiaij.c:/* defines MatSetValues_MPI_Hash(), MatAssemblyBegin_MPI_Hash(), and MatAssemblyEnd_MPI_Hash() */
impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:/* defines MatSetValues_MPICUSPARSE_Hash() */
impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:  PetscCall(MatSetUp_MPI_Hash(A));
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetValues_MPI_Hash(Mat A, PetscInt m, const PetscInt *rows, PetscInt n, const PetscInt *cols, const PetscScalar *values, InsertMode addv)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatAssemblyBegin_MPI_Hash(Mat A, PETSC_UNUSED MatAssemblyType type)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatAssemblyEnd_MPI_Hash(Mat A, MatAssemblyType type)
impls/aij/mpi/mpihashmat.h:  PetscCall(MatSetValues_MPI_Hash(A, 1, row + i, ncols, col + i, val + i, A->insertmode));
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatDestroy_MPI_Hash(Mat A)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatZeroEntries_MPI_Hash(PETSC_UNUSED Mat A)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetRandom_MPI_Hash(Mat A, PETSC_UNUSED PetscRandom r)
impls/aij/mpi/mpihashmat.h:static PetscErrorCode MatSetUp_MPI_Hash(Mat A)
impls/aij/seq/aij.c:/* defines MatSetValues_Seq_Hash(), MatAssemblyEnd_Seq_Hash(), MatSetUp_Seq_Hash() */
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatAssemblyEnd_Seq_Hash(Mat A, MatAssemblyType type)
impls/aij/seq/seqhashmat.h:  A->preallocated = PETSC_FALSE; /* this was set to true for the MatSetValues_Hash() to work */
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatDestroy_Seq_Hash(Mat A)
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatZeroEntries_Seq_Hash(Mat A)
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatSetRandom_Seq_Hash(Mat A, PetscRandom r)
impls/aij/seq/seqhashmat.h:static PetscErrorCode MatSetUp_Seq_Hash(Mat A)
impls/baij/mpi/mpibaij.c:/* defines MatSetValues_MPI_Hash(), MatAssemblyBegin_MPI_Hash(), and MatAssemblyEnd_MPI_Hash() */
impls/baij/seq/baij.c:/* defines MatSetValues_Seq_Hash(), MatAssemblyEnd_Seq_Hash(), MatSetUp_Seq_Hash() */
impls/sbaij/mpi/mpisbaij.c:/* defines MatSetValues_MPI_Hash(), MatAssemblyBegin_MPI_Hash(), MatAssemblyEnd_MPI_Hash(), MatSetUp_MPI_Hash() */
impls/sbaij/seq/sbaij.c:/* defines MatSetValues_Seq_Hash(), MatAssemblyEnd_Seq_Hash(), MatSetUp_Seq_Hash() */
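For intuition, here is a minimal standalone sketch of the hash-then-pack idea using khash. It is not PETSc's actual implementation; the key packing, the type name "ij", and the toy entries are made up for illustration. Entries are accumulated in a hash map during the set-values phase and only converted to packed storage at "assembly" time, which is when both copies briefly coexist.
----------------
/* hash_then_pack.c: toy illustration of hash-based insertion, not PETSc code */
#include <stdio.h>
#include <stdint.h>
#include "khash.h"                       /* the khash single-header library mentioned above */

KHASH_MAP_INIT_INT64(ij, double)         /* map a packed (row,col) key to a scalar value */

static int64_t ij_key(int32_t i, int32_t j) { return ((int64_t)i << 32) | (uint32_t)j; }

int main(void)
{
  khash_t(ij) *h = kh_init(ij);
  int ret;

  /* "MatSetValues" phase: insert or accumulate entries, no preallocation needed */
  struct { int i, j; double v; } entries[] = {{0, 0, 4.0}, {0, 1, -1.0}, {1, 1, 4.0}, {0, 0, 1.0}};
  for (int n = 0; n < (int)(sizeof(entries) / sizeof(entries[0])); n++) {
    khiter_t k = kh_put(ij, h, ij_key(entries[n].i, entries[n].j), &ret);
    if (ret) kh_value(h, k) = 0.0;       /* new key: initialize */
    kh_value(h, k) += entries[n].v;      /* ADD_VALUES-style accumulation */
  }

  /* "MatAssembly" phase: walk the hash and pack into flat storage
     (real code would also sort column indices within each row) */
  printf("stored %d unique nonzeros\n", (int)kh_size(h));
  for (khiter_t k = kh_begin(h); k != kh_end(h); ++k) {
    if (!kh_exist(h, k)) continue;
    int64_t key = kh_key(h, k);
    printf("A(%d,%d) = %g\n", (int)(key >> 32), (int)(uint32_t)key, kh_value(h, k));
  }

  kh_destroy(ij, h);
  return 0;
}
----------------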
Thanks for the numbers; it is good to see that the performance is so similar to that obtained when providing preallocation information.
>
> Could you perhaps tell me which file implements this function?
>
> Runfeng
>
>
>
> Runfeng Jin <[email protected]> wrote on Monday, July 3, 2023 at 22:05:
>> Thank you for all your help!
>>
>> Runfeng
>>
>> Matthew Knepley <[email protected]> wrote on Monday, July 3, 2023 at 22:03:
>>> On Mon, Jul 3, 2023 at 9:56 AM Runfeng Jin <[email protected]> wrote:
>>>> Hi, impressive performance!
>>>> I use the newest version of PETSc (release branch), and it almost eliminates all assembly and stash time at large processor counts (assembly time 64-4s/128-2s/256-0.2s, stash time all below 2s). For zero programming cost, it is really incredible.
>>>> The older code has a regular arrangement of the number of nonzero elements across rows, so I can make a good rough preallocation. From the data, carefully arranging the data and roughly estimating the maximum number of nonzero elements per row can give better performance than the new version without preallocation. However, in practice I will use the newer version without preallocation because: 1) it takes less programming effort and gives nearly the same performance; 2) memory usage is good (I see no unneeded memory after assembly); 3) accurate preallocation is usually not easy and costs extra time.
>>>> Maybe it would be better to leave some room for the user to give the preallocation a slight hint and thus get better performance, but I have no idea how such a hint could be provided.
>>>> I am also very curious about how PETSc achieves this. How can it know nothing in advance yet achieve such good performance with no wasted memory? Could you explain this?
>>>
>>> We use a hash table to store the nonzeros on the fly, and then convert to
>>> packed storage on assembly.
>>>
>>> Thanks,
>>>
>>> Matt
>>>
>>>> assemble time:
>>>> version\processors       4        8       16      32      64     128     256
>>>> old                  14677s    4694s    1124s    572s     38s      8s      2s
>>>> new                     50s      28s      15s    7.8s      4s      2s    0.4s
>>>> older                   27s      24s      19s     12s     14s       -       -
>>>>
>>>> stash time (max among all processors):
>>>> version\processors       4        8       16      32      64     128     256
>>>> old                   3145s    2554s     673s    329s    201s    142s    138s
>>>> new                      2s       1s      ~0s     ~0s     ~0s     ~0s     ~0s
>>>> older                   10s      73s      18s      5s      1s       -       -
>>>> old: my poor preallocation
>>>> new: the newest version of PETSc with no preallocation
>>>> older: the best-preallocation version of my code
>>>>
>>>>
>>>> Runfeng
>>>>
>>>> Barry Smith <[email protected]> wrote on Monday, July 3, 2023 at 12:19:
>>>>>
>>>>> The main branch of PETSc now supports filling sparse matrices without
>>>>> providing any preallocation information.
>>>>>
>>>>> You can give it a try. Use your current fastest code but just remove
>>>>> ALL the preallocation calls. I would be interested in what kind of
>>>>> performance you get compared to your best current performance.
>>>>>
>>>>> Barry
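A minimal sketch of what "remove ALL the preallocation calls" looks like in practice: create the matrix, set sizes and options, and go straight to MatSetValues()/MatAssembly. The global size 283234 is taken from the matrix discussed later in this thread, and the diagonal-only fill is a placeholder for illustration, not the actual application code.
----------------
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat      A;
  PetscInt rstart, rend, N = 283234;           /* illustrative global size */

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N));
  PetscCall(MatSetFromOptions(A));
  /* no MatXAIJSetPreallocation()/MatMPIAIJSetPreallocation() calls here:
     recent PETSc inserts into a hash table and packs the storage on assembly */
  PetscCall(MatSetUp(A));

  PetscCall(MatGetOwnershipRange(A, &rstart, &rend));
  for (PetscInt i = rstart; i < rend; i++) {   /* toy diagonal fill */
    PetscScalar v = 1.0;
    PetscCall(MatSetValues(A, 1, &i, 1, &i, &v, ADD_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}
----------------
Running such a code with -log_view then produces assembly timings of the kind quoted earlier in this message.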
>>>>>
>>>>>
>>>>>
>>>>>> On Jul 2, 2023, at 11:24 PM, Runfeng Jin <[email protected]> wrote:
>>>>>>
>>>>>> Hi! Good advice!
>>>>>> I set values with the MatSetValues() API, which sets one part of a row at a time (I use a kind of tiling technique, so I cannot get all values of a row at once).
>>>>>> I tested the number of mallocs in these three cases. The number of mallocs decreases as the number of processors increases, and all of them are very large (the matrix is 283234 x 283234, as can be seen below). This should be due to the poor preallocation. I use a rough preallocation: every processor counts the number of nonzero elements in its first 10 rows and uses the largest count to preallocate memory for all local rows. That does not seem to work well.
>>>>>> number_of_processors number_of_max_mallocs_among_all_processors
>>>>>> 64 20000
>>>>>> 128 17000
>>>>>> 256 11000
>>>>>> I then changed my preallocation: I sample 100 evenly spaced rows in every local matrix and use the largest count to preallocate memory for all local rows. Now the assembly time is reduced to a very small value.
>>>>>> number_of_processors number_of_max_mallocs_among_all_processors
>>>>>> 64 3000
>>>>>> 128 700
>>>>>> 256 500
>>>>>> Event   Count   Time (sec)       Flop                               --- Global ---   --- Stage ----   Total
>>>>>>         Max Ratio   Max    Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R   %T %F %M %L %R  Mflop/s
>>>>>> 64      1   1.0  3.8999e+01 1.0  0.00e+00 0.0  7.1e+03 2.9e+05 1.1e+01  15  0  1  8  3   15  0  1  8  3      0
>>>>>> 128     1   1.0  8.5714e+00 1.0  0.00e+00 0.0  2.6e+04 8.1e+04 1.1e+01   5  0  1  4  3    5  0  1  4  3      0
>>>>>> 256     1   1.0  2.5512e+00 1.0  0.00e+00 0.0  1.0e+05 2.3e+04 1.1e+01   2  0  1  3  3    2  0  1  3  3      0
>>>>>> So could the reason "why the assembly time is smaller with an increasing number of processors" be that more processors divide the malloc work, so the total time is reduced?
>>>>>> If so, I still have some questions:
>>>>>> 1. If the preallocation is not accurate, will the performance of the assembly be affected? That is, when processors receive via MPI the elements that should be stored locally, do the extra mallocs happen at that point?
>>>>>> 2. I cannot afford an accurate preallocation because of its cost, so is there a better way to preallocate in my situation?
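For question 2, one possible direction, sketched here only to illustrate the sampling idea described above: estimate_row_nnz() is a hypothetical user routine, and the AIJ preallocation call is shown for simplicity; the SBAIJ analogue would be MatMPISBAIJSetPreallocation(), which also takes a block size.
----------------
#include <petscmat.h>

extern PetscInt estimate_row_nnz(PetscInt row); /* hypothetical, user-supplied estimate */

PetscErrorCode RoughPreallocate(Mat A)
{
  PetscInt rstart, rend, maxnz = 0, nsample = 100;

  PetscFunctionBeginUser;
  PetscCall(MatGetOwnershipRange(A, &rstart, &rend));
  /* sample up to nsample evenly spaced local rows and keep the largest count */
  PetscInt nrows = rend - rstart;
  PetscInt step  = nrows > nsample ? nrows / nsample : 1;
  for (PetscInt i = rstart; i < rend; i += step) maxnz = PetscMax(maxnz, estimate_row_nnz(i));
  /* use the max as a uniform per-row estimate for the diagonal and off-diagonal blocks;
     rows that need more will still malloc, but far fewer than with a bad estimate */
  PetscCall(MatMPIAIJSetPreallocation(A, maxnz, NULL, maxnz, NULL));
  PetscFunctionReturn(PETSC_SUCCESS);
}
----------------
With the hash-based insertion in the newest PETSc, of course, none of this is needed, which is the point of the rest of the thread.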
>>>>>>
>>>>>>
>>>>>> Barry Smith <[email protected]> wrote on Sunday, July 2, 2023 at 00:16:
>>>>>>>
>>>>>>> I see no reason not to trust the times below; they seem reasonable. You get more than a 2x speedup from 64 to 128 ranks and then about 1.38x from 128 to 256.
>>>>>>>
>>>>>>> The total amount of data moved (number of messages times average length) goes from 7.0e+03 * 2.8e+05 = 1.9600e+09 to 2.1060e+09 to 2.3000e+09. That is a pretty moderate increase in data, and note that each time you double the number of ranks you also substantially increase the network hardware available to move data, so one would hope for a good speedup.
>>>>>>>
>>>>>>> Also, the load balance is very good, near 1. With assembly we often see very unbalanced work, and it is difficult to get good speedup when the balance is really off.
>>>>>>>
>>>>>>> It looks like over 90% of the entire run time is coming from setting and assembling the values? Also, the set-values time dominates the assembly time more as the number of ranks grows. Are you setting a single value at a time or a collection of them? How big are the vectors?
>>>>>>>
>>>>>>> Run all three cases with -info :vec to see some information about how many mallocs were needed to hold the stashed vector entries.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Jun 30, 2023, at 10:25 PM, Runfeng Jin <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> Thanks for your reply. I tried using PetscLogEvent(), and the result shows the same conclusion.
>>>>>>>> What I did is:
>>>>>>>> ----------------
>>>>>>>> PetscLogEvent  Mat_assemble_event, Mat_setvalue_event, Mat_setAsse_event;
>>>>>>>> PetscClassId   classid;
>>>>>>>> PetscLogDouble user_event_flops;
>>>>>>>> PetscClassIdRegister("Test assemble and set value", &classid);
>>>>>>>> PetscLogEventRegister("Test only assemble", classid, &Mat_assemble_event);
>>>>>>>> PetscLogEventRegister("Test only set values", classid, &Mat_setvalue_event);
>>>>>>>> PetscLogEventRegister("Test both assemble and set values", classid, &Mat_setAsse_event);
>>>>>>>> PetscLogEventBegin(Mat_setAsse_event, 0, 0, 0, 0);
>>>>>>>> PetscLogEventBegin(Mat_setvalue_event, 0, 0, 0, 0);
>>>>>>>> ...compute elements and use MatSetValues. No call for assembly
>>>>>>>> PetscLogEventEnd(Mat_setvalue_event, 0, 0, 0, 0);
>>>>>>>>
>>>>>>>> PetscLogEventBegin(Mat_assemble_event, 0, 0, 0, 0);
>>>>>>>> MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>>>>>>>> MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>>>>>>>> PetscLogEventEnd(Mat_assemble_event, 0, 0, 0, 0);
>>>>>>>> PetscLogEventEnd(Mat_setAsse_event, 0, 0, 0, 0);
>>>>>>>> ----------------
>>>>>>>>
>>>>>>>> And the output is as follows. By the way, does PETSc record all time between PetscLogEventBegin and PetscLogEventEnd, or just the time spent in PETSc API calls?
>>>>>>>
>>>>>>> It is all of the time.
>>>>>>>
>>>>>>>> ----------------
>>>>>>>> Event           Count   Time (sec)       Flop                               --- Global ---   --- Stage ----   Total
>>>>>>>>                 Max Ratio   Max    Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R   %T %F %M %L %R  Mflop/s
>>>>>>>> 64new           1   1.0  2.3775e+02 1.0  0.00e+00 0.0  6.2e+03 2.3e+04 9.0e+00  52  0  1  1  2   52  0  1  1  2      0
>>>>>>>> 128new          1   1.0  6.9945e+01 1.0  0.00e+00 0.0  2.5e+04 1.1e+04 9.0e+00  30  0  1  1  2   30  0  1  1  2      0
>>>>>>>> 256new          1   1.0  1.7445e+01 1.0  0.00e+00 0.0  9.9e+04 5.2e+03 9.0e+00  10  0  1  1  2   10  0  1  1  2      0
>>>>>>>>
>>>>>>>> 64:
>>>>>>>> only assemble   1   1.0  2.6596e+02 1.0  0.00e+00 0.0  7.0e+03 2.8e+05 1.1e+01  55  0  1  8  3   55  0  1  8  3      0
>>>>>>>> only setvalues  1   1.0  1.9987e+02 1.0  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  41  0  0  0  0   41  0  0  0  0      0
>>>>>>>> Test both       1   1.0  4.6580e+02 1.0  0.00e+00 0.0  7.0e+03 2.8e+05 1.5e+01  96  0  1  8  4   96  0  1  8  4      0
>>>>>>>>
>>>>>>>> 128:
>>>>>>>> only assemble   1   1.0  6.9718e+01 1.0  0.00e+00 0.0  2.6e+04 8.1e+04 1.1e+01  30  0  1  4  3   30  0  1  4  3      0
>>>>>>>> only setvalues  1   1.0  1.4438e+02 1.1  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  60  0  0  0  0   60  0  0  0  0      0
>>>>>>>> Test both       1   1.0  2.1417e+02 1.0  0.00e+00 0.0  2.6e+04 8.1e+04 1.5e+01  91  0  1  4  4   91  0  1  4  4      0
>>>>>>>>
>>>>>>>> 256:
>>>>>>>> only assemble   1   1.0  1.7482e+01 1.0  0.00e+00 0.0  1.0e+05 2.3e+04 1.1e+01  10  0  1  3  3   10  0  1  3  3      0
>>>>>>>> only setvalues  1   1.0  1.3717e+02 1.1  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00  78  0  0  0  0   78  0  0  0  0      0
>>>>>>>> Test both       1   1.0  1.5475e+02 1.0  0.00e+00 0.0  1.0e+05 2.3e+04 1.5e+01  91  0  1  3  4   91  0  1  3  4      0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Runfeng
>>>>>>>>
>>>>>>>> Barry Smith <[email protected]> wrote on Friday, June 30, 2023 at 23:35:
>>>>>>>>>
>>>>>>>>> You cannot look just at the VecAssemblyEnd() time; that will very likely give the wrong impression of the total time it takes to put the values in.
>>>>>>>>>
>>>>>>>>> You need to register a new event, put the PetscLogEventBegin() just before you start generating the vector entries and calling VecSetValues(), and put the PetscLogEventEnd() just after the VecAssemblyEnd(). This is the only way to get an accurate accounting of the time.
>>>>>>>>>
>>>>>>>>> Barry
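A short sketch of the pattern Barry describes, with illustrative names and sizes (not code from this thread): one user-registered event spans both the VecSetValues() loop and the assembly calls, so the stash/communication cost is included in the measurement.
----------------
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec           v;
  PetscLogEvent fill_event;
  PetscInt      rstart, rend;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscLogEventRegister("VecFillAndAssemble", VEC_CLASSID, &fill_event));
  PetscCall(VecCreate(PETSC_COMM_WORLD, &v));
  PetscCall(VecSetSizes(v, PETSC_DECIDE, 1000));   /* toy global size */
  PetscCall(VecSetFromOptions(v));
  PetscCall(VecGetOwnershipRange(v, &rstart, &rend));

  /* the event covers entry generation, VecSetValues(), and both assembly calls */
  PetscCall(PetscLogEventBegin(fill_event, 0, 0, 0, 0));
  for (PetscInt i = rstart; i < rend; i++) {
    PetscScalar val = (PetscScalar)i;
    PetscCall(VecSetValues(v, 1, &i, &val, INSERT_VALUES));
  }
  PetscCall(VecAssemblyBegin(v));
  PetscCall(VecAssemblyEnd(v));
  PetscCall(PetscLogEventEnd(fill_event, 0, 0, 0, 0));

  PetscCall(VecDestroy(&v));
  PetscCall(PetscFinalize());
  return 0;
}
----------------
The registered event then shows up as its own line in -log_view, which is what the "Test both" rows quoted above correspond to for the matrix case.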
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > On Jun 30, 2023, at 11:21 AM, Runfeng Jin <[email protected]> wrote:
>>>>>>>>> >
>>>>>>>>> > Hello!
>>>>>>>>> >
>>>>>>>>> > When I use PETSc to build an SBAIJ matrix, I noticed something strange: when I increase the number of processors, the assembly time becomes smaller, even though it is exactly the same matrix in every case. The assembly time mainly comes from message passing, because I use a dynamic workload, so which elements are computed by which processor is random.
>>>>>>>>> > Intuitively, with more processors it should be more likely that a processor computes elements that are stored on other processors. But from the log_view output, it seems that with more processors each processor computes more of the elements stored locally (inferred from the fact that, with more processors, the total amount of passed messages is smaller).
>>>>>>>>> >
>>>>>>>>> > What could cause this? Thank you!
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > The following is the log_view output for 64/128/256 processors. Every row is the time profile of VecAssemblyEnd.
>>>>>>>>> >
>>>>>>>>> > ------------------------------------------------------------------------------------------------------------------------
>>>>>>>>> > processors   Count   Time (sec)       Flop                               --- Global ---   --- Stage ----   Total
>>>>>>>>> >              Max Ratio   Max    Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R   %T %F %M %L %R  Mflop/s
>>>>>>>>> > 64           1   1.0  2.3775e+02 1.0  0.00e+00 0.0  6.2e+03 2.3e+04 9.0e+00  52  0  1  1  2   52  0  1  1  2      0
>>>>>>>>> > 128          1   1.0  6.9945e+01 1.0  0.00e+00 0.0  2.5e+04 1.1e+04 9.0e+00  30  0  1  1  2   30  0  1  1  2      0
>>>>>>>>> > 256          1   1.0  1.7445e+01 1.0  0.00e+00 0.0  9.9e+04 5.2e+03 9.0e+00  10  0  1  1  2   10  0  1  1  2      0
>>>>>>>>> >
>>>>>>>>> > Runfeng Jin
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/