Thanks, Matt!

Sincerely,
SG
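For anyone reading along, the three copy rows Matt points to can be totalled by hand. Below is a minimal C sketch of that arithmetic. The times and sizes are typed in from the -log_view rows quoted in this thread; the struct and function names are made up for illustration and are not PETSc API. Dividing the Size column by the event time is only a rough effective-bandwidth estimate, not something the log itself reports.

```c
#include <assert.h>
#include <stddef.h>

/* One copy event as read off a -log_view row (hypothetical helper type). */
typedef struct {
  const char *name;   /* event name from the log */
  double      time_s; /* "Time (sec) Max" column */
  double      mbytes; /* CpuToGpu or GpuToCpu "Size" column, in Mbytes */
} CopyEvent;

/* Values typed in from the three rows quoted in this thread. */
const CopyEvent copy_events[] = {
  {"VecCUDACopyTo",    1.5322e-02, 6.23e+01},
  {"VecCUDACopyFrom",  1.5837e-02, 6.23e+01},
  {"MatCUSPARSCopyTo", 1.5229e-01, 1.93e+03},
};
const size_t n_copy_events = sizeof(copy_events) / sizeof(copy_events[0]);

/* Sum of the max time over all listed copy events, in seconds. */
double total_copy_time(const CopyEvent *ev, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; i++) sum += ev[i].time_s;
  return sum;
}

/* Copy time as a percentage of the total run time. */
double copy_time_percent(const CopyEvent *ev, size_t n, double run_time_s)
{
  assert(run_time_s > 0.0);
  return 100.0 * total_copy_time(ev, n) / run_time_s;
}

/* Rough effective bandwidth of one event in Mbytes/s (size / time). */
double effective_bandwidth(const CopyEvent *ev)
{
  assert(ev->time_s > 0.0);
  return ev->mbytes / ev->time_s;
}
```

With the 1.172e+02 s total from the performance summary, copy_time_percent(copy_events, n_copy_events, 1.172e+02) comes to roughly 0.16 percent, so by this measure the host/device transfers are a small part of the run.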

On Tue, Oct 19, 2021 at 9:34 PM Matthew Knepley <[email protected]> wrote:

> On Tue, Oct 19, 2021 at 9:18 PM Swarnava Ghosh <[email protected]> wrote:
>
>> Thank you Junchao! Is it possible to determine how much time is being
>> spent on data transfer from the CPU mem to the GPU mem from the log?
>>
>
> It looks like
>
> VecCUDACopyTo        891 1.1 1.5322e-02   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  0  0  0  0  0      0       0    842 6.23e+01     0 0.00e+00   0
> VecCUDACopyFrom      891 1.1 1.5837e-02   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  0  0  0  0  0      0       0      0 0.00e+00   842 6.23e+01   0
> MatCUSPARSCopyTo     891 1.1 1.5229e-01   1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  0  0  0  0  0      0       0    842 1.93e+03     0 0.00e+00   0
>
> Thanks,
>
> Matt
>
>> ************************************************************************************************************************
>> ***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
>> ************************************************************************************************************************
>>
>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>
>> /ccsopen/home/swarnava/MiniApp_xl_cu/bin/sq on a named h49n15 with 4 processors, by swarnava Tue Oct 19 21:10:56 2021
>> Using Petsc Release Version 3.15.0, Mar 30, 2021
>>
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           1.172e+02     1.000   1.172e+02
>> Objects:              1.160e+02     1.000   1.160e+02
>> Flop:                 5.832e+10     1.125   5.508e+10  2.203e+11
>> Flop/sec:             4.974e+08     1.125   4.698e+08  1.879e+09
>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Reductions:       1.320e+02     1.000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>                             and VecAXPY()
>>                             for complex vectors of length N --> 8N flop
>>
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>>  0:      Main Stage: 1.1725e+02 100.0%  2.2033e+11 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  1.140e+02  86.4%
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>> Phase summary info:
>>    Count: number of times phase was executed
>>    Time and Flop: Max - maximum over all processors
>>                   Ratio - ratio of maximum to minimum over all processors
>>    Mess: number of messages sent
>>    AvgLen: average message length (bytes)
>>    Reduct: number of global reductions
>>    Global: entire computation
>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>       %T - percent time in this phase         %F - percent flop in this phase
>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>       %R - percent reductions in this phase
>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>    GPU %F: percent flops on GPU in this event
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                               --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> BuildTwoSided          2 1.0 6.2501e-03 145.1 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  2  0  0  0  0  2      0       0      0 0.00e+00     0 0.00e+00   0
>> BuildTwoSidedF         2 1.0 6.2628e-03 123.2 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  2  0  0  0  0  2      0       0      0 0.00e+00     0 0.00e+00   0
>> VecDot             89991 1.1 3.4663e+00   1.2 1.67e+09 1.1 0.0e+00 0.0e+00 0.0e+00  3  3  0  0  0  3  3  0  0  0   1816    1841      0 0.00e+00 84992 6.80e-01 100
>> VecNorm            89991 1.1 5.5282e+00   1.2 1.67e+09 1.1 0.0e+00 0.0e+00 0.0e+00  4  3  0  0  0  4  3  0  0  0   1139    1148      0 0.00e+00 84992 6.80e-01 100
>> VecScale           89991 1.1 1.3902e+00   1.2 8.33e+08 1.1 0.0e+00 0.0e+00 0.0e+00  1  1  0  0  0  1
>>    1  0  0  0   2265    2343  84992 6.80e-01     0 0.00e+00 100
>> VecCopy           178201 1.1 2.9825e+00   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  2  0  0  0  0      0       0      0 0.00e+00     0 0.00e+00   0
>> VecSet              3589 1.1 1.0195e-01   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  0  0  0  0  0      0       0      0 0.00e+00     0 0.00e+00   0
>> VecAXPY           179091 1.1 2.7456e+00   1.2 3.32e+09 1.1 0.0e+00 0.0e+00 0.0e+00  2  6  0  0  0  2  6  0  0  0   4564    4739 169142 1.35e+00     0 0.00e+00 100
>> VecCUDACopyTo        891 1.1 1.5322e-02   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  0  0  0  0  0      0       0    842 6.23e+01     0 0.00e+00   0
>> VecCUDACopyFrom      891 1.1 1.5837e-02   1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  0  0  0  0  0      0       0      0 0.00e+00   842 6.23e+01   0
>> DMCreateMat            5 1.0 7.3491e-01   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 7.0e+00  1  0  0  0  5  1  0  0  0  6      0       0      0 0.00e+00     0 0.00e+00   0
>> SFSetGraph             5 1.0 3.5016e-04   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  0  0  0  0  0      0       0      0 0.00e+00     0 0.00e+00   0
>> MatMult            89991 1.1 2.0423e+00   1.2 5.08e+10 1.1 0.0e+00 0.0e+00 0.0e+00  2 87  0  0  0  2 87  0  0  0  94039  105680   1683 2.00e+03     0 0.00e+00 100
>> MatCopy              891 1.1 1.3600e-01   1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  0  0  0  0  0      0       0      0 0.00e+00     0 0.00e+00   0
>> MatConvert             2 1.0 1.0489e+00   1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  1  0  0  0  0      0       0      0 0.00e+00     0 0.00e+00   0
>> MatScale               2 1.0 2.7950e-04   1.3 3.18e+05 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  0  0  0  0  0   4530       0      0 0.00e+00     0 0.00e+00   0
>> MatAssemblyBegin       7 1.0 6.3768e-03  68.8 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  2  0  0  0  0  2      0       0      0 0.00e+00     0 0.00e+00   0
>> MatAssemblyEnd         7 1.0 7.9870e-03   1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00  0  0  0  0  3  0  0  0  0  4      0       0      0 0.00e+00     0 0.00e+00   0
>> MatCUSPARSCopyTo     891 1.1 1.5229e-01   1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  0  0  0  0  0      0       0    842 1.93e+03     0 0.00e+00   0
>>
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Memory usage is given in bytes:
>>
>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>> Reports information only for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>>              Vector    69             11        19112     0.
>>    Distributed Mesh     3              0            0     0.
>>           Index Set    12             10       187512     0.
>>   IS L to G Mapping     3              0            0     0.
>>   Star Forest Graph    11              0            0     0.
>>     Discrete System     3              0            0     0.
>>           Weak Form     3              0            0     0.
>>   Application Order     1              0            0     0.
>>              Matrix     8              0            0     0.
>>       Krylov Solver     1              0            0     0.
>>      Preconditioner     1              0            0     0.
>>              Viewer     1              0            0     0.
>> ========================================================================================================================
>> Average time to get PetscTime(): 4.32e-08
>> Average time for MPI_Barrier(): 9.94e-07
>> Average time for zero size MPI_Send(): 4.20135e-05
>>
>> Sincerely,
>> SG
>>
>> On Tue, Oct 19, 2021 at 12:28 AM Junchao Zhang <[email protected]> wrote:
>>
>>> On Mon, Oct 18, 2021 at 10:56 PM Swarnava Ghosh <[email protected]> wrote:
>>>
>>>> I am trying to port parts of the following function to GPUs.
>>>> Essentially, the lines of code between the two "TODO..." comments should
>>>> be executed on the device.
>>>> Here is the function:
>>>>
>>>> PetscScalar CalculateSpectralNodesAndWeights(LSDFT_OBJ *pLsdft, int p, int LIp)
>>>> {
>>>>
>>>>   PetscInt N_qp;
>>>>   N_qp = pLsdft->N_qp;
>>>>
>>>>   int k;
>>>>   PetscScalar *a, *b;
>>>>   k=0;
>>>>
>>>>   PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &a);
>>>>   PetscMalloc(sizeof(PetscScalar)*(N_qp+1), &b);
>>>>
>>>>   /*
>>>>    * TODO: COPY a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1,
>>>>      pLsdft->LapPlusVeffOprloc, k,p,N_qp from HOST to DEVICE
>>>>    * DO THE FOLLOWING OPERATIONS ON DEVICE
>>>>    */
>>>>
>>>>   //zero out vectors
>>>>   VecZeroEntries(pLsdft->Vk);
>>>>   VecZeroEntries(pLsdft->Vkm1);
>>>>   VecZeroEntries(pLsdft->Vkp1);
>>>>
>>>>   VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES);
>>>>   MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vkm1,pLsdft->Vk);
>>>>   VecDot(pLsdft->Vkm1, pLsdft->Vk, &a[0]);
>>>>   VecAXPY(pLsdft->Vk, -a[0], pLsdft->Vkm1);
>>>>   VecNorm(pLsdft->Vk, NORM_2, &b[0]);
>>>>   VecScale(pLsdft->Vk, 1.0 / b[0]);
>>>>
>>>>   for (k = 0; k < N_qp; k++) {
>>>>     MatMult(pLsdft->LapPlusVeffOprloc,pLsdft->Vk,pLsdft->Vkp1);
>>>>     VecDot(pLsdft->Vk, pLsdft->Vkp1, &a[k + 1]);
>>>>     VecAXPY(pLsdft->Vkp1, -a[k + 1], pLsdft->Vk);
>>>>     VecAXPY(pLsdft->Vkp1, -b[k], pLsdft->Vkm1);
>>>>     VecCopy(pLsdft->Vk, pLsdft->Vkm1);
>>>>     VecNorm(pLsdft->Vkp1, NORM_2, &b[k + 1]);
>>>>     VecCopy(pLsdft->Vkp1, pLsdft->Vk);
>>>>     VecScale(pLsdft->Vk, 1.0 / b[k + 1]);
>>>>   }
>>>>
>>>>   /*
>>>>    * TODO: Copy back a, b, pLsdft->Vk, pLsdft->Vkm1, pLsdft->Vkp1,
>>>>      pLsdft->LapPlusVeffOprloc, k,p,N_qp from DEVICE to HOST
>>>>    */
>>>>
>>>>   /*
>>>>    * Some operation with a, and b on HOST
>>>>    *
>>>>    */
>>>>   TridiagEigenVecSolve_NodesAndWeights(pLsdft, a, b, N_qp, LIp); // operation on the host
>>>>
>>>>   // free a,b
>>>>   PetscFree(a);
>>>>   PetscFree(b);
>>>>
>>>>   return 0;
>>>> }
>>>>
>>>> If I just use the command line options to set vectors Vk,Vkp1 and Vkm1
>>>> as cuda vectors and the matrix LapPlusVeffOprloc as aijcusparse, will the
>>>> lines of code between the two "TODO"
>>>> comments be entirely executed on the device?
>>>>
>>> yes, except VecSetValue(pLsdft->Vkm1, p, 1.0, INSERT_VALUES); which is
>>> done on CPU, by pulling down vector data from GPU to CPU and setting the
>>> value. Subsequent vector operations will push the updated vector data to
>>> GPU again.
>>>
>>>> Sincerely,
>>>> Swarnava
>>>>
>>>> On Mon, Oct 18, 2021 at 10:13 PM Swarnava Ghosh <[email protected]> wrote:
>>>>
>>>>> Thanks for the clarification, Junchao.
>>>>>
>>>>> Sincerely,
>>>>> Swarnava
>>>>>
>>>>> On Mon, Oct 18, 2021 at 10:08 PM Junchao Zhang <[email protected]> wrote:
>>>>>
>>>>>> On Mon, Oct 18, 2021 at 8:47 PM Swarnava Ghosh <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Junchao,
>>>>>>>
>>>>>>> If I want to pass command line options as -mymat_mat_type
>>>>>>> aijcusparse, should it be MatSetOptionsPrefix(A,"mymat"); or
>>>>>>> MatSetOptionsPrefix(A,"mymat_"); ? Could you please clarify?
>>>>>>>
>>>>>> my fault, it should be MatSetOptionsPrefix(A,"mymat_"), as seen in
>>>>>> mat/tests/ex62.c
>>>>>> Thanks
>>>>>>
>>>>>>> Sincerely,
>>>>>>> Swarnava
>>>>>>>
>>>>>>> On Mon, Oct 18, 2021 at 9:23 PM Junchao Zhang <[email protected]> wrote:
>>>>>>>
>>>>>>>> MatSetOptionsPrefix(A,"mymat")
>>>>>>>> VecSetOptionsPrefix(v,"myvec")
>>>>>>>>
>>>>>>>> --Junchao Zhang
>>>>>>>>
>>>>>>>> On Mon, Oct 18, 2021 at 8:04 PM Chang Liu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Junchao,
>>>>>>>>>
>>>>>>>>> Thank you for your answer. I tried MatConvert and it works. I didn't
>>>>>>>>> make it before because I forgot to convert a vector from mpi to
>>>>>>>>> mpicuda previously.
>>>>>>>>>
>>>>>>>>> For vector, there is no VecConvert to use, so I have to do
>>>>>>>>> VecDuplicate, VecSetType and VecCopy. Is there an easier option?
>>>>>>>>>
>>>>>>>> As Matt suggested, you could single out the matrix and vector with
>>>>>>>> options prefix and set their type on command line
>>>>>>>>
>>>>>>>> MatSetOptionsPrefix(A,"mymat");
>>>>>>>> VecSetOptionsPrefix(v,"myvec");
>>>>>>>>
>>>>>>>> Then, -mymat_mat_type aijcusparse -myvec_vec_type cuda
>>>>>>>>
>>>>>>>> A simpler code is to have the vector type automatically set by
>>>>>>>> MatCreateVecs(A,&v,NULL)
>>>>>>>>
>>>>>>>>> Chang
>>>>>>>>>
>>>>>>>>> On 10/18/21 5:23 PM, Junchao Zhang wrote:
>>>>>>>>> >
>>>>>>>>> > On Mon, Oct 18, 2021 at 3:42 PM Chang Liu via petsc-users
>>>>>>>>> > <[email protected]> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi Matt,
>>>>>>>>> >
>>>>>>>>> > I have a related question. In my code I have many matrices and I only
>>>>>>>>> > want to have one living on GPU, the others still staying on CPU mem.
>>>>>>>>> >
>>>>>>>>> > I wonder if there is an easier way to copy a mpiaij matrix to
>>>>>>>>> > mpiaijcusparse (in other words, copy data to GPUs). I can think of
>>>>>>>>> > creating a new mpiaijcusparse matrix, and copying the data line by
>>>>>>>>> > line. But I wonder if there is a better option.
>>>>>>>>> >
>>>>>>>>> > I have tried MatCopy and MatConvert but neither works.
>>>>>>>>> >
>>>>>>>>> > Did you use MatConvert(mat,matype,MAT_INPLACE_MATRIX,&mat)?
>>>>>>>>> >
>>>>>>>>> > Chang
>>>>>>>>> >
>>>>>>>>> > On 10/17/21 7:50 PM, Matthew Knepley wrote:
>>>>>>>>> > > On Sun, Oct 17, 2021 at 7:12 PM Swarnava Ghosh
>>>>>>>>> > > <[email protected]> wrote:
>>>>>>>>> > >
>>>>>>>>> > > Do I need to convert the MATSEQBAIJ to a cuda matrix in code?
>>>>>>>>> > >
>>>>>>>>> > > You would need a call to MatSetFromOptions() to take that type from the
>>>>>>>>> > > command line, and not have the type hard-coded in your application.
>>>>>>>>> > > It is generally a bad idea to hard code the implementation type.
>>>>>>>>> > >
>>>>>>>>> > > If I do it from command line, then are the other MatVec calls
>>>>>>>>> > > ported onto CUDA? I have many MatVec calls in my code, but I
>>>>>>>>> > > specifically want to port just one call.
>>>>>>>>> > >
>>>>>>>>> > > You can give that one matrix an options prefix to isolate it.
>>>>>>>>> > >
>>>>>>>>> > > Thanks,
>>>>>>>>> > >
>>>>>>>>> > > Matt
>>>>>>>>> > >
>>>>>>>>> > > Sincerely,
>>>>>>>>> > > Swarnava
>>>>>>>>> > >
>>>>>>>>> > > On Sun, Oct 17, 2021 at 7:07 PM Junchao Zhang
>>>>>>>>> > > <[email protected]> wrote:
>>>>>>>>> > >
>>>>>>>>> > > You can do that with command line options -mat_type aijcusparse
>>>>>>>>> > > -vec_type cuda
>>>>>>>>> > >
>>>>>>>>> > > On Sun, Oct 17, 2021, 5:32 PM Swarnava Ghosh
>>>>>>>>> > > <[email protected]> wrote:
>>>>>>>>> > >
>>>>>>>>> > > Dear Petsc team,
>>>>>>>>> > >
>>>>>>>>> > > I had a query regarding using CUDA to accelerate a matrix
>>>>>>>>> > > vector product.
>>>>>>>>> > > I have a sequential sparse matrix (MATSEQBAIJ type). I want
>>>>>>>>> > > to port a MatVec call onto GPUs. Is there any code/example I
>>>>>>>>> > > can look at?
>>>>>>>>> > >
>>>>>>>>> > > Sincerely,
>>>>>>>>> > > SG
>>>>>>>>> > >
>>>>>>>>> > > --
>>>>>>>>> > > What most experimenters take for granted before they begin their
>>>>>>>>> > > experiments is infinitely more interesting than any results to which
>>>>>>>>> > > their experiments lead.
>>>>>>>>> > > -- Norbert Wiener
>>>>>>>>> > >
>>>>>>>>> > > https://www.cse.buffalo.edu/~knepley/
>>>>>>>>> >
>>>>>>>>> > --
>>>>>>>>> > Chang Liu
>>>>>>>>> > Staff Research Physicist
>>>>>>>>> > +1 609 243 3438
>>>>>>>>> > [email protected]
>>>>>>>>> > Princeton Plasma Physics Laboratory
>>>>>>>>> > 100 Stellarator Rd, Princeton NJ 08540, USA
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Chang Liu
>>>>>>>>> Staff Research Physicist
>>>>>>>>> +1 609 243 3438
>>>>>>>>> [email protected]
>>>>>>>>> Princeton Plasma Physics Laboratory
>>>>>>>>> 100 Stellarator Rd, Princeton NJ 08540, USA
>>>>>>>>>
>>>>>>>>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/
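
A note on the prefix correction earlier in the thread ("mymat" versus "mymat_"): the option on the command line is the prefix followed directly by the option name, so the trailing underscore has to be part of the prefix string. The plain-C sketch below only imitates that concatenation to show the difference. build_option_key is a hypothetical helper written for this note, not a PETSc function, and the "-" + prefix + name rule is stated here as an assumption drawn from Junchao's correction.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of PETSc: builds the command-line key as
   "-" + prefix + option name, the rule assumed in the note above.
   Returns 0 on success, -1 if the key did not fit in the buffer. */
int build_option_key(char *out, size_t len, const char *prefix, const char *name)
{
  assert(out != NULL && name != NULL);
  int n = snprintf(out, len, "-%s%s", prefix ? prefix : "", name);
  return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

Under that assumption, the prefix "mymat_" with the option name "mat_type" gives -mymat_mat_type, which matches the option used in the thread, while the prefix "mymat" gives -mymatmat_type, which does not.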
