Hi,

Is the MatMult function performed on the GPU? When I ran a program that just executes this function with the parameters -vec_type cuda and -mat_type seqaijcuda, I didn't see any VecCUDACopyTo entry in the summary log.
On Sat, 2010-12-11, at 11:50 -0600, Barry Smith wrote:
> To answer this you need to understand that PETSc copies vectors and matrices
> to the GPU memory "on demand" (that is, exactly when they are first needed on
> the GPU, and not before), and once it has copied them to the GPU it keeps
> track of that and will NOT copy them down again if they are already there.
>
> Hence in your run below, yes, it includes the copy time down.
>
> But note that ONE multiply on the GPU is absurd; it does not make sense to
> copy a matrix down to the GPU and then do ONE multiply with it. Thus I NEVER
> do "standalone" benchmarking where a single kernel is called by itself once;
> the timing results are useless. Always run a FULL application with
> -log_summary; for example, in this case a full KSPSolve() that requires a
> bunch of iterations. Then you can look at the performance of each kernel. The
> reason to do it this way is that the numbers can be very different, and what
> matters is what runs in APPLICATIONS, so that is what should be measured.
>
> If, say, you run KSP with 20 iterations, then the time to copy the matrix
> down to the GPU is amortized over those 20 iterations and thus may be ok. You
> should see the flop rate for the MatMult() go up in this case.
>
> You may have noticed we have a log entry for VecCopyToGPU(); we will be
> adding one for matrices as well, so you will be able to see how long the
> copy takes. Note, however, that the copy time is still counted in the
> MatMult() time if the first copy of the matrix to the GPU is triggered by the
> MatMult. You can subtract the copy time from the mult time to get the
> per-multiply time; this would correspond to the multiply time in the limit of
> a single copy down and many, many multiplies on the GPU.
>
> Barry
>
> On Dec 11, 2010, at 8:32 AM, Jakub Pola wrote:
>
> > Hello again,
> >
> > I compiled one of the examples. I used a sparse matrix called 02-raefsky3.
> > I used -vec_type cuda and -mat_type seqaijcuda.
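Barry's subtract-and-amortize reasoning can be sketched numerically. This is only an illustration: the 2.0237e-02 s total MatMult time is taken from the log below, but the copy time and the iteration count are hypothetical numbers chosen for the example.

```python
# Sketch of the amortization arithmetic described above.
# Only total_matmult_time comes from the -log_summary output below;
# copy_time and iterations are hypothetical.

total_matmult_time = 2.0237e-2  # s, one MatMult including the host-to-GPU copy
copy_time = 1.8e-2              # s, hypothetical one-time matrix copy-down
iterations = 20                 # e.g. a KSP solve taking 20 iterations

# Per-multiply time once the copy has been paid for
# (the "single copy down, many multiplies" limit).
per_mult = total_matmult_time - copy_time

# Over a 20-iteration solve, each multiply carries only 1/20 of the copy cost.
amortized = copy_time / iterations + per_mult

print(f"per-multiply time (copy excluded): {per_mult:.2e} s")
print(f"effective time per multiply over {iterations} iterations: {amortized:.2e} s")
```

The point of the example is only that the one-time copy dominates a single multiply but becomes a small per-iteration overhead in a full solve, which is why the MatMult flop rate goes up.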
> >
> > When I look at the summary of the operations performed by the program,
> > there is
> >
> > MatMult  1 1.0 2.0237e-02 1.0 2.98e+06 1.0 0.0e+00 0.0e+00 0.0e+00  2 100  0  0  0   2 100  0  0  0   147
> >
> > Does the time reported for MatMult include the memory transfer for loading
> > the matrix into GPU memory, or just the computation time itself?
> >
> > Thanks in advance.
> > Kuba.
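As a sanity check on the log line quoted above: the 147 in the last column is the flop rate in MFlop/s, i.e. the flop count divided by the wall time (which here includes the copy-down, per the explanation above). Both input numbers are taken directly from the log line.

```python
# Reproducing the flop-rate column of the -log_summary line above.
flops = 2.98e6      # total flops for the one MatMult, from the log
time_s = 2.0237e-2  # wall time for MatMult, from the log (includes GPU copy)

mflops = flops / time_s / 1e6
print(f"{mflops:.0f} MFlop/s")  # prints "147 MFlop/s", matching the log
```

This is why, once the copy is amortized over many multiplies in a real solve, the reported rate for MatMult rises: the same flop count is divided by a much smaller per-multiply time.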
