On Thu, Dec 1, 2011 at 5:39 AM, Fredrik Heffer Valdmanis <fredva at ifi.uio.no> wrote:
>
> 2011/11/29 Matthew Knepley <knepley at gmail.com>
>
>> On Tue, Nov 29, 2011 at 10:37 AM, Fredrik Heffer Valdmanis
>> <fredva at ifi.uio.no> wrote:
>>
>>> 2011/11/29 Matthew Knepley <knepley at gmail.com>
>>>
>>>> On Tue, Nov 29, 2011 at 2:38 AM, Fredrik Heffer Valdmanis
>>>> <fredva at ifi.uio.no> wrote:
>>>>
>>>>> 2011/10/28 Matthew Knepley <knepley at gmail.com>
>>>>>
>>>>>> On Fri, Oct 28, 2011 at 10:24 AM, Fredrik Heffer Valdmanis
>>>>>> <fredva at ifi.uio.no> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am working on integrating the new GPU-based vectors and matrices
>>>>>>> into FEniCS. Now, I'm looking at the possibility for getting some
>>>>>>> speedup during finite element assembly, specifically when inserting
>>>>>>> the local element matrix into the global element matrix. In that
>>>>>>> regard, I have a few questions I hope you can help me out with:
>>>>>>>
>>>>>>> - When calling MatSetValues with a MATSEQAIJCUSP matrix as
>>>>>>>   parameter, what exactly is it that happens? As far as I can see,
>>>>>>>   MatSetValues is not implemented for GPU-based matrices, neither is
>>>>>>>   the mat->ops->setvalues set to point at any function for this Mat
>>>>>>>   type.
>>>>>>
>>>>>> Yes, MatSetValues always operates on the CPU side. It would not make
>>>>>> sense to do individual operations on the GPU.
>>>>>>
>>>>>> I have written batched assembly for element matrices that are all
>>>>>> the same size:
>>>>>>
>>>>>> http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatSetValuesBatch.html
>>>>>>
>>>>>>> - Is it such that matrices are assembled in their entirety on the
>>>>>>>   CPU, and then copied over to the GPU (after calling
>>>>>>>   MatAssemblyBegin)? Or are values copied over to the GPU each time
>>>>>>>   you call MatSetValues?
>>>>>>
>>>>>> That function assembles the matrix on the GPU and then copies to the
>>>>>> CPU.
>>>>>> The only time you do not want this copy is when you are running in
>>>>>> serial and never touch the matrix afterwards, so I left it in.
>>>>>>
>>>>>>> - Can we expect to see any speedup from using MatSetValuesBatch over
>>>>>>>   MatSetValues, or is the batch version simply a utility function?
>>>>>>>   This question goes for both CPU- and GPU-based matrices.
>>>>>>
>>>>>> CPU: no
>>>>>>
>>>>>> GPU: yes, I see about the memory bandwidth ratio
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have now integrated MatSetValuesBatch in our existing PETSc wrapper
>>>>> layer. I have tested matrix assembly with Poisson's equation on
>>>>> different meshes with elements of varying order. I have timed the
>>>>> single call to MatSetValuesBatch and compared that to the total time
>>>>> consumed by the repeated calls to MatSetValues in the old
>>>>> implementation. I have the following results:
>>>>>
>>>>> Poisson on 1000x1000 unit square, 1st order Lagrange elements:
>>>>> MatSetValuesBatch: 0.88576 s
>>>>> repeated calls to MatSetValues: 0.76654 s
>>>>>
>>>>> Poisson on 500x500 unit square, 2nd order Lagrange elements:
>>>>> MatSetValuesBatch: 0.9324 s
>>>>> repeated calls to MatSetValues: 0.81644 s
>>>>>
>>>>> Poisson on 300x300 unit square, 3rd order Lagrange elements:
>>>>> MatSetValuesBatch: 0.93988 s
>>>>> repeated calls to MatSetValues: 1.03884 s
>>>>>
>>>>> As you can see, the two methods take almost the same amount of time.
>>>>> What behavior and performance should we expect? Is there any way to
>>>>> optimize the performance of batched assembly?
>>>>
>>>> Almost certainly it is not dispatching to the CUDA version. The regular
>>>> version just calls MatSetValues() in a loop. Are you using a SEQAIJCUSP
>>>> matrix?
>>>
>>> Yes. The same matrices yield a speedup of 4-6x when solving the system
>>> on the GPU.
>>
>> Please confirm that the correct routine is called by running with -info
>> and sending the output.
>>
>> Please send the output of -log_summary so I can confirm the results.
>>
>> You can run KSP ex4 and reproduce my results where I see a 5.5x speedup
>> on the GTX285
>
> I am not sure what to look for in those outputs. I have uploaded the
> output of running my assembly program with -info and -log_summary, and the
> output of running ex4 with -log_summary. See
>
> http://folk.uio.no/fredva/assembly_info.txt
> http://folk.uio.no/fredva/assembly_log_summary.txt
> http://folk.uio.no/fredva/ex4_log_summary.txt
>
> Trying this on a different machine now, I actually see some speedup. 3rd
> order Poisson on 300x300 assembles in 0.211 sec on GPU and 0.4232 sec on
> CPU. For 1st order and 1000x1000 mesh, I go from 0.31 sec to 0.205 sec.
> I have tried to increase the mesh size to see if the speedup increases,
> but I hit the bad_alloc error pretty quick.
>
> For a problem of that size, should I expect even more speedup? Please let
> me know if you need any more output from test runs on my machine.
>

Here are my results for nxn grids where n = range(150, 1350, 100). This is
using a GTX 285. What card are you using?

   Matt

> --
> Fredrik
>

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AssemblyResults.pdf
Type: application/pdf
Size: 63138 bytes
Desc: not available
URL: <http://lists.mcs.anl.gov/pipermail/petsc-users/attachments/20111201/047edf0c/attachment-0001.pdf>
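
For reference, the timings quoted in this thread can be turned into speedup ratios with a few lines of Python. This is only an arithmetic sketch over the numbers reported in the messages. The dictionary layout, the case labels, and the helper names `speedup` and `report` are made up for illustration and are not part of PETSc or FEniCS.

```python
# Wall-clock assembly times in seconds, copied from the thread.
# First machine: (repeated MatSetValues, single MatSetValuesBatch call).
# The labels and the data layout are illustrative assumptions.
first_machine = {
    "1st order, 1000x1000": (0.76654, 0.88576),
    "2nd order, 500x500": (0.81644, 0.9324),
    "3rd order, 300x300": (1.03884, 0.93988),
}

# Second machine: (CPU time, GPU time) for the two cases that were reported.
second_machine = {
    "1st order, 1000x1000": (0.31, 0.205),
    "3rd order, 300x300": (0.4232, 0.211),
}


def speedup(baseline, candidate):
    """Return baseline / candidate; values above 1 mean the candidate is faster."""
    return baseline / candidate


def report(title, timings):
    """Print one speedup ratio per test case."""
    print(title)
    for case, (baseline, candidate) in timings.items():
        print(f"  {case}: {speedup(baseline, candidate):.2f}x")


report("First machine (MatSetValues vs. MatSetValuesBatch):", first_machine)
report("Second machine (CPU vs. GPU):", second_machine)
```

On the first machine the ratios stay close to 1 (roughly 0.87x to 1.11x). On the second machine they come out at about 1.5x and 2.0x, which is still well below the 5.5x quoted for KSP ex4 on the GTX 285.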
