On Tue, Nov 29, 2011 at 10:37 AM, Fredrik Heffer Valdmanis <fredva at ifi.uio.no> wrote:

> 2011/11/29 Matthew Knepley <knepley at gmail.com>:
>
>> On Tue, Nov 29, 2011 at 2:38 AM, Fredrik Heffer Valdmanis <fredva at ifi.uio.no> wrote:
>>
>>> 2011/10/28 Matthew Knepley <knepley at gmail.com>:
>>>
>>>> On Fri, Oct 28, 2011 at 10:24 AM, Fredrik Heffer Valdmanis <fredva at ifi.uio.no> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am working on integrating the new GPU-based vectors and matrices
>>>>> into FEniCS. Now I am looking at the possibility of getting some
>>>>> speedup during finite element assembly, specifically when inserting
>>>>> the local element matrix into the global matrix. In that regard, I
>>>>> have a few questions I hope you can help me out with:
>>>>>
>>>>> - When calling MatSetValues with a MATSEQAIJCUSP matrix as a
>>>>> parameter, what exactly happens? As far as I can see, MatSetValues
>>>>> is not implemented for GPU-based matrices, nor does
>>>>> mat->ops->setvalues point to any function for this Mat type.
>>>>
>>>> Yes, MatSetValues always operates on the CPU side. It would not make
>>>> sense to do individual insertions on the GPU.
>>>>
>>>> I have written batched assembly for element matrices that are all
>>>> the same size:
>>>>
>>>> http://www.mcs.anl.gov/petsc/petsc-as/snapshots/petsc-current/docs/manualpages/Mat/MatSetValuesBatch.html
>>>>
>>>>> - Are matrices assembled in their entirety on the CPU and then
>>>>> copied over to the GPU (after calling MatAssemblyBegin)? Or are
>>>>> values copied over to the GPU each time you call MatSetValues?
>>>>
>>>> That function assembles the matrix on the GPU and then copies it to
>>>> the CPU. The only time you do not want this copy is when you are
>>>> running in serial and never touch the matrix afterwards, so I left
>>>> it in.
>>>>
>>>>> - Can we expect any speedup from using MatSetValuesBatch over
>>>>> MatSetValues, or is the batch version simply a utility function?
>>>>> This question goes for both CPU- and GPU-based matrices.
>>>>
>>>> CPU: no.
>>>>
>>>> GPU: yes; I see a speedup of roughly the memory bandwidth ratio.
>>>
>>> Hi,
>>>
>>> I have now integrated MatSetValuesBatch into our existing PETSc
>>> wrapper layer. I have tested matrix assembly with Poisson's equation
>>> on different meshes with elements of varying order. I have timed the
>>> single call to MatSetValuesBatch and compared that to the total time
>>> consumed by the repeated calls to MatSetValues in the old
>>> implementation. I have the following results:
>>>
>>> Poisson on 1000x1000 unit square, 1st-order Lagrange elements:
>>> MatSetValuesBatch: 0.88576 s
>>> repeated calls to MatSetValues: 0.76654 s
>>>
>>> Poisson on 500x500 unit square, 2nd-order Lagrange elements:
>>> MatSetValuesBatch: 0.9324 s
>>> repeated calls to MatSetValues: 0.81644 s
>>>
>>> Poisson on 300x300 unit square, 3rd-order Lagrange elements:
>>> MatSetValuesBatch: 0.93988 s
>>> repeated calls to MatSetValues: 1.03884 s
>>>
>>> As you can see, the two methods take almost the same amount of time.
>>> What behavior and performance should we expect? Is there any way to
>>> optimize the performance of batched assembly?
>>
>> Almost certainly it is not dispatching to the CUDA version. The
>> regular version just calls MatSetValues() in a loop. Are you using a
>> SEQAIJCUSP matrix?
>
> Yes. The same matrices yield a speedup of 4-6x when solving the system
> on the GPU.

Please confirm that the correct routine is used by running with -info
and sending the output. Please also send the output of -log_summary so
I can confirm the results. You can run KSP ex4 and reproduce my
results, where I see a 5.5x speedup on the GTX 285.

   Matt

>>> I also have a problem with Thrust throwing std::bad_alloc on some
>>> calls to MatSetValuesBatch. The exception originates in
>>> thrust::device_ptr<void> thrust::detail::device::cuda::malloc<0u>(unsigned long).
>>> It seems to be thrown when the number of double values I send to
>>> MatSetValuesBatch approaches 30 million. I am testing this on a
>>> laptop with 4 GB RAM and a GeForce 540M (1 GB memory), so 30 million
>>> doubles are far from exhausting my memory, on both the host and
>>> device side. Any clues as to what causes this problem and how to
>>> avoid it?
>>
>> It uses more memory than just the values. I would have to look at the
>> specific case, but I assume that the memory is exhausted.
>
> OK, I can look further into it myself as well. Thanks,
>
> Fredrik

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener
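[Editor's note: the data layout MatSetValuesBatch consumes — all same-size element matrices packed into one flat values array, with one flat array of global row indices — may be easier to see in code. The sketch below is a plain-CPU illustration of that layout and of the scatter-add the batched interface performs; `batch_add` and the dense global storage are hypothetical simplifications for illustration, not PETSc's implementation.]

```c
#include <stddef.h>

/* Scatter-add "ne" element matrices, each "nl" x "nl", into a dense
 * n x n global matrix stored row-major in "global".
 *
 * "rows" holds the ne*nl global indices, element after element, and
 * "vals" holds the ne*nl*nl element-matrix entries in the same order --
 * the packed layout a batched interface like MatSetValuesBatch takes,
 * here applied with a simple CPU loop instead of a GPU sort/reduce. */
void batch_add(size_t ne, size_t nl, const int *rows,
               const double *vals, double *global, size_t n)
{
    for (size_t e = 0; e < ne; ++e)
        for (size_t i = 0; i < nl; ++i)
            for (size_t j = 0; j < nl; ++j)
                global[(size_t)rows[e*nl + i] * n + rows[e*nl + j]]
                    += vals[(e*nl + i)*nl + j];
}
```

For example, two 1D linear elements sharing node 1 (rows {0,1} and {1,2}, each with the element stiffness [[1,-1],[-1,1]]) assemble the familiar tridiagonal matrix with diagonal 1, 2, 1.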

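[Editor's note: a back-of-envelope estimate supports the "more memory than just the values" explanation for the bad_alloc. The multipliers below are assumptions, not PETSc's actual allocation pattern: they suppose the GPU path holds a COO triplet (double value plus 32-bit row and column index) per entry, and that the Thrust sort-by-key needs a second, same-size buffer. Under those assumptions, 30 million doubles already imply roughly 960 MB of peak device memory, which would exhaust a 1 GB card.]

```c
/* Rough peak device-memory estimate for batched COO assembly of
 * "nvals" double entries.  Assumed (not measured) breakdown:
 *   - the values themselves:        8 bytes each
 *   - 32-bit row + column indices:  2 * 4 bytes each
 *   - a double buffer for the radix sort-by-key: one more full copy */
long long batch_peak_bytes(long long nvals)
{
    long long vals    = nvals * 8;      /* 30e6 doubles -> 240 MB      */
    long long indices = nvals * 2 * 4;  /* row + col index -> 240 MB   */
    long long sortbuf = vals + indices; /* sort double buffer -> 480 MB */
    return vals + indices + sortbuf;    /* ~960 MB peak                */
}
```

By this estimate the 1 GB GeForce 540M (part of which is already taken by the display and by CUDA context overhead) plausibly runs out right around the observed 30-million-entry threshold.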