On Thu, May 2, 2013 at 3:29 PM, Florian Rathgeber <florian.rathgeber at gmail.com> wrote:
> On 02/05/13 03:12, Matthew Knepley wrote:
> > On Wed, May 1, 2013 at 8:52 PM, Karl Rupp <rupp at mcs.anl.gov> wrote:
> >
> > Hi Florian,
> >
> > > This is loosely a follow up to [1]. In this thread a few potential
> > > ways for making GPU assembly work with PETSc were discussed, and to
> > > me the two most promising appeared to be:
> > > 1) Create a PETSc matrix from a pre-assembled CSR structure, or
> > > 2) Preallocate a PETSc matrix and get the handle to pass the row
> > > pointer, column indices and values array to a custom assembly
> > > routine.
> >
> > I still consider these two to be the most promising (and general)
> > approaches. On the other hand, to my knowledge the infrastructure
> > hasn't changed a lot since then. Some additional functionality from
> > CUSPARSE was added, while I added ViennaCL bindings to branch 'next'
> > (i.e. there are still a few corners to polish). This means that you
> > could technically use the much more jit-friendly OpenCL (and, as a
> > follow-up, complain to NVIDIA and AMD about the higher latencies
> > compared to CUDA).
> >
> > > We compute local assembly matrices on the GPU and a crucial
> > > requirement is that the matrix *only* lives in device memory; we
> > > want to avoid any host <-> device data transfers.
> >
> > One of the reasons why - despite its attractiveness - this hasn't
> > taken off is that good preconditioners are typically still required
> > in such a setting. Other than the smoothed aggregation in CUSP, there
> > is not much which does *not* require a copy to the host. Particularly
> > when thinking about multi-GPU you're entering the regime where a good
> > preconditioner on the CPU will still outperform a GPU assembly with a
> > poor preconditioner.
> >
> > > So far we have been using CUSP with a custom (generated) assembly
> > > into our own CUSP-compatible CSR data structure for a single GPU.
> > > Since CUSP doesn't give us multi-GPU solvers out of the box, we'd
> > > rather use existing infrastructure that works than roll our own.
> >
> > I guess this is good news for you: Steve Dalton will work with us
> > during the summer to extend the CUSP SA-AMG to distributed memory.
> > Other than that, I think there's currently only the functionality
> > from CUSPARSE and polynomial preconditioners, available through the
> > txpetscgpu package.
> >
> > Aside from that I also have a couple of plans on that front spinning
> > in my head, yet I haven't found the time to implement them yet.
> >
> > > At the time of [1], supporting GPU assembly in one form or the
> > > other was on the roadmap, but the implementation direction seemed
> > > not to have been finally decided. Has there been any progress since
> > > then, or is there anything to add to the discussion? Is there even
> > > (experimental) code we might be able to use? Note that we're using
> > > petsc4py to interface to PETSc.
> >
> > Did you have a look at snes/examples/tutorials/ex52? I'm currently
> > converting/extending this to OpenCL, so it serves as a playground for
> > a future interface. Matt might have some additional comments on this.
> >
> > I like to be very precise in the terminology. Doing the cell
> > integrals on the GPU (integration) is worthwhile, whereas inserting
> > the element matrices into a global representation like CSR (assembly)
> > takes no time and can be done almost any way, including on the CPU. I
> > stopped working on assembly because it made no difference.
> The actual insertion (as in MatSetValues) may not take up much time on
> either the CPU or the GPU, provided it is done where the integration
> was done. As I mentioned before, we do both the integration and the
> solve on the GPU. We don't even allocate data in host memory. Therefore
> it wouldn't make much sense to do the addto on the host, since it would
> require a device -> host transfer of all the cell integrals and a host
> -> device transfer of the CSR, which would make it quite expensive.
>
> One option we considered was creating a MatShell and providing an SpMV
> callback, probably calling a CUSP kernel on each MPI rank. That
> restricts the available preconditioners, but as mentioned, without
> doing any data transfers we'd be restricted to GPU-only preconditioners
> anyway. Any thoughts on this compared to the strategies mentioned
> above?

What about just creating your CUSP matrix and then shoving it into a
MATAIJCUSP? That is what I did for my assembly tests. For GPU-only
preconditioners, I would focus on the CUSP AMG using Chebyshev for the
smoothers. Sketches of both the MatShell route and this one follow
below.

   Matt

> Thanks,
> Florian
>
> > Thanks,
> >
> >    Matt
> >
> > Best regards,
> > Karli

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead. -- Norbert Wiener
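A minimal sketch of the MatShell route, for reference. Only
MatCreateShell() and MatShellSetOperation() are real PETSc API here;
GPUCtx and the device SpMV call it stands in for are placeholders for
your own CUSP data structure and kernel.

#include <petscmat.h>

/* Placeholder for the application's device-resident CSR data,
 * e.g. a pointer to a cusp::csr_matrix living on the GPU. */
typedef struct {
  void *dev_csr;
} GPUCtx;

/* SpMV callback: PETSc calls this whenever it needs y = A*x. */
static PetscErrorCode GPUShellMult(Mat A, Vec x, Vec y)
{
  GPUCtx        *ctx;
  PetscErrorCode ierr;

  ierr = MatShellGetContext(A, (void **)&ctx);CHKERRQ(ierr);
  /* Launch the device SpMV kernel on ctx->dev_csr here; as a stub we
   * just copy x to y so that the sketch compiles and runs. */
  ierr = VecCopy(x, y);CHKERRQ(ierr);
  return 0;
}

PetscErrorCode CreateGPUShell(MPI_Comm comm, PetscInt mlocal,
                              PetscInt nlocal, GPUCtx *ctx, Mat *A)
{
  PetscErrorCode ierr;

  ierr = MatCreateShell(comm, mlocal, nlocal, PETSC_DETERMINE,
                        PETSC_DETERMINE, ctx, A);CHKERRQ(ierr);
  ierr = MatShellSetOperation(*A, MATOP_MULT,
                              (void (*)(void))GPUShellMult);CHKERRQ(ierr);
  return 0;
}

As Florian says, the operator is then opaque to PETSc, so only methods
that need nothing beyond MatMult (plus whatever other MATOP_* callbacks
you provide, e.g. MATOP_GET_DIAGONAL for Jacobi) remain usable.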
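And the MATAIJCUSP alternative in code form. This is a sketch under
assumptions: the type strings "aijcusp" (MATAIJCUSP) and "sacusp" (the
CUSP smoothed-aggregation preconditioner) are the ones I believe the
CUSP backend registers, so check your PETSc version for the exact
names. The Chebyshev smoother would be configured inside the AMG
hierarchy rather than in this function.

#include <petscksp.h>

/* Solve A x = b with CG preconditioned by CUSP's smoothed aggregation,
 * keeping everything on the GPU.  A is assumed to have been created
 * with MatSetType(A, "aijcusp") and assembled already. */
PetscErrorCode SolveOnGPU(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PC             pc;
  PetscErrorCode ierr;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);
  ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, "sacusp");CHKERRQ(ierr); /* CUSP SA-AMG */
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  return 0;
}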
