On Thu, Jun 6, 2013 at 12:17 PM, Florian Rathgeber <[email protected]> wrote:
> On 02/05/13 21:35, Matthew Knepley wrote:
> > On Thu, May 2, 2013 at 3:29 PM, Florian Rathgeber <[email protected]> wrote:
> >
> > > On 02/05/13 03:12, Matthew Knepley wrote:
> > > > On Wed, May 1, 2013 at 8:52 PM, Karl Rupp <[email protected]> wrote:
> > > >
> > > > > Hi Florian,
> > > > >
> > > > > > This is loosely a follow-up to [1]. In this thread a few potential ways for making GPU assembly work with PETSc were discussed, and to me the two most promising appeared to be:
> > > > > > 1) Create a PETSc matrix from a pre-assembled CSR structure, or
> > > > > > 2) Preallocate a PETSc matrix and get the handle to pass the row pointer, column indices and values array to a custom assembly routine.
> > > > >
> > > > > I still consider these two to be the most promising (and general) approaches. On the other hand, to my knowledge the infrastructure hasn't changed a lot since then. Some additional functionality from CUSPARSE was added, while I added ViennaCL bindings to branch 'next' (i.e. still a few corners to polish). This means that you could technically use the much more JIT-friendly OpenCL (and, as a follow-up, complain to NVIDIA and AMD about the higher latencies than with CUDA).
> > > > >
> > > > > > We compute local assembly matrices on the GPU, and a crucial requirement is that the matrix *only* lives in device memory; we want to avoid any host <-> device data transfers.
> > > > >
> > > > > One of the reasons why - despite its attractiveness - this hasn't taken off is that good preconditioners are typically still required in such a setting. Other than the smoothed aggregation in CUSP, there is not much which does *not* require a copy to the host. Particularly when thinking about multi-GPU, you're entering the regime where a good preconditioner on the CPU will still outperform a GPU assembly with a poor preconditioner.
> > > > >
> > > > > > So far we have been using CUSP with a custom (generated) assembly into our own CUSP-compatible CSR data structure for a single GPU. Since CUSP doesn't give us multi-GPU solvers out of the box, we'd rather use existing infrastructure that works than roll our own.
> > > > >
> > > > > I guess this is good news for you: Steve Dalton will work with us during the summer to extend the CUSP-SA-AMG to distributed memory. Other than that, I think there's currently only the functionality from CUSPARSE and polynomial preconditioners, available through the txpetscgpu package.
> > > > >
> > > > > Aside from that I also have a couple of plans on that front spinning in my head, yet I couldn't find the time to implement them yet.
> > > > >
> > > > > > At the time of [1], supporting GPU assembly in one form or the other was on the roadmap, but the implementation direction seemed not to have been finally decided. Was there any progress since then, or anything to add to the discussion? Is there even (experimental) code we might be able to use? Note that we're using petsc4py to interface to PETSc.
> > > > >
> > > > > Did you have a look at snes/examples/tutorials/ex52? I'm currently converting/extending this to OpenCL, so it serves as a playground for a future interface. Matt might have some additional comments on this.
> > > >
> > > > I like to be very precise in the terminology. Doing the cell integrals on the GPU (integration) is worthwhile, whereas inserting the element matrices into a global representation like CSR (assembly) takes no time and can be done almost any way, including on the CPU. I stopped working on assembly because it made no difference.
> > >
> > > The actual insertion (as in MatSetValues) may not take up much time on either the CPU or the GPU, provided it is done where the integration was done. As I mentioned before, we do both the integration and the solve on the GPU. We don't even allocate data in host memory. Therefore it wouldn't make much sense to do the addto on the host, since it would require device -> host transfer of all the cell integrals and host -> device transfer of the CSR, which would make it quite expensive.
> > >
> > > One option we considered was creating a MatShell and providing an SpMV callback, probably calling a CUSP kernel on each MPI rank. That restricts the available preconditioners, but as mentioned, without doing any data transfers we'd be restricted to GPU-only preconditioners anyway. Any thoughts on this compared to the strategies mentioned above?
> >
> > What about just creating your CUSP matrix and then shoving it into a MATAIJCUSP? That is what I did for my assembly tests.
>
> That'd be the ideal solution. Does this work with MPIAIJ? We're only really interested in multi-GPU with MPI. In the sequential case we can just call CUSP directly, but for the MPI-distributed case we'd rather rely on PETSc to help us out.

You would have to create the diagonal and off-diagonal matrices yourself.
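Roughly, that could look like the sketch below (untested; the array names are placeholders, and note that both constructors take host arrays, so on their own they do not keep everything on the device):

/* Sketch of wrapping an already-assembled CSR structure in a PETSc matrix.
   Error checking (CHKERRQ) is omitted for brevity. */
#include <petscmat.h>

/* Sequential case: wrap a local CSR directly (the arrays are not copied). */
PetscErrorCode WrapSeqCSR(PetscInt m, PetscInt n, PetscInt *rowptr,
                          PetscInt *colidx, PetscScalar *vals, Mat *A)
{
  return MatCreateSeqAIJWithArrays(PETSC_COMM_SELF, m, n, rowptr, colidx, vals, A);
}

/* Parallel case: hand PETSc the diagonal and off-diagonal blocks of the
   locally owned rows separately (see the man page for the exact index
   conventions of the off-diagonal block). */
PetscErrorCode WrapMPICSR(MPI_Comm comm, PetscInt m, PetscInt n,
                          PetscInt M, PetscInt N,
                          PetscInt *di, PetscInt *dj, PetscScalar *da,
                          PetscInt *oi, PetscInt *oj, PetscScalar *oa, Mat *A)
{
  return MatCreateMPIAIJWithSplitArrays(comm, m, n, M, N,
                                        di, dj, da, oi, oj, oa, A);
}

In the sequential case this is essentially option 1) from the original mail: the matrix is created directly from the pre-assembled CSR arrays.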
> Presumably you're referring to the experiments you did for the TOMS paper? Is that code available somewhere?

No, it's for the TOMS paper I did not write, because I thought the result was not interesting enough. The code is in PETSc.

> > For GPU-only preconditioners, I would focus on the CUSP AMG, using Chebyshev for the smoothers.
>
> OK. Again, we'd have to create our own PCShell for this when using a MatShell, if I understand correctly?

I don't think so, since Cheby just uses a matrix action.
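For what it's worth, here is a rough (untested) sketch of that combination: a MatShell whose MatMult launches your device SpMV, driven by a Chebyshev KSP with no preconditioner. UserCtx and MyDeviceSpMV are placeholders for your own CUSP code, and error checking (CHKERRQ) is omitted.

#include <petscksp.h>

typedef struct {
  void *dev_csr;   /* handles to the device CSR data, streams, etc. */
} UserCtx;

static PetscErrorCode ShellMult(Mat A, Vec x, Vec y)
{
  UserCtx *ctx;
  MatShellGetContext(A, (void **)&ctx);
  /* MyDeviceSpMV(ctx->dev_csr, x, y);  launch the CUSP kernel on this rank's rows */
  return 0;
}

PetscErrorCode SolveWithChebyshev(MPI_Comm comm, UserCtx *ctx,
                                  PetscInt mlocal, PetscInt M, Vec b, Vec x)
{
  Mat A;
  KSP ksp;
  PC  pc;

  MatCreateShell(comm, mlocal, mlocal, M, M, ctx, &A);
  MatShellSetOperation(A, MATOP_MULT, (void (*)(void))ShellMult);

  KSPCreate(comm, &ksp);
  KSPSetOperators(ksp, A, A);      /* older PETSc versions take an extra MatStructure argument */
  KSPSetType(ksp, KSPCHEBYSHEV);   /* Chebyshev only ever calls MatMult on A */
  KSPGetPC(ksp, &pc);
  PCSetType(pc, PCNONE);           /* no preconditioner application needed, hence no PCShell */
  /* eigenvalue bounds for Chebyshev can be set with KSPChebyshevSetEigenvalues()
     or estimated via the options database */
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp);
  MatDestroy(&A);
  return 0;
}

Inside a PETSc multigrid hierarchy (PCMG/PCGAMG) the same kind of smoother is normally selected through the options database, e.g. -mg_levels_ksp_type chebyshev.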

   Matt

> Florian

-- 
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
-- Norbert Wiener