Hi Florian,

> This is loosely a follow up to [1]. In this thread a few potential
> ways for making GPU assembly work with PETSc were discussed, and to
> me the two most promising appeared to be:
> 1) Create a PETSc matrix from a pre-assembled CSR structure, or
> 2) Preallocate a PETSc matrix and get the handle to pass the row
> pointer, column indices and values array to a custom assembly
> routine.

I still consider these two to be the most promising (and general)
approaches.
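For reference, option 1) is already reachable from petsc4py for
host-side data. Below is only a minimal sketch of that variant (the
calls are standard petsc4py; the CSR data is made up) - a
device-resident equivalent is exactly the piece that is still missing:

  import numpy as np
  from petsc4py import PETSc

  n = 4
  # CSR arrays for a small 1D Laplacian (illustrative data only)
  indptr  = np.asarray([0, 2, 5, 8, 10], dtype=PETSc.IntType)
  indices = np.asarray([0, 1, 0, 1, 2, 1, 2, 3, 2, 3],
                       dtype=PETSc.IntType)
  values  = np.asarray([2, -1, -1, 2, -1, -1, 2, -1, -1, 2],
                       dtype=PETSc.ScalarType)

  # wrap the CSR triple in a (host-memory) AIJ matrix
  A = PETSc.Mat().createAIJWithArrays((n, n),
                                      (indptr, indices, values))
  A.assemble()

Option 2) amounts to grabbing the internal arrays of a preallocated
AIJ matrix (MatSeqAIJGetArray() and friends on the C side) and
filling them in place.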
On the other hand, to my knowledge the infrastructure hasn't changed
much since then. Some additional functionality from CUSPARSE was
added, and I added ViennaCL bindings to branch 'next' (i.e. there are
still a few corners to polish). This means that you could technically
use the much more jit-friendly OpenCL (and, as a follow-up, complain
to NVIDIA and AMD about the higher latencies compared to CUDA).

> We compute local assembly matrices on the GPU and a crucial
> requirement is that the matrix *only* lives in device memory; we
> want to avoid any host <-> device data transfers.

One of the reasons why - despite its attractiveness - this hasn't
taken off is that good preconditioners are typically still required
in such a setting. Other than the smoothed aggregation in CUSP, there
is not much that does *not* require a copy to the host. Particularly
when thinking about multi-GPU, you're entering the regime where a
good preconditioner on the CPU will still outperform a GPU assembly
paired with a poor preconditioner.

> So far we have been using CUSP with a custom (generated) assembly
> into our own CUSP-compatible CSR data structure for a single GPU.
> Since CUSP doesn't give us multi-GPU solvers out of the box, we'd
> rather build on existing infrastructure than roll our own.

I guess this is good news for you: Steve Dalton will work with us
during the summer to extend the CUSP SA-AMG to distributed memory.
Other than that, I think there's currently only the functionality
from CUSPARSE and the polynomial preconditioners available through
the txpetscgpu package. I also have a couple of plans on that front
spinning in my head, but I haven't found the time to implement them
yet.
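To illustrate what "staying on the device" looks like today: with a
CUSP-enabled PETSc build, the GPU-backed matrix/vector types and the
CUSP smoothed aggregation preconditioner can be selected from
petsc4py roughly as sketched below. Take the type names ('aijcusp',
'sacusp') as a snapshot of current petsc-dev; they may well change:

  from petsc4py import PETSc

  # convert the host AIJ matrix from the sketch above to the
  # CUSP-backed type (requires a PETSc build configured with CUSP)
  A = A.convert('aijcusp')

  ksp = PETSc.KSP().create()
  ksp.setOperators(A)
  ksp.setType('cg')
  ksp.getPC().setType('sacusp')   # CUSP smoothed aggregation

  x, b = A.createVecs()           # work vectors of matching type
  b.set(1.0)
  ksp.solve(b, x)

Running with -ksp_view is a quick way to confirm which matrix and
vector types actually end up being used.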
> At the time of [1], supporting GPU assembly in one form or the
> other was on the roadmap, but the implementation direction did not
> seem to have been finalized. Was there any progress since then, or
> anything to add to the discussion? Is there even (experimental)
> code we might be able to use? Note that we're using petsc4py to
> interface to PETSc.

Did you have a look at snes/examples/tutorials/ex52? I'm currently
converting/extending it to OpenCL, so it serves as a playground for a
future interface. Matt might have some additional comments on this.

Best regards,
Karli