Hi Dave,

> I'm not aware of any polynomial preconditioners for the gpu available in
> petsc with or without the txpetscgpu package. I'd love to try them out if
> they were though and would love to hear that I am wrong.
Hmm, Paul mentioned the following paper a couple of weeks back:
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6319205&contentType=Conference+Publications
from which I concluded that this is already part of the txpetscgpu
package. Paul, this is the case, isn't it?

Best regards,
Karli

> ________________________________________
> From: petsc-dev-bounces at mcs.anl.gov [petsc-dev-bounces at mcs.anl.gov]
> on behalf of Karl Rupp [rupp at mcs.anl.gov]
> Sent: Wednesday, May 01, 2013 7:52 PM
> To: petsc-dev at mcs.anl.gov
> Subject: Re: [petsc-dev] PETSc multi-GPU assembly - current status
>
> Hi Florian,
>
>> This is loosely a follow up to [1]. In this thread a few potential ways
>> for making GPU assembly work with PETSc were discussed and to me the two
>> most promising appeared to be:
>> 1) Create a PETSc matrix from a pre-assembled CSR structure, or
>> 2) Preallocate a PETSc matrix and get the handle to pass the row
>> pointer, column indices and values array to a custom assembly routine.
>
> I still consider these two to be the most promising (and general)
> approaches. On the other hand, to my knowledge the infrastructure hasn't
> changed a lot since then. Some additional functionality from CUSPARSE
> was added, while I added ViennaCL bindings to branch 'next' (i.e. still
> a few corners to polish). This means that you could technically use the
> much more jit-friendly OpenCL (and, as a follow-up, complain to NVIDIA
> and AMD about the higher latencies than with CUDA).
>
>> We compute local assembly matrices on the GPU and a crucial requirement
>> is that the matrix *only* lives in device memory; we want to avoid any
>> host <-> device data transfers.
>
> One of the reasons why - despite its attractiveness - this hasn't taken
> off is that good preconditioners are typically still required in such a
> setting. Other than the smoothed aggregation in CUSP, there is not much
> which does *not* require a copy to the host.
> Particularly when thinking about multi-GPU you're entering the regime
> where a good preconditioner on the CPU will still outperform a GPU
> assembly with a poor preconditioner.
>
>> So far we have been using CUSP with a custom (generated) assembly into
>> our own CUSP-compatible CSR data structure for a single GPU. Since CUSP
>> doesn't give us multi-GPU solvers out of the box, we'd rather use
>> existing infrastructure that works than roll our own.
>
> I guess this is good news for you: Steve Dalton will work with us during
> the summer to extend the CUSP-SA-AMG to distributed memory. Other than
> that, I think there's currently only the functionality from CUSPARSE and
> polynomial preconditioners, available through the txpetscgpu package.
>
> Aside from that I also have a couple of plans on that front spinning in
> my head, yet I couldn't find the time to implement them yet.
>
>> At the time of [1] supporting GPU assembly in one form or the other was
>> on the roadmap, but the implementation direction seemed not to have been
>> finally decided. Was there any progress since then, or anything to add
>> to the discussion? Is there even (experimental) code we might be able
>> to use? Note that we're using petsc4py to interface to PETSc.
>
> Did you have a look at snes/examples/tutorials/ex52? I'm currently
> converting/extending this to OpenCL, so it serves as a playground for a
> future interface. Matt might have some additional comments on this.
>
> Best regards,
> Karli
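For reference, approach 1) from the quoted thread boils down to handing a
pre-assembled CSR triplet (row pointer, column indices, values) to PETSc.
Below is a minimal plain-Python sketch of that layout; the `dense_to_csr`
helper is mine (not from the thread), and the petsc4py call at the end is
an untested, indicative sketch only.

```python
# Build the CSR triplet for a small sparse matrix -- the same three arrays
# that approach 1) would hand to PETSc, or that approach 2) would fill in
# place after preallocation.
#
# Dense view of the example matrix:
#   [[10,  0,  2],
#    [ 0,  3,  0],
#    [ 1,  0,  4]]

def dense_to_csr(dense):
    """Convert a dense row-major matrix into (row_ptr, col_idx, values)."""
    row_ptr, col_idx, values = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                col_idx.append(j)
                values.append(v)
        row_ptr.append(len(values))  # running count of stored nonzeros
    return row_ptr, col_idx, values

dense = [[10, 0, 2],
         [0, 3, 0],
         [1, 0, 4]]
row_ptr, col_idx, values = dense_to_csr(dense)

print(row_ptr)  # [0, 2, 3, 5]
print(col_idx)  # [0, 2, 1, 0, 2]
print(values)   # [10, 2, 3, 1, 4]

# With petsc4py, the same triplet would be passed along these lines
# (untested sketch, assuming petsc4py is installed and initialized):
#
#   from petsc4py import PETSc
#   A = PETSc.Mat().createAIJ(size=(3, 3),
#                             csr=(row_ptr, col_idx, values))
```

In the GPU-assembly setting discussed above, these three arrays would of
course live in device memory rather than in Python lists; the sketch only
shows the data layout being handed over.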
