Hey,
>>> Is there a way to run a calculation with 4*N MPI tasks, where
my matrix is first built outside PETSc, and then solve the
linear system using PETSc Mat, Vec, KSP on only N MPI
tasks in order to efficiently address the N GPUs?
As far as I can tell, this should be possible with a suitable
subcommunicator. The tricky piece, however, is to select the right MPI
ranks for this. Note that you generally have no guarantee of how the
MPI ranks are distributed across the nodes, so be prepared for
something fairly specific to your MPI installation.
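Just to illustrate what I have in mind (a rough sketch only: the
"node-local rank 0 drives a GPU" selection below is purely an example,
and MPI_Comm_split_type() requires an MPI-3 implementation; adapt the
selection to your machine):

#include <mpi.h>
#include <petscsys.h>

int main(int argc, char **argv)
{
  MPI_Comm node_comm, petsc_comm;
  int      world_rank, node_rank;

  /* Initialize MPI ourselves so that PETSc only ever sees the
     subcommunicator built below. */
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* Group the ranks sharing a node, then pick one rank per node
     (node-local rank 0) as the GPU-driving rank - example only. */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                      MPI_INFO_NULL, &node_comm);
  MPI_Comm_rank(node_comm, &node_rank);
  MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                 world_rank, &petsc_comm);

  if (petsc_comm != MPI_COMM_NULL) {
    PETSC_COMM_WORLD = petsc_comm;  /* set BEFORE PetscInitialize() */
    PetscInitialize(&argc, &argv, NULL, NULL);
    /* ... Mat/Vec/KSP work on the selected N ranks only ... */
    PetscFinalize();
    MPI_Comm_free(&petsc_comm);
  } else {
    /* ... the remaining ranks never touch PETSc ... */
  }

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}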
Yes, I am ready to face this point too.
Okay, good to know that you are aware of this.
I also started the work with a purely CPU-based solve, just as a test,
but without success. I then read this:
"If you wish PETSc code to run ONLY on a subcommunicator of
MPI_COMM_WORLD, create that communicator first and assign it to
PETSC_COMM_WORLD
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PETSC_COMM_WORLD.html#PETSC_COMM_WORLD>
BEFORE calling PetscInitialize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>().
Thus if you are running a four process job and two processes will run
PETSc and have PetscInitialize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>()
and PetscFinalize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize>()
and two process will not, then do this. If ALL processes in
the job are using PetscInitialize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>()
and PetscFinalize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize>()
then you don't need to do this, even if different subcommunicators of
the job are doing different things with PETSc."
I don't think I am in this special scenario: since my matrix is
initially partitioned across 4 processes, I need to call
PetscInitialize() on all 4 processes in order to build the PETSc matrix
with MatSetValues(). My goal is then to solve the linear system on only
2 processes... So will building a sub-communicator really do the trick?
Or am I missing something?
Oh, then I misunderstood your question. I thought you wanted to run
*your* code on 4*N procs and never let PETSc see more than N procs when
feeding in the matrix.
What you could do with 4*N procs for PETSc is to define your own matrix
layout, where only one out of four processes actually owns a part of
the matrix. After MatAssemblyBegin()/MatAssemblyEnd() the full data gets
correctly transferred to those N procs, with the other 3*N procs being
'empty'. You should then be able to run the solver on all 4*N
processes, but only N of them actually do the work on the GPUs.
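In code, such a layout could look roughly like this (only a sketch: the
helper name is mine, and I assume a square matrix whose global size
nglobal is evenly divisible by the number of owning ranks; adjust the
local sizes to your actual partitioning):

#include <petscmat.h>

/* All 4*N ranks participate in the matrix, but only every fourth rank
   owns rows; the remaining ranks get a local block of size 0. */
PetscErrorCode create_unevenly_owned_matrix(PetscInt nglobal, Mat *A)
{
  PetscErrorCode ierr;
  PetscMPIInt    rank, size;
  PetscInt       nlocal;

  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
  ierr = MPI_Comm_size(PETSC_COMM_WORLD, &size);CHKERRQ(ierr);

  /* Nonzero local size only on every fourth rank (example split). */
  nlocal = (rank % 4 == 0) ? nglobal / (size / 4) : 0;

  ierr = MatCreate(PETSC_COMM_WORLD, A);CHKERRQ(ierr);
  ierr = MatSetSizes(*A, nlocal, nlocal, PETSC_DETERMINE, PETSC_DETERMINE);CHKERRQ(ierr);
  ierr = MatSetFromOptions(*A);CHKERRQ(ierr);  /* e.g. pick a GPU type via -mat_type */
  ierr = MatSetUp(*A);CHKERRQ(ierr);

  /* Every rank may now call MatSetValues() with the entries it holds;
     MatAssemblyBegin()/MatAssemblyEnd() ships them to the owning
     ranks. */
  return 0;
}

After assembly, MatGetOwnershipRange() tells each rank which rows it
ended up owning, which you can reuse to lay out the right-hand side
vector in the same uneven way.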
A different question is whether you actually need all 4*N MPI ranks for
the system assembly. You can make your life a lot easier if you only run
with N MPI ranks in the first place, particularly if the performance
gain from going N -> 4*N procs in the assembly stage is small relative
to the time spent in the solver. This may well be the case for memory
bandwidth limited applications, where one process can already utilize
most of the available bandwidth. Either way, a test run with N procs
will give you a good profiling baseline for judging whether you can
expect any overall performance gain from GPUs in the solver stage. It
may well be that you get faster solver times with some fancy multigrid
preconditioning techniques in a purely CPU-based run that are
unavailable on GPUs. Also, your system size needs to be sufficiently
large (100k unknowns per GPU as a rule of thumb) to hide PCI-Express
latencies.
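If it helps for that baseline: a minimal solver driver that leaves all
solver choices to the command line could look roughly like the sketch
below (the helper name is mine, and depending on your PETSc version
KSPSetOperators() may take an additional MatStructure argument). That
way, e.g. a CPU multigrid run (-pc_type gamg) and a run with the GPU
matrix/vector types of your build (-mat_type / -vec_type) can be
profiled with the same code via -log_summary / -log_view:

#include <petscksp.h>

/* Solve A x = b with the KSP and PC chosen at runtime. */
PetscErrorCode solve_system(Mat A, Vec b, Vec x)
{
  PetscErrorCode ierr;
  KSP            ksp;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);  /* -ksp_type, -pc_type, ... */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  return 0;
}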
Best regards,
Karli