On 08/20/14 13:14, Karl Rupp wrote:
Hey,

>>> Is there a way to run a calculation with 4*N MPI tasks where
>>> my matrix is first built outside PETSc, and then solve the
>>> linear system using PETSc Mat, Vec, KSP on only N MPI
>>> tasks, so as to address the N GPUs efficiently?

as far as I can tell, this should be possible with a suitable
subcommunicator. The tricky piece, however, is to select the right MPI
ranks for this. Note that you generally have no guarantee on how the
MPI ranks are distributed across the nodes, so be prepared for
something fairly specific to your MPI installation.
Yes, I am ready to face this point too.

Okay, good to know that you are aware of this.
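(To make the subcommunicator idea concrete, here is a minimal MPI-only sketch. It assumes that every fourth rank of MPI_COMM_WORLD is the one that should drive a GPU; the rank % 4 policy is only an assumption, and whether it matches the physical node layout is, as noted above, specific to the MPI installation.)

/* Sketch: build a subcommunicator containing one out of every four
 * ranks of MPI_COMM_WORLD. The "rank % 4" policy is an assumption;
 * how ranks map to nodes depends on the MPI installation and launcher. */
#include <mpi.h>

int main(int argc, char **argv)
{
  int      world_rank, color;
  MPI_Comm gpu_comm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* color 0: ranks that will own matrix data and talk to a GPU;
     color 1: the remaining ranks */
  color = (world_rank % 4 == 0) ? 0 : 1;
  MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &gpu_comm);

  /* ... hand gpu_comm (on the color-0 ranks) to PETSc, see below ... */

  MPI_Comm_free(&gpu_comm);
  MPI_Finalize();
  return 0;
}

Using world_rank as the key in MPI_Comm_split keeps the MPI_COMM_WORLD ordering inside each subcommunicator.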



I also started on a purely CPU-based solve, just as a test, but
without success. When I read this:

"If you wish PETSc code to run ONLY on a subcommunicator of
MPI_COMM_WORLD, create that communicator first and assign it to
PETSC_COMM_WORLD
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PETSC_COMM_WORLD.html#PETSC_COMM_WORLD>
BEFORE calling PetscInitialize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>().

Thus if you are running a four process job and two processes will run
PETSc and have PetscInitialize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>()
and PetscFinalize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize>()
and two process will not, then do this. If ALL processes in
the job are using PetscInitialize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>()
and PetscFinalize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize>()
then you don't need to do this, even if different subcommunicators of
the job are doing different things with PETSc."
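(A minimal sketch of the scenario that quote describes, again assuming a rank % 4 split as above; only the ranks inside the subcommunicator ever call PetscInitialize()/PetscFinalize(), so PETSC_COMM_WORLD is assigned first. The split policy is an assumption, not something prescribed by the thread.)

#include <petscsys.h>

int main(int argc, char **argv)
{
  int      world_rank, in_petsc_group;
  MPI_Comm sub_comm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* assumption: one rank in four runs PETSc */
  in_petsc_group = (world_rank % 4 == 0);
  MPI_Comm_split(MPI_COMM_WORLD, in_petsc_group ? 0 : 1, world_rank, &sub_comm);

  if (in_petsc_group) {
    PETSC_COMM_WORLD = sub_comm;      /* must happen BEFORE PetscInitialize() */
    PetscInitialize(&argc, &argv, NULL, NULL);
    /* ... Mat/Vec/KSP work restricted to the subcommunicator ... */
    PetscFinalize();
  }
  /* the ranks outside the subcommunicator never see PETSc */

  MPI_Comm_free(&sub_comm);
  MPI_Finalize();
  return 0;
}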

I think I am not in this special scenario: because my matrix is
initially partitioned across 4 processes, I need to call
PetscInitialize() on all 4 processes in order to build the PETSc matrix
with MatSetValues(). And my goal is then to solve the linear system on
only 2 processes... So will building a sub-communicator really do the
trick? Or am I missing something?

oh, then I misunderstood your question. I thought that you wanted to run *your* code on 4N procs and let PETSc never see more than N procs when feeding the matrix.

Sorry, I was not very clear :-)

What you could do with 4N procs for PETSc is to define your own matrix layout, where only one out of four processes actually owns part of the matrix. After MatAssemblyBegin()/MatAssemblyEnd() the full data gets correctly transferred to N procs, with the other 3*N procs being 'empty'. You should then be able to run the solver with all 4*N processors, but only N of them actually do the work on the GPUs.
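(A minimal sketch of this layout, assuming a square AIJ matrix with n_global rows and the same rank % 4 ownership policy as above; n_global, the row partitioning among owning ranks, and the run-time choice of a GPU matrix type are all placeholders, not anything mandated by the thread.)

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  PetscInt    n_global = 1000000;    /* placeholder problem size */
  PetscInt    n_local, nowners, idx;
  PetscMPIInt rank, size;

  PetscInitialize(&argc, &argv, NULL, NULL);   /* all 4*N ranks run PETSc */
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
  MPI_Comm_size(PETSC_COMM_WORLD, &size);

  nowners = size / 4;                          /* assumes size is a multiple of 4 */
  if (rank % 4 == 0) {                         /* owning ranks: 0, 4, 8, ... */
    idx     = rank / 4;
    n_local = n_global / nowners + (idx < n_global % nowners ? 1 : 0);
  } else {
    n_local = 0;                               /* 'empty' ranks own no rows */
  }

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, n_local, n_local, n_global, n_global);
  MatSetFromOptions(A);                        /* e.g. -mat_type aijcusparse at run time */
  MatSetUp(A);

  /* every rank may call MatSetValues() with the entries it assembled locally;
     MatAssemblyBegin/End migrates them to the owning ranks */
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  /* ... KSP solve on PETSC_COMM_WORLD: all ranks participate, only N own data ... */

  MatDestroy(&A);
  PetscFinalize();
  return 0;
}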
OK, I understand your solution; I was already thinking about that, so thanks for confirming it. But my fear is that performance will not improve. Indeed, I still don't understand (even after analyzing -log_summary profiles and searching the petsc-dev archives) what slows down when several MPI tasks share one GPU, compared to one MPI task working with one GPU... In the proposed solution, 4*N processes will still exchange MPI messages during a KSP iteration, and the amount of data copied between the GPU and the CPU(s) will be the same, so if you could enlighten me, I would be glad.

A different question is whether you actually need all 4*N MPI ranks for the system assembly. You can make your life a lot easier if you only run with N MPI ranks upfront, particularly if the performance gains from N->4N procs in the assembly stage are small relative to the time spent in the solver.

Indeed, but they are not always small in our cases...

This may well be the case for memory-bandwidth-limited applications, where one process can utilize most of the available bandwidth. Either way, a test run with N procs will give you a good profiling baseline for whether you can expect any overall performance gain from GPUs in the solver stage. It may well be that you get faster solver times with some fancy multigrid preconditioning techniques in a purely CPU-based implementation, which are unavailable on GPUs. Also, your system size needs to be sufficiently large (100k unknowns per GPU as a rule of thumb) to hide PCI-Express latencies.

Indeed, the rule of thumb seems to be 100-150k unknowns per GPU for my app.

Thanks Karli, I really appreciate your advice,

PL

Best regards,
Karli



--
*Trio_U support team*
Marthe ROUX (01 69 08 00 02) Saclay
Pierre LEDAC (04 38 78 91 49) Grenoble
