On 08/20/14 13:14, Karl Rupp wrote:
Hey,

>>> Is there a way to run a calculation with 4*N MPI tasks where
>>> my matrix is first built outside PETSc, and then solve the
>>> linear system using PETSc Mat, Vec, KSP on only N MPI
>>> tasks, so as to address the N GPUs efficiently?

as far as I can tell, this should be possible with a suitable
subcommunicator. The tricky piece, however, is to select the right MPI
ranks for this. Note that you generally have no guarantee on how the
MPI ranks are distributed across the nodes, so be prepared for
something fairly specific to your MPI installation.
Yes, I am ready to face this point too.

Okay, good to know that you are aware of this.
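(To make the subcommunicator idea concrete, here is a minimal MPI-only sketch. It assumes that every fourth rank of MPI_COMM_WORLD is the one that should drive a GPU; the rank % 4 policy is only an assumption, and whether it matches the physical node layout is, as noted above, specific to the MPI installation.)

/* Sketch: build a subcommunicator containing one out of every four
 * ranks of MPI_COMM_WORLD. The "rank % 4" policy is an assumption;
 * how ranks map to nodes depends on the MPI installation and launcher. */
#include <mpi.h>

int main(int argc, char **argv)
{
  int      world_rank, color;
  MPI_Comm gpu_comm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* color 0: ranks that will own matrix data and talk to a GPU;
     color 1: the remaining ranks */
  color = (world_rank % 4 == 0) ? 0 : 1;
  MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &gpu_comm);

  /* ... hand gpu_comm (on the color-0 ranks) to PETSc, see below ... */

  MPI_Comm_free(&gpu_comm);
  MPI_Finalize();
  return 0;
}

Using world_rank as the key in MPI_Comm_split keeps the MPI_COMM_WORLD ordering inside each subcommunicator.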



I also started on a purely CPU-based solve, just as a test, but
without success. When I read this:

"If you wish PETSc code to run ONLY on a subcommunicator of
MPI_COMM_WORLD, create that communicator first and assign it to
PETSC_COMM_WORLD
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PETSC_COMM_WORLD.html#PETSC_COMM_WORLD>
BEFORE calling PetscInitialize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>().

Thus if you are running a four process job and two processes will run
PETSc and have PetscInitialize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>()
and PetscFinalize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize>()
and two process will not, then do this. If ALL processes in
the job are using PetscInitialize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>()
and PetscFinalize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize>()
then you don't need to do this, even if different subcommunicators of
the job are doing different things with PETSc."
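(A minimal sketch of the scenario that quote describes, again assuming a rank % 4 split as above; only the ranks inside the subcommunicator ever call PetscInitialize()/PetscFinalize(), so PETSC_COMM_WORLD is assigned first. The split policy is an assumption, not something prescribed by the thread.)

#include <petscsys.h>

int main(int argc, char **argv)
{
  int      world_rank, in_petsc_group;
  MPI_Comm sub_comm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* assumption: one rank in four runs PETSc */
  in_petsc_group = (world_rank % 4 == 0);
  MPI_Comm_split(MPI_COMM_WORLD, in_petsc_group ? 0 : 1, world_rank, &sub_comm);

  if (in_petsc_group) {
    PETSC_COMM_WORLD = sub_comm;      /* must happen BEFORE PetscInitialize() */
    PetscInitialize(&argc, &argv, NULL, NULL);
    /* ... Mat/Vec/KSP work restricted to the subcommunicator ... */
    PetscFinalize();
  }
  /* the ranks outside the subcommunicator never see PETSc */

  MPI_Comm_free(&sub_comm);
  MPI_Finalize();
  return 0;
}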

I think I am not in this special scenario: because my matrix is
initially partitioned across 4 processes, I need to call
PetscInitialize() on all 4 processes in order to build the PETSc matrix
with MatSetValues(). And my goal is then to solve the linear system on
only 2 processes... So will building a sub-communicator really do the
trick? Or am I missing something?

oh, then I misunderstood your question. I thought that you wanted to run *your* code on 4N procs and let PETSc never see more than N procs when feeding the matrix.

Sorry, I was not very clear :-)

What you could do with 4N procs for PETSc is to define your own matrix layout, where only one out of four processes actually owns part of the matrix. After MatAssemblyBegin()/MatAssemblyEnd() the full data gets correctly transferred to N procs, with the other 3*N procs being 'empty'. You should then be able to run the solver with all 4*N processors, but only N of them actually do the work on the GPUs.
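(A minimal sketch of this layout, assuming a square AIJ matrix with n_global rows and the same rank % 4 ownership policy as above; n_global, the row partitioning among owning ranks, and the run-time choice of a GPU matrix type are all placeholders, not anything mandated by the thread.)

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  PetscInt    n_global = 1000000;    /* placeholder problem size */
  PetscInt    n_local, nowners, idx;
  PetscMPIInt rank, size;

  PetscInitialize(&argc, &argv, NULL, NULL);   /* all 4*N ranks run PETSc */
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
  MPI_Comm_size(PETSC_COMM_WORLD, &size);

  nowners = size / 4;                          /* assumes size is a multiple of 4 */
  if (rank % 4 == 0) {                         /* owning ranks: 0, 4, 8, ... */
    idx     = rank / 4;
    n_local = n_global / nowners + (idx < n_global % nowners ? 1 : 0);
  } else {
    n_local = 0;                               /* 'empty' ranks own no rows */
  }

  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, n_local, n_local, n_global, n_global);
  MatSetFromOptions(A);                        /* e.g. -mat_type aijcusparse at run time */
  MatSetUp(A);

  /* every rank may call MatSetValues() with the entries it assembled locally;
     MatAssemblyBegin/End migrates them to the owning ranks */
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  /* ... KSP solve on PETSC_COMM_WORLD: all ranks participate, only N own data ... */

  MatDestroy(&A);
  PetscFinalize();
  return 0;
}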
OK, I understand your solution; I was already thinking about that, so thanks for confirming it. But my fear is that performance will not improve. Indeed, I still don't understand (even after analyzing -log_summary profiles and searching the petsc-dev archives) what slows down when several MPI tasks share one GPU, compared to one MPI task working with one GPU... In the proposed solution, 4*N processes will still exchange MPI messages during a KSP iteration, and the amount of data copied between the GPU and the CPU(s) will be the same, so if you could enlighten me, I would be glad.

A different question is whether you actually need all 4*N MPI ranks for the system assembly. You can make your life a lot easier if you only run with N MPI ranks upfront, particularly if the performance gains from N->4N procs in the assembly stage are small relative to the time spent in the solver.

Indeed, but they are not always small in our cases...

This may well be the case for memory-bandwidth-limited applications, where one process can utilize most of the available bandwidth. Either way, a test run with N procs will give you a good profiling baseline for whether you can expect any overall performance gain from GPUs in the solver stage. It may well be that you get faster solver times with some fancy multigrid preconditioning techniques in a purely CPU-based implementation, which are unavailable on GPUs. Also, your system size needs to be sufficiently large (100k unknowns per GPU as a rule of thumb) to hide PCI-Express latencies.

Indeed, the rule of thumb seems to be 100-150k unknowns per GPU for my app.

Thanks Karli, I really appreciate your advice,

PL

Best regards,
Karli



--
*Trio_U support team*
Marthe ROUX (01 69 08 00 02) Saclay
Pierre LEDAC (04 38 78 91 49) Grenoble
