Hey,
>>> Is there a way to run a calculation with 4*N MPI tasks, where
my matrix is first built outside PETSc, and then solve the
linear system using PETSc Mat, Vec, KSP on only N MPI
tasks in order to efficiently address the N GPUs?
As far as I can tell, this should be possible with a suitable
subcommunicator. The tricky piece, however, is to select the right MPI
ranks for this. Note that you generally have no guarantee of how the
MPI ranks are distributed across the nodes, so be prepared for
something fairly specific to your MPI installation.
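Just to illustrate what I have in mind (a rough sketch only: the
"node-local rank 0 drives a GPU" selection below is purely an example,
and MPI_Comm_split_type() requires an MPI-3 implementation; adapt the
selection to your machine):

#include <mpi.h>
#include <petscsys.h>

int main(int argc, char **argv)
{
  MPI_Comm node_comm, petsc_comm;
  int      world_rank, node_rank;

  /* Initialize MPI ourselves so that PETSc only ever sees the
     subcommunicator built below. */
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* Group the ranks sharing a node, then pick one rank per node
     (node-local rank 0) as the GPU-driving rank - example only. */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                      MPI_INFO_NULL, &node_comm);
  MPI_Comm_rank(node_comm, &node_rank);
  MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                 world_rank, &petsc_comm);

  if (petsc_comm != MPI_COMM_NULL) {
    PETSC_COMM_WORLD = petsc_comm;  /* set BEFORE PetscInitialize() */
    PetscInitialize(&argc, &argv, NULL, NULL);
    /* ... Mat/Vec/KSP work on the selected N ranks only ... */
    PetscFinalize();
    MPI_Comm_free(&petsc_comm);
  } else {
    /* ... the remaining ranks never touch PETSc ... */
  }

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}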
Yes, I am ready to face this point too.
Okay, good to know that you are aware of this.
I also started the work with a purely CPU-based solve, just as a test,
but without success. I then read this:
"If you wish PETSc code to run ONLY on a subcommunicator of
MPI_COMM_WORLD, create that communicator first and assign it to
PETSC_COMM_WORLD
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PETSC_COMM_WORLD.html#PETSC_COMM_WORLD>
BEFORE calling PetscInitialize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>().
Thus if you are running a four process job and two processes will run
PETSc and have PetscInitialize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>()
and PetscFinalize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize>()
and two process will not, then do this. If ALL processes in
the job are using PetscInitialize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscInitialize.html#PetscInitialize>()
and PetscFinalize
<http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Sys/PetscFinalize.html#PetscFinalize>()
then you don't need to do this, even if different subcommunicators of
the job are doing different things with PETSc."
I don't think I am in this special scenario: since my matrix is
initially partitioned across 4 processes, I need to call
PetscInitialize() on all 4 processes in order to build the PETSc matrix
with MatSetValues(). My goal is then to solve the linear system on only
2 processes... So will building a sub-communicator really do the trick?
Or am I missing something?
Oh, then I misunderstood your question. I thought you wanted to run
*your* code on 4*N procs and never let PETSc see more than N procs when
feeding in the matrix.
What you could do with 4*N procs for PETSc is to define your own matrix
layout, where only one out of four processes actually owns a part of
the matrix. After MatAssemblyBegin()/MatAssemblyEnd() the full data gets
correctly transferred to those N procs, with the other 3*N procs being
'empty'. You should then be able to run the solver on all 4*N
processes, but only N of them actually do the work on the GPUs.
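In code, such a layout could look roughly like this (only a sketch: the
helper name is mine, and I assume a square matrix whose global size
nglobal is evenly divisible by the number of owning ranks; adjust the
local sizes to your actual partitioning):

#include <petscmat.h>

/* All 4*N ranks participate in the matrix, but only every fourth rank
   owns rows; the remaining ranks get a local block of size 0. */
PetscErrorCode create_unevenly_owned_matrix(PetscInt nglobal, Mat *A)
{
  PetscErrorCode ierr;
  PetscMPIInt    rank, size;
  PetscInt       nlocal;

  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
  ierr = MPI_Comm_size(PETSC_COMM_WORLD, &size);CHKERRQ(ierr);

  /* Nonzero local size only on every fourth rank (example split). */
  nlocal = (rank % 4 == 0) ? nglobal / (size / 4) : 0;

  ierr = MatCreate(PETSC_COMM_WORLD, A);CHKERRQ(ierr);
  ierr = MatSetSizes(*A, nlocal, nlocal, PETSC_DETERMINE, PETSC_DETERMINE);CHKERRQ(ierr);
  ierr = MatSetFromOptions(*A);CHKERRQ(ierr);  /* e.g. pick a GPU type via -mat_type */
  ierr = MatSetUp(*A);CHKERRQ(ierr);

  /* Every rank may now call MatSetValues() with the entries it holds;
     MatAssemblyBegin()/MatAssemblyEnd() ships them to the owning
     ranks. */
  return 0;
}

After assembly, MatGetOwnershipRange() tells each rank which rows it
ended up owning, which you can reuse to lay out the right-hand side
vector in the same uneven way.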
A different question is whether you actually need all 4*N MPI ranks for
the system assembly. You can make your life a lot easier if you only run
with N MPI ranks in the first place, particularly if the performance
gain from going N -> 4*N procs in the assembly stage is small relative
to the time spent in the solver. This may well be the case for memory
bandwidth limited applications, where one process can already utilize
most of the available bandwidth. Either way, a test run with N procs
will give you a good profiling baseline for judging whether you can
expect any overall performance gain from GPUs in the solver stage. It
may well be that you get faster solver times with some fancy multigrid
preconditioning techniques in a purely CPU-based run that are
unavailable on GPUs. Also, your system size needs to be sufficiently
large (100k unknowns per GPU as a rule of thumb) to hide PCI-Express
latencies.
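If it helps for that baseline: a minimal solver driver that leaves all
solver choices to the command line could look roughly like the sketch
below (the helper name is mine, and depending on your PETSc version
KSPSetOperators() may take an additional MatStructure argument). That
way, e.g. a CPU multigrid run (-pc_type gamg) and a run with the GPU
matrix/vector types of your build (-mat_type / -vec_type) can be
profiled with the same code via -log_summary / -log_view:

#include <petscksp.h>

/* Solve A x = b with the KSP and PC chosen at runtime. */
PetscErrorCode solve_system(Mat A, Vec b, Vec x)
{
  PetscErrorCode ierr;
  KSP            ksp;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);  /* -ksp_type, -pc_type, ... */
  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  return 0;
}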
Best regards,
Karli