On 08/20/14 16:03, Karl Rupp wrote:
What you could do with 4N procs for PETSc is to define your own matrix
layout, where only one out of every four processes actually owns a part
of the matrix. After MatAssemblyBegin()/MatAssemblyEnd() the full data
gets transferred to the N owning procs, with the other 3*N procs staying
'empty'. You should then be able to run the solver with all 4*N
processes, with only N of them actually doing the work on the GPUs.
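A minimal sketch of such a layout (the global size, the grouping of ranks in
fours, and the use of plain MPIAIJ here are just placeholders; the GPU matrix
type would be selected at runtime via -mat_type, e.g. aijcusparse or
aijviennacl depending on your build):

/* Sketch only: a square MPIAIJ matrix where just one rank out of every
 * group of four owns rows; the other three own zero rows. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  PetscMPIInt    rank, size;
  PetscInt       Nglobal = 1000000, nlocal = 0, ngroups;  /* hypothetical size */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);
  ierr = MPI_Comm_size(PETSC_COMM_WORLD, &size);CHKERRQ(ierr);

  ngroups = size / 4;                 /* assumes size is a multiple of 4 */
  if (rank % 4 == 0) {                /* the one rank per group that drives the GPU */
    PetscInt g = rank / 4;
    nlocal = Nglobal / ngroups + (g < Nglobal % ngroups ? 1 : 0);
  }                                   /* all other ranks keep nlocal = 0 */

  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, nlocal, nlocal, Nglobal, Nglobal);CHKERRQ(ierr);
  ierr = MatSetType(A, MATMPIAIJ);CHKERRQ(ierr);  /* override with -mat_type at runtime */
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);

  /* MatSetValues() can be called from any rank; assembly ships the entries
   * to the N owning ranks, the other 3*N ranks stay empty. */
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}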
OK, I understand your solution; I was already thinking along those lines,
so thanks for confirming it. My fear, however, is that performance would
not improve. Indeed, I still don't understand (even after
analyzing -log_summary profiles and searching the petsc-dev archives)
what slows things down when several MPI tasks share one GPU, compared to
one MPI task working with one GPU...
In the proposed solution, the 4*N processes will still exchange MPI messages
during a KSP iteration, and the amount of data copied between the GPU and
the CPU(s) will be the same, so if you could enlighten me, I would be glad.
One of the causes of the performance penalty you observe is the increased
PCI-Express traffic: each rank has to copy its local piece of the input
vector to the device and its local piece of the result back, i.e. two
transfers per rank and matrix-vector product. If four ranks share a single
GPU, each matrix-vector product therefore requires at least 8 vector
transfers between host and device, rather than just 2 with a single MPI
rank. Similarly, you have four times the number of kernel launches. It may
well be that these overheads eat up all the performance gains you would
otherwise obtain. I don't know your profiling data, so I can't be more
specific at this point.
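If you want to see this in the -log_summary output, one option is to push a
separate logging stage around a batch of MatMult() calls and compare the two
configurations (one rank per GPU vs. four ranks per GPU). A minimal sketch,
assuming A, x, and y are already assembled and that the usual PETSc error
checking context is in place (the stage name and the repeat count of 100 are
arbitrary):

/* Isolate repeated MatMult() calls in their own logging stage so that
 * -log_summary reports them (and the associated host<->device vector
 * copy events) separately from setup. */
PetscLogStage  stage;
PetscInt       i;
PetscErrorCode ierr;

ierr = PetscLogStageRegister("MatMult benchmark", &stage);CHKERRQ(ierr);
ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
for (i = 0; i < 100; i++) {            /* arbitrary repeat count */
  ierr = MatMult(A, x, y);CHKERRQ(ierr);
}
ierr = PetscLogStagePop();CHKERRQ(ierr);

Comparing the MatMult time and the vector copy-to/from-GPU events (exact
event names depend on the backend) in this stage between the two run
configurations should show where the extra time goes.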
Thanks a lot Karli for the explanations. I am currently trying your
solution.
Pierre
Best regards,
Karli
--
*Trio_U support team*
Marthe ROUX (01 69 08 00 02) Saclay
Pierre LEDAC (04 38 78 91 49) Grenoble