> On May 30, 2015, at 10:14 PM, Harshad Sahasrabudhe <[email protected]> 
> wrote:
> 
> Is your intent to solve a problem that matters in a way that makes sense for 
> a scientist or engineer
> 
> I want to see if we can speed up the time stepper for a large system using 
> GPUs. For a large system with a sparse matrix of size 420,000 x 420,000, each 
> time step takes 341 seconds on a single process and 180 seconds on 16 
> processes.

   Rather than going off on a wild goose chase it would be good to understand 
WHY 1) the time on one process is so poor and 2) why the speedup to 16 
processes is so low. This means gathering information and then analyzing that 
information.

   So first you need to measure the memory bandwidth of your system for 1 to 16 
processes. This is explained at 
http://www.mcs.anl.gov/petsc/documentation/faq.html#computers : run 
"make streams NPMAX=16", then use the MPI "binding" options to see if they 
improve the streams numbers. What do you get for these numbers?
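
   To be concrete, something like the following, run from the top of the PETSc 
source tree (./your_app below is just a stand-in for your executable; the 
binding flag spelling is MPICH's, Open MPI writes it as --bind-to core, so 
adjust for whatever mpiexec you have):

      cd $PETSC_DIR
      make streams NPMAX=16
      # then see whether process binding changes the picture, both for the
      # streams benchmark and for your own runs, e.g.
      mpiexec -n 16 -bind-to core ./your_app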

   Next you need to run your PETSc application with -log_summary to see how 
much time is spent in the linear solve, how much time is spent in each part of 
the linear solve, and how many iterations it is taking. To start, run with 
-log_summary and 1 MPI process, 2 MPI processes, 4 MPI processes, 8 MPI 
processes and 16 MPI processes. What do you get for these numbers?
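
   Concretely, that is just a sequence of runs like the following, again with 
./your_app standing in for your executable and its usual arguments:

      mpiexec -n 1  ./your_app -log_summary
      mpiexec -n 2  ./your_app -log_summary
      mpiexec -n 4  ./your_app -log_summary
      mpiexec -n 8  ./your_app -log_summary
      mpiexec -n 16 ./your_app -log_summary

   In the tables -log_summary prints, pay particular attention to the time, 
flop rate, and load-balance ratio for the KSPSolve, MatMult, PCSetUp and 
PCApply events.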

   In addition to 1) and 2) you need to determine what is a good preconditioner 
for YOUR problem. Linear iterative solvers are not black box solvers; using an 
inappropriate preconditioner can make many orders of magnitude difference in 
solution time (more than changing the hardware). If your problem is a nice 
elliptic operator then something like -pc_type gamg might work well (or you can 
try the external packages with -pc_type hypre or -pc_type ml; these require 
installing the optional external packages; see 
http://www.mcs.anl.gov/petsc/documentation/linearsolvertable.html ). If your 
problem is a saddle-point (e.g. Stokes) problem then you likely need to use the 
PCFIELDSPLIT preconditioner to "pull out" the saddle-point part. For more 
complicated simulations you will need nesting of several preconditioners.
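
   To make that concrete, these are the kinds of option combinations I have in 
mind (again ./your_app is a placeholder; -ksp_view and -ksp_monitor just report 
which solver is actually being used and how it converges):

      # PETSc's built-in algebraic multigrid, a reasonable start for elliptic problems
      mpiexec -n 16 ./your_app -ksp_type gmres -pc_type gamg -ksp_monitor -ksp_view -log_summary

      # hypre BoomerAMG (PETSc must be configured with --download-hypre)
      mpiexec -n 16 ./your_app -pc_type hypre -pc_hypre_type boomeramg -ksp_monitor -log_summary

      # Schur-complement fieldsplit for saddle-point problems; the fields themselves
      # must be defined first, e.g. via a DM or PCFieldSplitSetIS()
      mpiexec -n 16 ./your_app -pc_type fieldsplit -pc_fieldsplit_type schur -ksp_monitor -log_summary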

  Barry





> So the scaling isn't that good. We also run out of memory with a larger 
> number of processes. 
> 
> On Sat, May 30, 2015 at 11:01 PM, Jed Brown <[email protected]> wrote:
> Harshad Sahasrabudhe <[email protected]> writes:
> > For now, I want to serialize the matrices and vectors and offload them to 1
> > GPU from the root process. Then distribute the result later.
> 
> Unless you have experience with these solvers and the overheads
> involved, I think you should expect this to be much slower than simply
> doing the solves using a reasonable method in the CPU.  Is your intent
> to solve a problem that matters in a way that makes sense for a
> scientist or engineer, or is it to demonstrate that a particular
> combination of packages/methods/hardware can be used?
> 
