________________________________________
From: petsc-users-bounces at mcs.anl.gov [petsc-users-bounces at mcs.anl.gov] 
on behalf of Barry Smith [[email protected]]
Sent: Wednesday, February 27, 2013 11:53 AM
To: PETSc users list
Subject: Re: [petsc-users] Problems exceeding 192 million unknowns in FEM code

   This sounds like an OpenMPI issue with the matrix element communication and
assembly process. In this process PETSc "stashes" any values set on the "wrong"
process; then at MatAssemblyBegin() each process computes how much data it is
sending to every other process, tells each receiving process how much data to
expect (the receivers then post receives), and then actually sends the data. We
have used this code over many years on many systems, so it is likely to be
relatively bug free. My feeling is that the OpenMPI hardware/software
combination is getting overwhelmed with all the messages that need to be sent
around.
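
   In code the pattern in question looks roughly like the minimal sketch below
(this is not your FEM application; the global size N, the diagonal entries, and
the use of ADD_VALUES are arbitrary illustrative choices). The value set for
row 0 on every rank other than rank 0 is the kind of entry that lands in the
stash and is communicated during assembly:

#include <petscmat.h>

int main(int argc,char **argv)
{
  Mat            A;
  PetscInt       i,rstart,rend,N = 1000;   /* arbitrary example size */
  PetscScalar    v = 1.0;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,NULL,NULL);if (ierr) return ierr;
  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,N,N);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);

  ierr = MatGetOwnershipRange(A,&rstart,&rend);CHKERRQ(ierr);

  /* Entries for locally owned rows go straight into local storage ... */
  for (i=rstart; i<rend; i++) {
    ierr = MatSetValue(A,i,i,v,ADD_VALUES);CHKERRQ(ierr);
  }
  /* ... while an entry for a row owned by another rank (row 0 lives on
     rank 0) is stashed until assembly. */
  ierr = MatSetValue(A,0,0,v,ADD_VALUES);CHKERRQ(ierr);

  /* Stash sizes are exchanged, receives are posted, and the stashed
     entries are sent and added on the owning processes here. */
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}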

   Since you are handling this communication completely differently with
Trilinos, it doesn't have this problem.
________________________________________

When a job using the Trilinos code goes south, it often dies at about the same
place as the PETSc version does: at its equivalent of the
MatAssemblyBegin()/MatAssemblyEnd() call (i.e., GlobalAssemble()). I suspect
that I may have a similar underlying problem in both the PETSc and Trilinos
versions of my code, though the details of how the problem is expressed may
differ.

________________________________________

   Since we can't change the hardware, the first thing to check is how much
(what percentage) of the matrix entries need to be stashed and moved between
processes. Run the 100 by 100 by 100 case on 8 cores with the -info option and
send the resulting output (it may be a lot) to petsc-maint at mcs.anl.gov (not
petsc-users) and we'll show you how to determine how much data is being moved
around. Perhaps you could also send the output from the successful 800-core run
(with -info).
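
   For example, something along these lines would capture it (the executable
name, its mesh options, and the log file name are placeholders for whatever you
normally use; the only essential addition is -info):

mpiexec -n 8 ./your_fem_app <options for the 100x100x100 case> -info > info_8cores.txt 2>&1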
________________________________________

Done.
