On Tue, Oct 26, 2010 at 04:52:22PM -0600, Dave Wade-Stein wrote:

> We have a customer for whom parallel I/O is hanging, and they are
> using -id as described above. We're trying to pinpoint why parallel
> I/O is not working on their system, which is a CentOS 5.5 cluster.

It would be really helpful to see the state of these processes when a
hang occurs.  Are they stuck in an I/O call?  Or stuck in a collective
because not every process participated?  If they are stuck in a
collective, is it an I/O collective or a messaging collective?

How parallel is this program?  If we're talking 4-way or 8-way
parallelism, then maybe one can run it under gdb and collect a backtrace
from every process?   (mpiexec -np 8 xterm -e gdb ...)
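
If opening one xterm per rank is awkward (for example, no X forwarding
on the cluster), a non-interactive variant is to attach gdb to each
already-hung process and dump its stacks.  The following is only a
minimal sketch under stated assumptions: gdb is installed on the node,
the user is allowed to attach to the processes, and the PIDs of the hung
ranks are passed on the command line.  The helper names
gdb_backtrace_command and collect_backtraces are made up for
illustration.

```python
import subprocess
import sys


def gdb_backtrace_command(pid):
    """Build a gdb invocation that attaches to pid, prints every thread's stack, and exits."""
    # -batch: run the given commands and exit; -p: attach to a running process;
    # -ex: execute one gdb command ("thread apply all bt" prints all stacks).
    return ["gdb", "-batch", "-p", str(pid), "-ex", "thread apply all bt"]


def collect_backtraces(pids, timeout=60):
    """Return a dict mapping each PID to the text gdb printed for it."""
    traces = {}
    for pid in pids:
        result = subprocess.run(
            gdb_backtrace_command(pid),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        # Keep stderr too: attach failures (permissions, bad PID) show up there.
        traces[pid] = result.stdout + result.stderr
    return traces


if __name__ == "__main__":
    # Assumption: the arguments are PIDs of the hung MPI ranks on this node,
    # found by hand (for example with ps).
    pids = [int(arg) for arg in sys.argv[1:]]
    for pid, trace in collect_backtraces(pids).items():
        print(f"===== pid {pid} =====")
        print(trace)
```

The script would have to be run on each node that hosts ranks.  Comparing
the top frames across ranks should show whether everyone is sitting in
the same collective or whether some ranks never reached it.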

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

