On Tue, Oct 26, 2010 at 04:52:22PM -0600, Dave Wade-Stein wrote:
> We have a customer for whom parallel I/O is hanging, and they are
> using -id as described above. We're trying to pinpoint why parallel
> I/O is not working on their system, which is CentOS 5.5 cluster.

It would be really helpful to see the state of these processes when a
hang occurs. Are they stuck in an I/O call? Stuck in a collective
because not everyone participated? If they are stuck in a collective,
is it an I/O collective or a messaging collective?

How parallel is this program? If we're talking 4-way or 8-way
parallelism, then maybe one can run it in gdb and collect a backtrace
of all the processes? (mpiexec -np 8 xterm -e gdb ...)

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
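
If opening one xterm per rank is not practical, the questions above could
also be answered from batch-mode backtraces. What follows is only a rough
sketch, not something from the original message: the helper names
(gdb_backtrace_command, frame_functions, classify) are hypothetical, the
lists of MPI routines are a small illustrative subset, and real stacks
differ between MPI implementations. It assumes gdb is installed on the
compute nodes and that you already know the PID of each hung rank.

```python
import re

# Illustrative subsets only; extend them for the calls your program makes.
IO_COLLECTIVES = {
    "MPI_File_open", "MPI_File_close", "MPI_File_set_view",
    "MPI_File_read_all", "MPI_File_write_all",
    "MPI_File_read_at_all", "MPI_File_write_at_all",
}
MSG_COLLECTIVES = {
    "MPI_Barrier", "MPI_Bcast", "MPI_Reduce", "MPI_Allreduce",
    "MPI_Gather", "MPI_Allgather", "MPI_Alltoall",
}
IO_CALLS = ("read", "write", "pread", "pwrite", "fsync")

# Matches gdb frame lines such as "#0  0x0000abcd in write () from ..."
# or "#3  main () at app.c:42" and captures the function name.
FRAME_RE = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?([^\s(]+)")


def gdb_backtrace_command(pid):
    """Build a gdb command that attaches to pid, prints all thread
    backtraces, and exits (run it with subprocess on the node)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def frame_functions(backtrace):
    """Return the function names in a backtrace, innermost frame first."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def classify(backtrace):
    """Guess where a rank is stuck, based on the outermost MPI call."""
    names = frame_functions(backtrace)
    for name in reversed(names):  # walk from main() inward
        # Profiling entry points (PMPI_*) are treated like MPI_*.
        mpi_name = name[1:] if name.startswith("PMPI_") else name
        if mpi_name in IO_COLLECTIVES:
            return "I/O collective"
        if mpi_name in MSG_COLLECTIVES:
            return "messaging collective"
    # No known collective on the stack: check the innermost frame.
    if names and names[0].lstrip("_").startswith(IO_CALLS):
        return "I/O call"
    return "other"
```

Running classify on the output collected from each rank gives a quick
table of who is waiting where. If most ranks report the same collective
and one or two report something else, those are the ranks to look at
first.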
