Wasn't aware of IOR, thanks for the tip. We'll give that a try.

Dave

On Oct 26, 2010, at 5:45 PM, Mark Howison wrote:

> Have you tried using a benchmark like IOR to stress the NFS file
> system? Maybe it is a problem with NFS and not the underlying file
> system or HDF5. Mark
>
> On Tue, Oct 26, 2010 at 7:39 PM, Dave Wade-Stein <[email protected]> wrote:
>> As to MPI, we're both using openmpi 1.4.1.
>>
>> We're both using NFS file systems which are formatted as xfs. As I
>> mentioned, we had problems with ext3 filesystems, which were alleviated when
>> we reformatted as xfs. Unfortunately, that didn't work for the customer.
>>
>> Thanks,
>> Dave
>>
>> On Oct 26, 2010, at 5:36 PM, Mark Howison wrote:
>>
>>> I guess it could depend on the MPI library, but most likely not. What
>>> parallel file system is used on the customer's machine? Mark
>>>
>>> On Tue, Oct 26, 2010 at 7:25 PM, Dave Wade-Stein <[email protected]> wrote:
>>>> Mark,
>>>>
>>>> The same code hangs on the customer machine, but works fine on our
>>>> clusters. Would that be possible if some subset aren't participating in
>>>> the I/O?
>>>>
>>>> Thanks,
>>>> Dave
>>>>
>>>> On Oct 26, 2010, at 5:14 PM, Mark Howison wrote:
>>>>
>>>>> Hi Dave,
>>>>>
>>>>> One common hang with collective-mode parallel I/O in HDF5 is when only
>>>>> a subset of processes are participating in the I/O, but the other
>>>>> processes haven't made an empty selection (to say that they are not
>>>>> participating) using H5Sselect_none(). Also, have you tried
>>>>> experimenting with collective vs. independent mode?
>>>>>
>>>>> Mark
>>>>>
>>>>> On Tue, Oct 26, 2010 at 6:52 PM, Dave Wade-Stein <[email protected]> wrote:
>>>>>> We use hdf5 for parallel I/O in VORPAL, our laser plasma simulation
>>>>>> code. For the most part, it works fine, but on certain machines (e.g.,
>>>>>> early Cray and BG/P) and certain types of filesystems, we've noticed
>>>>>> that parallel I/O hangs, so we instituted a -id (individual dump) option
>>>>>> which causes each MPI rank to dump its own hdf5 file, and once the
>>>>>> simulation is complete, we merge the individual dump files.
>>>>>>
>>>>>> We have a customer for whom parallel I/O is hanging, and they are using
>>>>>> -id as described above. We're trying to pinpoint why parallel I/O is not
>>>>>> working on their system, which is CentOS 5.5 cluster.
>>>>>>
>>>>>> In the past we ourselves have had problems with parallel I/O failing on
>>>>>> ext3 filesystems, so we reformatted as XFS and the problem went away.
>>>>>> Our customer did this, but the problem still persists.
>>>>>>
>>>>>> Anyone have any words of wisdom as to what other things could cause
>>>>>> parallel I/O to hang?
>>>>>>
>>>>>> Thanks for any help!
>>>>>> Dave

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
