Wasn't aware of IOR, thanks for the tip. We'll give it a try.

Dave


On Oct 26, 2010, at 5:45 PM, Mark Howison wrote:

> Have you tried using a benchmark like IOR to stress the NFS file
> system? Maybe it is a problem with NFS and not the underlying file
> system or HDF5. Mark
> 
> On Tue, Oct 26, 2010 at 7:39 PM, Dave Wade-Stein <[email protected]> wrote:
>> As to MPI, we're both using openmpi 1.4.1.
>> 
>> We're both using NFS file systems formatted as XFS. As I mentioned, we had 
>> problems with ext3 filesystems, which were alleviated when we reformatted 
>> as XFS. Unfortunately, that didn't fix things for the customer.
>> 
>> Thanks,
>> Dave
>> 
>> On Oct 26, 2010, at 5:36 PM, Mark Howison wrote:
>> 
>>> I guess it could depend on the MPI library, but most likely not. What
>>> parallel file system is used on the customer's machine? Mark
>>> 
>>> On Tue, Oct 26, 2010 at 7:25 PM, Dave Wade-Stein <[email protected]> wrote:
>>>> Mark,
>>>> 
>>>> The same code hangs on the customer's machine but works fine on our 
>>>> clusters. Would that be possible if a subset of processes weren't 
>>>> participating in the I/O?
>>>> 
>>>> Thanks,
>>>> Dave
>>>> 
>>>> On Oct 26, 2010, at 5:14 PM, Mark Howison wrote:
>>>> 
>>>>> Hi Dave,
>>>>> 
>>>>> One common hang with collective-mode parallel I/O in HDF5 is when only
>>>>> a subset of processes are participating in the I/O, but the other
>>>>> processes haven't made an empty selection (to say that they are not
>>>>> participating) using H5Sselect_none(). Also, have you tried
>>>>> experimenting with collective vs. independent mode?
>>>>> 
>>>>> Mark
>>>>> 
>>>>> On Tue, Oct 26, 2010 at 6:52 PM, Dave Wade-Stein <[email protected]> wrote:
>>>>>> We use HDF5 for parallel I/O in VORPAL, our laser plasma simulation 
>>>>>> code. For the most part it works fine, but on certain machines (e.g., 
>>>>>> early Cray and BG/P) and certain types of filesystems we've noticed 
>>>>>> that parallel I/O hangs. So we added a -id (individual dump) option 
>>>>>> that causes each MPI rank to dump its own HDF5 file; once the 
>>>>>> simulation is complete, we merge the individual dump files.
>>>>>> 
>>>>>> We have a customer for whom parallel I/O is hanging, and they are using 
>>>>>> -id as described above. We're trying to pinpoint why parallel I/O is not 
>>>>>> working on their system, which is a CentOS 5.5 cluster.
>>>>>> 
>>>>>> In the past we ourselves have had problems with parallel I/O failing on 
>>>>>> ext3 filesystems, so we reformatted as XFS and the problem went away. 
>>>>>> Our customer did this, but the problem still persists.
>>>>>> 
>>>>>> Anyone have any words of wisdom as to what other things could cause 
>>>>>> parallel I/O to hang?
>>>>>> 
>>>>>> Thanks for any help!
>>>>>> Dave
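
The empty-selection pattern Mark describes above might look like the sketch below (untested; it assumes a parallel HDF5 build with MPI-IO, and the file name, dataset name, and the "only even ranks write" rule are purely illustrative). The key point is that every rank enters the collective H5Dwrite, but non-participating ranks call H5Sselect_none() on both the file and memory dataspaces:

```c
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open the file with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("dump.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t dims[1] = { (hsize_t)nprocs };   /* one value per rank */
    hid_t fspace = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, fspace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Request collective transfer mode. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    double value = (double)rank;
    hid_t mspace;
    if (rank % 2 == 0) {
        /* Participating ranks select their one-element hyperslab. */
        hsize_t start[1] = { (hsize_t)rank }, count[1] = { 1 };
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        mspace = H5Screate_simple(1, count, NULL);
    } else {
        /* Non-participating ranks still make the collective call,
           but with empty selections so it can complete. */
        H5Sselect_none(fspace);
        mspace = H5Screate_simple(1, dims, NULL);
        H5Sselect_none(mspace);
    }

    /* Collective: ALL ranks must reach this call. */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, &value);

    H5Sclose(mspace);
    H5Pclose(dxpl);
    H5Dclose(dset);
    H5Sclose(fspace);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```

If a rank skips the H5Dwrite entirely instead of passing empty selections, the remaining ranks block inside the collective call, which matches the hang symptom described in this thread.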


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org