On 4/23/2013 11:19 AM, Maxime Boissonneault wrote:
Hi,
I am trying to write a ~24 GB array of floats to a file with
PHDF5. I am running on a Lustre PFS, with IB networking. The software
runs on 128 processes, spread across 16 nodes of 8 cores each. The MPI
implementation is OpenMPI 1.6.3, and HDF5 is 1.8.10.
Each process writes one regular hyperslab at its own offset. The
hyperslabs are not all exactly the same size, but they are close, so
each process should be writing around 192 MB of data.
For some reason, it seems that if I set
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);
only the master node writes anything into the resulting file (and it
takes ~10 minutes to write it).
Are you saying that the master node writes its own data plus all the data
of the other ranks? Or are you saying that there is a bug where only the
master node's data gets written and the other ranks' data never makes it
into the file? (I assume it's the former.)
But yes, that shouldn't happen. Could it be that the default number of
aggregators OpenMPI sets in ROMIO is 1?
BTW how did you determine that only the master node is writing data? Did
you add printfs in MPI_File_write_at_all?
HDF5 just calls into MPI-I/O with the data to be written, so the MPI-I/O
library selects the number of aggregators (writers).
Could you set cb_nodes to something like 128 and try that? (You can vary
it to better tune your I/O.) You can set it through the info object you
pass to H5Pset_fapl_mpio().
Also set cb_buffer_size to something like your Lustre stripe size.
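Something along these lines (a rough sketch only; the hint values are
placeholders you would tune for your system, and the helper function name
is just for illustration):

#include <hdf5.h>
#include <mpi.h>

/* Sketch: pass ROMIO collective-buffering hints through the file access
 * property list. The hint values below are placeholders to tune. */
hid_t create_file_with_hints(MPI_Comm comm, const char *name)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "128");            /* number of aggregators */
    MPI_Info_set(info, "cb_buffer_size", "4194304");  /* e.g. your Lustre stripe size, in bytes */

    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl_id, comm, info);

    hid_t file_id = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);

    H5Pclose(fapl_id);
    MPI_Info_free(&info);
    return file_id;  /* then create the dataset and do the collective H5Dwrite as before */
}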
If instead I set
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_INDEPENDENT);
all nodes write data, and it takes ~4-5 minutes to write the whole file.
OK, this behavior is normal; independent is, well, independent :-)
I was expecting two things that I don't see happening:
1) With collective I/O, I would expect all ranks to write.
This is not correct. All ranks issue writes at the HDF5 level, but not
all ranks write at the MPI-I/O level. Depending on the collective
algorithm (e.g. two-phase I/O), only a subset of ranks (the cb_nodes
aggregators) actually writes the data to the file.
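One way to see what actually happened is to query the transfer property
list after the H5Dwrite; HDF5 1.8.10 can report whether the write stayed
collective or fell back to independent. A minimal sketch, assuming
plist_id is the same transfer property list you used for the write:

#include <hdf5.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch: after the collective H5Dwrite, ask HDF5 what it actually did. */
static void report_io_mode(hid_t plist_id)
{
    H5D_mpio_actual_io_mode_t io_mode;
    H5Pget_mpio_actual_io_mode(plist_id, &io_mode);

    uint32_t local_cause, global_cause;
    H5Pget_mpio_no_collective_cause(plist_id, &local_cause, &global_cause);

    if (io_mode == H5D_MPIO_NO_COLLECTIVE)
        printf("write fell back to independent I/O (causes 0x%x / 0x%x)\n",
               (unsigned)local_cause, (unsigned)global_cause);
}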
2) With our Lustre filesystem, I would expect way more than 100 MB/s
for such collective I/O (at least around 1 GB/s).
I have to ask this, but are you sure your stripe size and count are set
to something large? The default stripe count is usually 1 or 2, which
kills performance when writing large amounts of data.
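You can check what the output directory currently has with lfs getstripe.
If you prefer to set it from the code, ROMIO also honors the standard
striping hints when it creates the file; this would go into the same
MPI_Info setup as the cb_* sketch above (the values are placeholders, and
the hints only take effect at file creation):

/* Lustre striping hints, set on the info object before
 * H5Pset_fapl_mpio()/H5Fcreate(). Values are examples only. */
MPI_Info_set(info, "striping_factor", "16");     /* stripe count (number of OSTs) */
MPI_Info_set(info, "striping_unit", "4194304");  /* stripe size in bytes, e.g. 4 MB */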
Thanks,
Mohamad
Any tips on what might be going on?
Thanks,