Re: [Hdf-forum] Poor performance with PHDF5

Mohamad Chaarawi Wed, 24 Apr 2013 06:58:16 -0700

Hi Maxime,

On 4/24/2013 8:45 AM, Maxime Boissonneault wrote:

Hi Mohamad,
I did further testing yesterday, varying the stripe count of my output file and switching between collective and independent I/O, but first let me answer your questions.
Are you saying that the master node writes its data and all the data of the other ranks? Or are you saying that there is a bug where only the master node writes its data and the other ranks' data never gets written to the file? (I assume it's the former.)
But yes, that shouldn't happen. Is the default number of aggregators that OpenMPI sets in ROMIO 1? BTW, how did you determine that only the master node is writing data? Did you add printfs in MPI_File_write_at_all?
I am saying that the master node writes the data for all the other ranks. I determined this by monitoring the nodes' IOPS and reads/writes per second through our Ganglia.
HDF5 just calls into MPI-I/O with the data to be written, so the MPI-I/O library selects the number of aggregators (writers). Could you set cb_nodes to something like, I don't know, 128 and try that (you can vary that to better tune your I/O)? You can set that through the info object you pass to H5Pset_fapl_mpio().
Also set cb_buffer_size to something like your Lustre stripe size.
I will look at this further and do some testing with those parameters.

OK. I would vary cb_nodes between 16, 32, 64 and 128 just to see which is ideal for your application/file system combination.
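
For reference, a minimal C sketch of passing these hints through the info object given to H5Pset_fapl_mpio(). The helper names make_cb_hints and make_fapl, the struct hint type, and the WITH_PHDF5 build macro are made up for illustration. The hint keys cb_nodes and cb_buffer_size are the ROMIO ones named above; whether a given MPI-I/O implementation honors them is up to that implementation. Error checking is omitted.

```c
#include <stdio.h>

/* One MPI-I/O hint as a key/value pair. MPI_Info_set() takes both as
 * strings, so numeric settings have to be formatted first.
 * (Hypothetical type, for illustration only.) */
struct hint {
    char key[32];
    char value[32];
};

/* Fill out[] with the two collective-buffering hints discussed above:
 *   cb_nodes       - number of aggregator (writer) ranks
 *   cb_buffer_size - aggregator buffer size in bytes
 * Returns the number of hints written (always 2). */
static int make_cb_hints(int cb_nodes, long cb_buffer_bytes, struct hint out[2])
{
    snprintf(out[0].key, sizeof out[0].key, "cb_nodes");
    snprintf(out[0].value, sizeof out[0].value, "%d", cb_nodes);
    snprintf(out[1].key, sizeof out[1].key, "cb_buffer_size");
    snprintf(out[1].value, sizeof out[1].value, "%ld", cb_buffer_bytes);
    return 2;
}

#ifdef WITH_PHDF5 /* assumed macro: define it when building against parallel HDF5 */
#include <mpi.h>
#include <hdf5.h>

/* Build a file access property list carrying the hints.
 * The caller closes the returned list with H5Pclose(). */
static hid_t make_fapl(MPI_Comm comm, int cb_nodes, long cb_buffer_bytes)
{
    struct hint h[2];
    int n = make_cb_hints(cb_nodes, cb_buffer_bytes, h);

    MPI_Info info;
    MPI_Info_create(&info);
    for (int i = 0; i < n; i++)
        MPI_Info_set(info, h[i].key, h[i].value);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info);

    /* The property list keeps its own copy of the info object. */
    MPI_Info_free(&info);
    return fapl;
}
#endif
```

With a parallel HDF5 build, compiling with something like h5pcc -DWITH_PHDF5 enables the second half; make_fapl(MPI_COMM_WORLD, 64, 1048576L) then returns a property list to pass to H5Fcreate(). Rerunning with 16, 32, 64 and 128 for cb_nodes gives the sweep suggested above.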

I am expecting two things that I don't see happening:
1) With collective I/O, I would expect all ranks to write.
This is not correct. All ranks should write in the HDF5 library, but not all ranks should write in MPI-I/O. Depending on the collective algorithm (like two-phase), a subset of ranks will actually write the data (cb_nodes ranks).
I rather meant that I would expect all nodes to write (maybe not all ranks).

It doesn't have to be that either. It totally depends on the access pattern of your ranks in the application. I don't think the current two_phase in ROMIO takes into account rank placement on nodes, but just how much each rank is writing and how many ranks you have.

2) With our Lustre filesystem, I would expect way more than 100 MB/s for such collective I/O (at least around 1 GB/s).
The initial numbers were with a stripe count of 1.


Yes, I would definitely increase that.

I did some more testing while varying the stripe count, on two different filesystems:
- One has 8 targets and is idle (our test filesystem)
- One has 64 targets and is more or less busy.
I was writing with 16 nodes, 128 MPI ranks. With collective I/O, I obtained the following rates (sc = stripe count):
FS with 64 targets:
sc = 1 : 171 ± 13 MB/s
sc = 8 : 937 ± 34 MB/s
sc = -1 : 1102 ± 19 MB/s


What is -1 here?


FS with 8 targets:
sc = 1 : 249 ± 4 MB/s
sc = 8 : 1218 ± 47 MB/s


OK, this sounds more reasonable now (with a larger sc).

With independent I/O, I obtained the following rates:
FS with 64 targets:
sc = 1 : 240 ± 12 MB/s
sc = 8 : 1362 ± 79 MB/s
sc = -1 : 948 ± 48 MB/s

FS with 8 targets:
sc = 1 : 581 ± 7 MB/s
sc = 8 : 2700 ± 200 MB/s
The error bar that I give is the standard deviation over 3 runs. The stripe size was left at 1 MB, which is aligned with our RAID blocks. I also did testing with 8 nodes (64 MPI ranks) and obtained very similar rates.
What puzzles me is that independent I/O performs either as well as or much better than collective I/O. Maybe this has to do with the cb_nodes parameter.

Yes, if only 1 rank is chosen as an aggregator in ROMIO for collective I/O, this is definitely the issue you are seeing. Increasing that should get you better results.


Thanks,
Mohamad


Thanks again for your reply.

Best regards,


