On 11/13/2014 09:55 AM, Albert Cheng wrote:
Rob,

I found your explanation very helpful.
Are there documents listing all the hints recognized by IBM and/or ROMIO?

Well... sort of

IBM hints: (wow, these are hard to google! -- here's an older set of documentation)

http://www-01.ibm.com/support/knowledgecenter/SSFK3V_1.3.0/com.ibm.cluster.pe.v1r3.pe500.doc/am107_ifopen.htm?lang=en

Cray hints: the intro_mpi man page on whatever system you are on is the authoritative one, but you can find web copies of older versions like this one:

https://fs.hlrs.de/projects/craydoc/docs/man/xe_mptm/51/cat3/intro_mpi.3.html

ROMIO hints:
http://www.mcs.anl.gov/research/projects/romio/doc/users-guide/node6.html

Open MPI hints: same as ROMIO hints, unless you are using OMPIO, in which case I don't think any hints are supported (OMPIO uses MCA parameters instead)

> Also, can the xxx_size hints recognize something like “40MB” instead of “41943040”?
> (Of course, there is this ambiguity whether MB means 2^20 or 10^6.)

That certainly seems like a nice usability enhancement. I think some of the software engineering I did a couple of years ago should make this easier to implement... but it's probably not a huge priority, sorry.

https://trac.mpich.org/projects/mpich/ticket/2197
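
For the curious, here is a minimal sketch of what such parsing could look like (purely hypothetical: this is not anything ROMIO implements today, and treating "MB" as 2^20 rather than 10^6 is my assumption, exactly the ambiguity Albert points out):

program size_suffix_demo
    implicit none
    integer, parameter :: i8 = selected_int_kind(18)
    print *, parse_size('40MB')      ! prints 41943040
    print *, parse_size('4194304')   ! prints 4194304
contains
    ! Hypothetical helper: convert "40MB"-style hint values to bytes,
    ! treating KB/MB/GB as binary multipliers (2**10, 2**20, 2**30).
    function parse_size(str) result(bytes)
        character(len=*), intent(in) :: str
        integer(i8) :: bytes, mult
        character(len=len(str)) :: s
        integer :: n
        s = adjustl(str)
        n = len_trim(s)
        mult = 1
        if (n > 2) then
            select case (s(n-1:n))
            case ('KB', 'kb')
                mult = 2_i8**10
                n = n - 2
            case ('MB', 'mb')
                mult = 2_i8**20
                n = n - 2
            case ('GB', 'gb')
                mult = 2_i8**30
                n = n - 2
            end select
        end if
        read (s(1:n), *) bytes   ! plain numbers fall through unchanged
        bytes = bytes * mult
    end function parse_size
end program size_suffix_demo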

==rob




-Albert Cheng

On Nov 13, 2014, at 9:25 AM, Rob Latham <[email protected]> wrote:



On 11/13/2014 06:34 AM, Angel de Vicente wrote:


Thanks. I'm not sure whether these could be tuned a bit better, but with the
following hints the problem is gone on the two problematic clusters.
(For a given file size, one of the program's writing modes was taking
~200x longer; with these hints everything is back to normal, and the
problematic mode takes the same time as the others.)


You can pass anything you want for the "key": implementations will ignore hints 
they do not understand (the sketch at the end of this message shows how to check 
which hints actually took effect).  For the sake of anyone googling in the future, 
I will explain what, if anything, the hints you passed in do:


call MPI_Info_create(info, error)
call MPI_Info_set(info,"IBM_largeblock_io","true", error)

This hint is useful on IBM PE platforms and tells GPFS you are about to do 
large I/O.  Over time, this hint will become less useful: IBM is moving away 
from its own MPI-IO implementation and incorporating ROMIO.

call MPI_Info_set(info,"stripping_unit","4194304", error)

This one is probably the biggest help.  In collective I/O, ROMIO splits up the file into "file 
domains" and assigns those domains to a subset of processes called I/O aggregators.  When 
the "striping_unit" hint is set, ROMIO will align those file domains to that 
striping_unit.

Sometimes, like on Blue Gene, ROMIO will detect the file system block size for 
you, and this hint is not needed.  No harm in providing it, though.
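
To make the alignment concrete (illustrative numbers only): say four aggregators collectively write a 64 MiB region of a GPFS file whose block size is 4 MiB.  With striping_unit set to 4194304, ROMIO carves the region into four 16 MiB file domains whose boundaries fall on 4 MiB multiples, so every file system block is written by exactly one aggregator and no two aggregators contend for the same block's lock.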


CALL MPI_INFO_SET(info,"H5F_ACS_CORE_WRITE_TRACKING_PAGE_SIZE_DEF","524288",error)

I don't think this hint does anything: that looks like an HDF5 file access property name, not an MPI-IO hint key, so the MPI-IO layer will simply ignore it.

CALL MPI_INFO_SET(info,"ind_rd_buffer_size","41943040", error)
CALL MPI_INFO_SET(info,"ind_wr_buffer_size","5242880", error)
CALL MPI_INFO_SET(info,"romio_ds_read","disable", error)
CALL MPI_INFO_SET(info,"romio_ds_write","disable", error)

No harm here, but if you are going to disable data sieving (romio_ds_read and 
romio_ds_write), then there's no reason to tweak the independent read and write 
buffer sizes: ind_rd_buffer_size and ind_wr_buffer_size only control the size of 
the intermediate buffer that data sieving uses.

CALL MPI_INFO_SET(info,"romio_cb_write","enable", error)

Setting romio_cb_write to "enable" forces collective buffering on.  On many 
platforms (but not Blue Gene), ROMIO will otherwise look at the access pattern: 
if the pattern is not interleaved, ROMIO will not use collective buffering.  At 
today's scale, collective buffering is almost always a win, especially on GPFS 
when combined with striping_unit.

CALL MPI_INFO_SET(info,"cb_buffer_size","4194304", error)

this buffer size might actually be a bit small, depending on how much data you 
are writing/reading.  If you have memory to spare, increasing this value is 
often a good way to improve performance.
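
One last trick, tying into the point above about unknown keys being silently ignored: you can read the info object back from an open file to see which hints (and values) the implementation actually settled on.  A minimal sketch (the file name and the 16 MiB buffer value are placeholders, not recommendations):

program check_hints
    use mpi
    implicit none
    integer :: info, info_used, fh, nkeys, i, rank, ierr
    logical :: flag
    character(len=MPI_MAX_INFO_KEY) :: key
    character(len=MPI_MAX_INFO_VAL) :: value

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    call MPI_Info_create(info, ierr)
    call MPI_Info_set(info, "cb_buffer_size", "16777216", ierr)

    call MPI_File_open(MPI_COMM_WORLD, "out.dat", &
                       MPI_MODE_CREATE + MPI_MODE_WRONLY, info, fh, ierr)

    ! Ask the implementation which hints it is actually using:
    ! unknown or unsupported keys will simply not show up here.
    call MPI_File_get_info(fh, info_used, ierr)
    call MPI_Info_get_nkeys(info_used, nkeys, ierr)
    if (rank == 0) then
        do i = 0, nkeys - 1
            call MPI_Info_get_nthkey(info_used, i, key, ierr)
            call MPI_Info_get(info_used, key, MPI_MAX_INFO_VAL, value, flag, ierr)
            print *, trim(key), ' = ', trim(value)
        end do
    end if

    call MPI_Info_free(info_used, ierr)
    call MPI_Info_free(info, ierr)
    call MPI_File_close(fh, ierr)
    call MPI_Finalize(ierr)
end program check_hints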

For the moment, problem solved. Thanks a lot,

Tuning these stacks is honestly way harder than it should be.  Thanks for your 
persistence.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA



--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
