Rob, I found your explanation very helpful. Are there documents listing all the hints recognized by IBM and/or ROMIO? Also, can the xxx_size hints recognize something like “40MB” instead of “41943040”? (Of course, there is the ambiguity of whether MB means 2^20 or 10^6.)
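Whether the MPI-IO layer itself accepts suffixes is not answered in this thread. Until it is, the conversion can be done on the application side before the value is handed to MPI_Info_set. Below is a minimal sketch in Python, assuming the hints want a plain decimal byte count passed as a string (as in the values quoted below). The helper name `size_hint` and its `decimal` flag are hypothetical and not part of ROMIO, IBM's MPI, or HDF5.

```python
import re

# Hypothetical helper, not part of any MPI or HDF5 library.
# Assumption: size hints expect a plain decimal byte count as a string.
_SIZE_RE = re.compile(r"\s*(\d+)\s*([A-Za-z]*)\s*")
_POWERS = {"K": 1, "M": 2, "G": 3}


def size_hint(text, decimal=False):
    """Convert a size such as "40MB" into a byte-count string.

    By default KB/MB/GB are read as powers of 1024, which is the reading
    under which "40MB" equals 41943040. With decimal=True they are read as
    powers of 1000. KiB/MiB/GiB are always powers of 1024.
    """
    match = _SIZE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a size: {text!r}")
    count = int(match.group(1))
    unit = match.group(2).upper()
    if unit in ("", "B"):
        # Already a byte count; pass it through unchanged.
        return str(count)
    prefix, suffix = unit[0], unit[1:]
    if prefix not in _POWERS or suffix not in ("", "B", "IB"):
        raise ValueError(f"unknown unit in {text!r}")
    base = 1000 if (decimal and suffix != "IB") else 1024
    return str(count * base ** _POWERS[prefix])
```

With the default binary reading, `size_hint("40MB")` gives `"41943040"` and `size_hint("5MB")` gives `"5242880"`, the two buffer sizes used in the hints quoted below. Making the 2^20 versus 10^6 choice an explicit argument keeps the ambiguity out of the hint value itself.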
-Albert Cheng

On Nov 13, 2014, at 9:25 AM, Rob Latham <[email protected]> wrote:

> On 11/13/2014 06:34 AM, Angel de Vicente wrote:
>
>> thanks. I'm not sure if these could be tuned a bit better, but with the
>> following hints the problem is all gone in the two problematic clusters
>> (for a given file size, one of the writing modes of the program was
>> taking about 200x more time. With these hints all is back to normal,
>> and the problematic mode takes just the same time as the other ones).
>
> You can pass anything you want for the "key": implementations will ignore
> hints they do not understand. For the sake of anyone googling in the
> future, I will explain what, if anything, the hints you passed in do:
>
>> call MPI_Info_create(info, error)
>> call MPI_Info_set(info,"IBM_largeblock_io","true", error)
>
> this hint is useful for IBM PE platforms and tells GPFS you are about to do
> large I/O. Over time, this hint will become less useful: IBM is moving away
> from their own MPI-IO implementation and incorporating ROMIO.
>
>> call MPI_Info_set(info,"striping_unit","4194304", error)
>
> this one is probably the biggest help. In Collective I/O, ROMIO splits up
> the file into "file domains" (and assigns those domains to a subset of
> processors called I/O aggregators). When the "striping_unit" hint is set,
> ROMIO will align those file domains to that striping_unit.
>
> Sometimes, like on Blue Gene, ROMIO will detect the file system block size
> for you, and this hint is not needed. No harm in providing it, though.
>
>> CALL MPI_INFO_SET(info,"H5F_ACS_CORE_WRITE_TRACKING_PAGE_SIZE_DEF","524288",error)
>
> I don't think this hint does anything.
>
>> CALL MPI_INFO_SET(info,"ind_rd_buffer_size","41943040", error)
>> CALL MPI_INFO_SET(info,"ind_wr_buffer_size","5242880", error)
>> CALL MPI_INFO_SET(info,"romio_ds_read","disable", error)
>> CALL MPI_INFO_SET(info,"romio_ds_write","disable", error)
>
> No harm here, but if you are going to disable data sieving (romio_ds_read and
> romio_ds_write) then there's no reason to tweak the independent read and
> write buffer sizes.
>
>> CALL MPI_INFO_SET(info,"romio_cb_write","enable", error)
>
> On many platforms (but not Blue Gene), ROMIO will look at the access pattern.
> If the pattern is not interleaved, ROMIO will not use collective buffering.
> At today's scale, collective buffering is almost always a win, especially on
> GPFS when combined with striping_unit.
>
>> CALL MPI_INFO_SET(info,"cb_buffer_size","4194304", error)
>
> this buffer size might actually be a bit small, depending on how much data
> you are writing/reading. If you have memory to spare, increasing this value
> is often a good way to improve performance.
>
>> For the moment, problem solved. Thanks a lot,
>
> tuning these stacks is honestly way harder than it should be. thanks for your
> persistence.
>
> ==rob
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
> Twitter: https://twitter.com/hdf5
