Hi Leigh,

It is possible to implement the round-robin approach; in fact, this is
something we tried in the H5Part library for writing in MPI-IO
independent mode at high concurrency (>= 16,000 cores). It was
necessary in order to "throttle" the number of write requests hitting
the Lustre filesystem and avoid timeout errors. In our case, we used
8 groups of 2000 cores in the round-robin write.

I agree with Rob that you should try collective mode first, but if you
still want to pursue the round-robin approach, you can open the file on
all cores with the MPI-IO independent-mode VFD and then use the
pseudo-code you presented below, looping over MPI rank. Note that you
need a barrier inside the loop, and you will probably also need a
hyperslab selection based on rank before the write call.
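
Here is a rough sketch of that loop, assuming independent-mode
transfers, one writer at a time, and double data; dset_id, memspace,
filespace, offset, count, and data are placeholders for whatever your
code already sets up, and the file is assumed to be open on all ranks
with the MPI-IO VFD:

#include <hdf5.h>
#include <mpi.h>

/* Round-robin write: each rank writes its own hyperslab in turn,
 * with a barrier so only one rank hits the file at a time. */
static herr_t
round_robin_write (
        hid_t dset_id,
        hid_t memspace,
        hid_t filespace,
        const hsize_t *offset,  /* this rank's start in the file */
        const hsize_t *count,   /* this rank's number of elements */
        const double *data,
        MPI_Comm comm
        ) {

        int myproc, nprocs, r;
        herr_t status = 0;
        hid_t dxpl = H5Pcreate (H5P_DATASET_XFER);

        MPI_Comm_rank (comm, &myproc);
        MPI_Comm_size (comm, &nprocs);
        H5Pset_dxpl_mpio (dxpl, H5FD_MPIO_INDEPENDENT);

        for (r = 0; r < nprocs; r++) {
                if (r == myproc) {
                        H5Sselect_hyperslab (filespace, H5S_SELECT_SET,
                                offset, NULL, count, NULL);
                        status = H5Dwrite (dset_id, H5T_NATIVE_DOUBLE,
                                memspace, filespace, dxpl, data);
                }
                MPI_Barrier (comm); /* only rank r writes this pass */
        }
        H5Pclose (dxpl);
        return status;
}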

If you want only one file handle per node, you could use a
hybrid-parallel approach where there is only one MPI task on the node
and each core runs a separate pthread or OpenMP thread. Then you would
open the file once per node and loop over thread ID within each MPI
task.
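
A very rough sketch of that with OpenMP is below. An HDF5 library
built without thread-safety must only be entered by one thread at a
time, which the turn-taking loop guarantees; offsets, counts, and bufs
are hypothetical per-thread arrays describing each thread's 1-D slab
and buffer:

#include <hdf5.h>
#include <omp.h>

/* One MPI task per node opens the file once; its OpenMP threads then
 * take turns writing their own hyperslab, one thread at a time. */
void
threaded_write (
        hid_t dset_id,
        hid_t filespace,
        hid_t dxpl,
        const hsize_t *offsets,  /* offsets[t] = thread t's file offset */
        const hsize_t *counts,   /* counts[t]  = thread t's element count */
        double **bufs            /* bufs[t]    = thread t's data */
        ) {

        #pragma omp parallel
        {
                int tid = omp_get_thread_num ();
                int nthreads = omp_get_num_threads ();
                int t;
                for (t = 0; t < nthreads; t++) {
                        if (t == tid) {
                                hid_t mspace = H5Screate_simple (
                                        1, &counts[tid], NULL);
                                H5Sselect_hyperslab (filespace,
                                        H5S_SELECT_SET,
                                        &offsets[tid], NULL,
                                        &counts[tid], NULL);
                                H5Dwrite (dset_id, H5T_NATIVE_DOUBLE,
                                        mspace, filespace, dxpl,
                                        bufs[tid]);
                                H5Sclose (mspace);
                        }
                        #pragma omp barrier /* one thread in HDF5 at a time */
                }
        }
}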

In our case, every core ran an MPI task, so we passed a token between
ranks using point-to-point send/receives. I have pasted some code
snippets of this below.

Mark

-----

h5part_int64_t
_H5Part_start_throttle (
        H5PartFile *f
        ) {

        if (f->throttle > 0) {
                int ret;
                int token = 1;
                _H5Part_print_info (
                        "Throttling with factor = %d",
                        f->throttle);
                if (f->myproc / f->throttle > 0) {
                        _H5Part_print_debug_detail (
                                "[%d] throttle: waiting on token from %d",
                                f->myproc, f->myproc - f->throttle);
                        // wait to receive token before continuing with read
                        ret = MPI_Recv(
                                &token, 1, MPI_INT,
                                f->myproc - f->throttle, // receive from previous proc
                                f->myproc, // use this proc id as message tag
                                f->comm,
                                MPI_STATUS_IGNORE
                                );
                        if ( ret != MPI_SUCCESS ) return HANDLE_MPI_SENDRECV_ERR;
                }
                _H5Part_print_debug_detail (
                        "[%d] throttle: received token",
                        f->myproc);
        }
        return H5PART_SUCCESS;
}

h5part_int64_t
_H5Part_end_throttle (
        H5PartFile *f
        ) {

        if (f->throttle > 0) {
                int ret;
                int token = 1; // value is irrelevant; only its arrival matters
                if (f->myproc + f->throttle < f->nprocs) {
                        // pass token to next proc
                        _H5Part_print_debug_detail (
                                "[%d] throttle: passing token to %d",
                                f->myproc, f->myproc + f->throttle);
                        ret = MPI_Send(
                                &token, 1, MPI_INT,
                                f->myproc + f->throttle, // send to next proc
                                f->myproc + f->throttle, // use the id of the target as tag
                                f->comm
                                );
                        if ( ret != MPI_SUCCESS ) return HANDLE_MPI_SENDRECV_ERR;
                }
        }
        return H5PART_SUCCESS;
}
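
The throttle calls then just wrap the actual write: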


#ifdef PARALLEL_IO
        herr = _H5Part_start_throttle ( f );
        if ( herr < 0 ) return herr;
#endif

        herr = H5Dwrite (
                dataset_id,
                type,
                f->memshape,
                f->diskshape,
                f->xfer_prop,
                array );

#ifdef PARALLEL_IO
        herr = _H5Part_end_throttle ( f );
        if ( herr < 0 ) return herr;
#endif
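
For completeness, if you end up going with the collective mode that
Rob recommends below, the setup is just a couple of property-list
calls, roughly as follows (the file name and communicator here are
placeholders):

        /* open/create the file with the MPI-IO VFD on all ranks */
        hid_t fapl = H5Pcreate (H5P_FILE_ACCESS);
        H5Pset_fapl_mpio (fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file_id = H5Fcreate ("output.h5", H5F_ACC_TRUNC,
                H5P_DEFAULT, fapl);

        /* ask for collective transfers on the writes */
        hid_t dxpl = H5Pcreate (H5P_DATASET_XFER);
        H5Pset_dxpl_mpio (dxpl, H5FD_MPIO_COLLECTIVE);

        /* every rank selects its hyperslab, then all ranks call
         * H5Dwrite with this dxpl */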


On Thu, Dec 9, 2010 at 10:55 AM, Rob Latham <[email protected]> wrote:
> On Wed, Dec 08, 2010 at 03:38:42PM -0700, Leigh Orf wrote:
>> I am playing with different I/O strategies for massively parallel multicore
>> systems. Let's say I have a MPI communicator which represents a subset of
>> cores (such as the cores on a SMP chip) and I want the root rank on that
>> communicator to create the file, but I want all of the other cores to write
>> to that file in a sequential (round-robin) manner.
>
> It sounds to me like you are maybe over-thinking this problem.  You've
> got an MPI program, so you've got an MPI library, and that MPI library
> contains an MPI-IO implementation.  It's likely that MPI-IO
> implementation will take care of these problems for you.
>
>> ...
>
>> My question is: can I do the round-robin write part without having to open
>> and close (h5fopen_f / h5fclose_f) the file for each core?
>
> No, not at the HDF5 level.  If you are on a file system where that is
> possible, the MPI-IO library will do that for you.
>
>> It seems that file_id (and the other id's like dset_id, dspace_id
>> etc.) carries with it something tangible that links that id to the
>> file itself. I don't think I can get away with doing MPI_BCAST of
>> the file_id (and the other id variables) to the other cores and use
>> those handles to magically access the file created on the root
>> core... right? I am trying to avoid the overhead of opening,
>> seeking, closing for each core.
>
> Here's what I'd suggest: structure your code for parallel HDF5 with
> MPI-IO support and collective I/O enabled.  Then the library,
> whenever possible, will basically do the sorts of optimizations you're
> thinking about.  You do have to, via property lists, explicitly enable
> MPI-IO support and collective I/O.
>
> ==rob
>
> --
> Rob Latham
> Mathematics and Computer Science Division
> Argonne National Lab, IL USA
>
