Rob

Thanks very much for this info. I've been reading the manuals and getting up to 
speed with the system. I've set some benchmarks running for parallel I/O using 
multiple datasets, compound datatypes, etc. 
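For reference, a minimal sketch of how MPI-IO hints such as cb_nodes (the number of collective-buffering aggregators) can be passed through HDF5's file-access property list — the filename and the value 64 are placeholders, and whether a given hint is honoured depends on the MPI build:

```c
/* Sketch: passing MPI-IO hints into HDF5's parallel (mpio) driver.
   "cb_nodes" is a standard ROMIO hint; 64 is a placeholder value. */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_nodes", "64");     /* number of I/O aggregators */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);  /* MPI-IO VFD + hints */

    hid_t file = H5Fcreate("bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... create datasets, write with a collective transfer property
       list (H5Pset_dxpl_mpio with H5FD_MPIO_COLLECTIVE) ... */
    H5Fclose(file);

    H5Pclose(fapl);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```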

When you say ...

> More generally, I've found that some of the default MPI-IO settings are
> probably not ideal for /Q, and have tested/suggested a change to the
> "number of I/O aggregators" defaults.

Do you mean aggregators inside ROMIO, or in GPFS itself? I was under the 
impression that on BG/Q machines (which is what I'm targeting), the I/O was 
shipped to the I/O nodes, which performed aggregation anyway. This is what I was 
referring to when I said "shuffling data twice" - there's no point in HDF5/MPI-IO 
performing collective I/O if that task is already being done by the OS. Am I to 
understand that the I/O nodes don't natively do a very good job of it and need 
some assistance?
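(Assuming the aggregator count is exposed as ROMIO's cb_nodes hint, it can also be overridden without touching code via a hints file named by the ROMIO_HINTS environment variable — a sketch, with 64 as an arbitrary placeholder:

```
cb_nodes 64
romio_cb_write enable
```

i.e. export ROMIO_HINTS=/path/to/that/file before launching the job.)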

thanks

JB

> -----Original Message-----
> From: Rob Latham [mailto:[email protected]]
> Sent: 26 August 2013 16:38
> To: Biddiscombe, John A.
> Cc: HDF Users Discussion List
> Subject: Re: HDF5 and GPFS optimizations
> 
> On Mon, Aug 26, 2013 at 06:15:30AM +0000, Biddiscombe, John A. wrote:
> > Rob,
> >
> > Did you make any significant discoveries/progress regarding the GPFS
> tweaks on BG systems. Our machine will be open for use within the next
> week or so and I'd like to begin some profiling. I'd be interested in knowing
> if you have discovered any useful facts that I ought to know about.
> 
> An upcoming driver update (I don't know which one) will allow the Blue
> Gene compute nodes to send the gpfs_fcntl commands all the way through
> to the GPFS file system (presently the gpfs_fcntl commands return "not
> supported").  Then, we can do some experiments to see if they still provide
> any benefit at Blue Gene scales (the optimizations are 15 years old at this
> point, designed when a "massively parallel system" was
> 32 nodes).
> 
> More generally, I've found that some of the default MPI-IO settings are
> probably not ideal for /Q, and have tested/suggested a change to the
> "number of I/O aggregators" defaults.
> 
> Meanwhile, ALCF (the folks who operate the machine) have been working
> with IBM to improve the state of collective I/O.  Seems like we're making
> some progress there as well.
> 
> > I'm concerned about how much the --enable-gpfs option is able to
> > 'know' about the system (can we easily find out what the option
> > does?). According to my superficial understanding of the BG
> > architecture, it seems that since the compute nodes have IO calls
> > forwarded off to the IO nodes by kernel level routines, collective
> > operations performed by hdf5 might actually reduce the effectiveness
> > of the IO by forcing the data to be shuffled around twice instead of
> > once. Am I thinking along the right lines?
> 
> The --enable-gpfs option will attempt to do a few things:
> 
> gpfs_access_range
> gpfs_free_range
> 
> This is the "multiple access range" hint, which tells GPFS "hey, don't grab a
> lock on the whole file; instead, just these sections".  I
> *think* this is going to be one of the better improvements remaining.
> 
> gpfs_clear_file_cache
> gpfs_invalidate_file_cache
> 
> Good for benchmarking.  Ejects all entries from the GPFS page pool.
> 
> gpfs_cancel_hints
> 
> Just resets things.
> 
> gpfs_start_data_shipping
> gpfs_start_data_ship_map
> gpfs_stop_data_shipping
> 
> Unfortunately, GPFS-3.5 does not support data shipping any longer.
> 
> I still think these hints need to be implemented in the MPI-IO library, if 
> they
> still help at all, but if one is being pragmatic one might more easily deploy
> the hints through HDF5.
> 
> ==rob
> 
> --
> Rob Latham
> Mathematics and Computer Science Division Argonne National Lab, IL USA

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org