Mark,

On Mon, Apr 11, 2011 at 3:35 PM, Mark Miller <[email protected]> wrote:
> On Mon, 2011-04-11 at 13:32, Leigh Orf wrote:
> > If you look at some of my recent posts on this list, you'll find that
> > I am having the same problem with collective I/O with Lustre, trying
> > to have 30,000 ranks write to one file using collective pHDF5 I/O
> > (with or without chunking, I still get bad performance).
> >
> > In fact, I have given up pursuing this approach and am now trying the
> > core driver with serial HDF5, which lets you do buffered I/O.
>
> Hopefully, you have enough memory to write the whole file to core in all
> use cases you are interested in.
>
> > For my problem, I am able to buffer over 50 writes to memory before
> > data needs to be flushed to disk.
>
> None of the serial VFDs currently do anything 'special' in the way of
> buffering I/O requests from HDF5 to the underlying filesystem. The stdio
> VFD may offer more, as it will rely upon whatever buffering the
> implementation of stdio on top of your filesystem does.

Understand that I just discovered the ability to do buffered I/O with HDF5; I wasn't aware of the core serial driver until Friday! My type of problem is characterized by a very small memory footprint per core for the simulation code, and frequent writes of the model state to disk. By simply using the core driver I can reduce the frequency of hitting the disk by a factor of 50-100, which is huge; at least with kraken/Lustre, I have found that the less I/O you do, the better.

> > However, you are stuck with 1-file-per-core with this method, and each
> > file will contain multiple time levels - but I can happily deal with
> > this if I/O is respectable.
>
> There is no reason you have to write a file per mpi-task this way.
> Certainly, it's simplest to do. But it's almost as simple to collect data
> from different MPI tasks into common files.
>
> I've attached a header file for a simple interface (pmpio.h) that allows
> you to run on, say, 100,000 mpi tasks but write to, say, just 128 files,
> or any number you pick at run time. I've attached an example of how the
> simple pmpio interface is used to do it.

I am going to look carefully at your code. At first glance, it appears to be a similar approach to what I have tried, but in my case I created new MPI communicators which spanned any number of cores (although that number has to divide evenly into the full problem, unlike with your approach). In my case, each subcommunicator would use pHDF5 collective calls to concurrently write to its own file, and I could choose the number of files. I still had lousy performance with all my choices of number of files.

It is not entirely clear to me that you are doing true collective parallel HDF5 (where I have had problems, but which I have been led to believe is a path to happiness), as you do not call h5pset_dxpl_mpio and set the H5FD_MPIO_COLLECTIVE flag. You also do not construct a property list and pass it to h5dwrite, instructing each I/O core to write its own piece of an HDF5 file using offset arrays, h5sselect_hyperslab calls, etc., which is what the examples I have found led me to. It seems you are effectively doing serial HDF5 in parallel, which is what I am leaning towards at this point.

Your approach is more elegant than mine, but I am (a) stuck with Fortran and (b) not a programmer by training, although C is my preferred language for I/O. I am not sure I could call your code from Fortran easily without going through contortions (again, forgive me; I am a weather guy who pretends he is a programmer).
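Just so it is clear what I mean by the collective pattern the examples led me to, it boils down to roughly the C sketch below (my real code is Fortran; the one-dimensional dataset, its name, and the per-rank counts and offsets are only placeholders):

  /* Rough sketch of the collective write pattern described above.
     Assumes HDF5 built with parallel support. */
  #include <mpi.h>
  #include <hdf5.h>

  void write_collective(MPI_Comm comm, const char *fname, const double *u,
                        hsize_t nglobal, hsize_t nlocal, hsize_t offset)
  {
      /* File access property list set up for MPI-IO */
      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
      hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

      /* Global (file) dataspace and this rank's hyperslab within it */
      hid_t filespace = H5Screate_simple(1, &nglobal, NULL);
      hid_t dset = H5Dcreate(file, "u", H5T_NATIVE_DOUBLE, filespace,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
      H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL,
                          &nlocal, NULL);
      hid_t memspace = H5Screate_simple(1, &nlocal, NULL);

      /* Dataset transfer property list: this is the collective part */
      hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
      H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

      H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, u);

      H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
      H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
  }

Every rank in the communicator has to take part in the H5Dcreate and H5Dwrite calls for this to work; only the hyperslab selection differs from rank to rank.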
I fully embraced parallel HDF5 because I thought it could give me all the flexibility I needed to essentially tune the number of files I wrote, giving me the option of anywhere from 1 to ncores. While I succeeded in attaining that flexibility, I have had awful luck with performance with 30k cores on kraken/Lustre. It is very possible that there is a solution I haven't found, or that I am doing something stupid, but I have spent enough time on this and tried enough things that I am ready to try something new, and the core driver currently has me very interested. As does your approach.

> > I have not yet benchmarked performance with this buffered I/O approach
> > on kraken (100,000 core machine with Lustre) but I will soon. At least
> > with serial HDF5 I don't have to worry about exactly what's going on
> > at the MPI layer, which makes it difficult to debug. And you can easily
> > do compression, among other things.
>
> Indeed!
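In case it is useful to anyone else following along, the core-driver setup I am experimenting with amounts to roughly the C sketch below (serial HDF5, one file per rank; the 64 MB increment, the file naming, and the dataset layout are just placeholders). With backing_store turned on, the data does not hit the filesystem until the file is closed:

  /* Sketch of the buffered "core" VFD approach: writes accumulate in an
     in-memory file image and are flushed to disk only at H5Fclose(). */
  #include <hdf5.h>
  #include <stdio.h>

  void write_buffered(int rank, int nsteps, const double *state, hsize_t npts)
  {
      char fname[256];
      snprintf(fname, sizeof(fname), "model_rank%05d.h5", rank);

      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      /* Grow the in-memory image in 64 MB increments; backing_store = 1
         means the image is written to fname when the file is closed. */
      H5Pset_fapl_core(fapl, 64 * 1024 * 1024, 1);

      hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
      hid_t space = H5Screate_simple(1, &npts, NULL);

      for (int step = 0; step < nsteps; step++) {
          /* In the real code the model advances between writes */
          char dname[64];
          snprintf(dname, sizeof(dname), "state_%04d", step);
          hid_t dset = H5Dcreate(file, dname, H5T_NATIVE_DOUBLE, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
          /* This write lands in memory, not on the filesystem */
          H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT,
                   state);
          H5Dclose(dset);
      }

      H5Sclose(space);
      H5Pclose(fapl);
      H5Fclose(file);   /* the whole in-memory image hits the disk here */
  }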
> > I will be closely monitoring this thread in case you are able to
> > find a solution, as I am still interested in getting collective pHDF5
> > to work with many cores on Lustre.
> >
> > Leigh
> >
> > On Tue, Apr 5, 2011 at 1:42 PM, Biddiscombe, John A. <[email protected]> wrote:
> > >>> Have you tried using the MPI-POSIX VFD for independent access?
> > >
> > >> Thanks - I'll report back with further findings
> > >
> > > Rubbish! I still only get 150MB/s with the mpiposix driver.
> > >
> > > As Queen Victoria would have said, "we are not amused".
> > >
> > > I suspect I've got an error somewhere, because something should have changed.
> > >
> > > JB
> > >
> > > -----Original Message-----
> > > From: [email protected] [mailto:[email protected]] On Behalf Of Mark Howison
> > > Sent: 05 April 2011 17:19
> > > To: HDF Users Discussion List
> > > Subject: Re: [Hdf-forum] Chunking : was Howison et al Lustre mdc_config
> > >
> > > Hi John,
> > >
> > > What platform and parallel file system is this on? Have you tried
> > > using the MPI-POSIX VFD for independent access?
> > >
> > > Thanks,
> > > Mark
> > >
> > > On Tue, Apr 5, 2011 at 11:16 AM, Biddiscombe, John A. <[email protected]> wrote:
> > >> Elena,
> > >>
> > >> I was just replying to myself when your email came in.
> > >> I knocked up a quick wrapper to enable testing of the cache eviction
> > >> stuff, so I'm happy (ish).
> > >>
> > >> However: I'm seeing puzzling behaviour with chunking.
> > >> Using collective IO (tweaking various params) I see transfer rates
> > >> from 1GB/s to 3GB/s depending on stripe size, number of cb_nodes, etc.
> > >>
> > >> However, when using chunking with independent IO, I set the stripe size
> > >> to 6MB, chunk dims to match the 6MB, each node is writing 6MB, alignment
> > >> is set to 6MB intervals, and I've followed all the tips I can find. I see
> > >> (for 512 nodes writing 6MB each = 3GB total) a max throughput of around
> > >> 150MB/s.
> > >>
> > >> This is shockingly slow compared to collective IO, and I'm quite
> > >> surprised. I've been playing with this for a few days now and my general
> > >> impression is that
> > >>   chunking = rubbish
> > >>   collective = nice
> > >>
> > >> I did not expect chunking to be so bad compared to collective (which
> > >> is a shame, as I was hoping to use it for compression etc).
> > >>
> > >> Can anyone suggest further tweaks that I should be looking out for.
> > >> (One thing, for example, that seems to make no difference is the
> > >> H5Pset_istore_k(fcpl, btree_ik) stuff. I still don't quite know what the
> > >> correct value for btree_ik is. I've read the man page, but I'm puzzled as
> > >> to the correct meaning. If I know there will be 512 chunks, what is the
> > >> 'right' value of btree_ik?)
> > >>
> > >> Any clues gratefully received for optimizing chunking. I hoped the
> > >> thread about 30,000 processes would carry on, as I found it interesting
> > >> to follow.
> > >>
> > >> ttfn
> > >>
> > >> JB
> > >>
> > >> -----Original Message-----
> > >> From: [email protected] [mailto:[email protected]] On Behalf Of Elena Pourmal
> > >> Sent: 05 April 2011 17:01
> > >> To: HDF Users Discussion List
> > >> Subject: Re: [Hdf-forum] Question re: Howison et al Lustre mdc_config tuning recommendations
> > >>
> > >> Hi John,
> > >> On Apr 5, 2011, at 4:21 AM, Biddiscombe, John A. wrote:
> > >>
> > >>>>>> H5AC_cache_config_t mdc_config;
> > >>>>>> hid_t file_id;
> > >>>>>> file_id = H5Fopen("file.h5", H5F_ACC_RDWR, H5P_DEFAULT);
> > >>>>>> mdc_config.version = H5AC__CURR_CACHE_CONFIG_VERSION;
> > >>>>>> H5Fget_mdc_config(file_id, &mdc_config);
> > >>>>>> mdc_config.evictions_enabled = 0 /* FALSE */;
> > >>>>>> mdc_config.incr_mode = H5C_incr__off;
> > >>>>>> mdc_config.decr_mode = H5C_decr__off;
> > >>>>>> H5Fset_mdc_config(file_id, &mdc_config);
> > >>>
> > >>> I couldn't find fortran bindings for these. Do they exist in any
> > >>> recent releases or svn branches?
> > >>>
> > >> Fortran wrappers for these do not exist. Please let us know which Fortran
> > >> wrappers you need and we will add them to our to-do list.
> > >>
> > >> Elena
> > >>
> > >>> thanks
> > >>>
> > >>> JB
> > >>>
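As an aside on the chunking discussion quoted above: as far as I can tell, the setup John describes (6 MB stripe-matched chunks, 6 MB alignment, and the istore_k/btree_ik experiment) corresponds to property-list calls roughly like the sketch below; the dataset shape, sizes, and names are only illustrative:

  /* Sketch of the chunked + aligned setup described above: chunks sized
     to match a 6 MB Lustre stripe, 6 MB alignment, and the chunk B-tree
     'ik' parameter on the file creation property list. */
  #include <mpi.h>
  #include <hdf5.h>

  #define SIX_MB (6 * 1024 * 1024)

  hid_t create_chunked(const char *fname, MPI_Comm comm, unsigned btree_ik,
                       hsize_t nglobal_elems, hsize_t chunk_elems)
  {
      hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
      H5Pset_istore_k(fcpl, btree_ik);      /* chunk-index B-tree 1/2 rank */

      hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
      H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
      /* Align every object of 1 byte or more on 6 MB (stripe) boundaries */
      H5Pset_alignment(fapl, 1, SIX_MB);

      hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, fcpl, fapl);

      /* One chunk per writer; chunk_elems would be chosen so each chunk
         spans one stripe, e.g. SIX_MB / sizeof(double) elements */
      hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
      H5Pset_chunk(dcpl, 1, &chunk_elems);

      hid_t space = H5Screate_simple(1, &nglobal_elems, NULL);
      hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

      H5Pclose(fcpl); H5Pclose(fapl); H5Pclose(dcpl); H5Sclose(space);
      H5Dclose(dset);
      return file;  /* each rank then writes its own chunk */
  }

All of these settings are orthogonal to whether the actual H5Dwrite calls are made independently or collectively.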
--
Leigh Orf
Associate Professor of Atmospheric Science
Department of Geology and Meteorology
Central Michigan University
Currently on sabbatical at the National Center for
Atmospheric Research in Boulder, CO
NCAR office phone: (303) 497-8200
