Hi Leigh and Mark, I'm just getting caught up on this thread now and wanted
to add a few comments:

1) The idea of passing off a baton to serialize access to a file is
something we tried in H5Part when we ran into a problem at around 16K
concurrency, where Lustre would actually *time out* while trying to service
that many independent writes to a shared file. This worked pretty well. We
used a token-passing mechanism to write in batches of 2000 (see the sketch
after these comments); the PMPIO idea takes this even further, executing
the writes in batches equal to the number of files.

2) On a Lustre file system with N OSTs, you could conceivably get the full
bandwidth of the file system by writing into N separate files. From
Lustre's perspective, this looks like file-per-proc access, but in the end
you have many fewer files, which is of course much better from a data
management and post-analysis standpoint.

3) I also agree that collective buffering is mostly useful for rectilinear
grids, although I'll point out that it is also good for variable-length 1D
arrays, like the flattened 1D AMR meshes from the CHOMBO benchmark that we
tested in the HDF5/Lustre paper. These actually performed pretty well in
collective mode. The trade-off with collective mode on the Cray is that it
is nearly impossible to get the full bandwidth of the file system because
of the synchronization in the aggregation phase (MPI_Gather). One
optimization is to do something more sophisticated, where the aggregators
are broken into two subsets that overlap gathering and writing. Of course,
however it is implemented, collective buffering does impose some degree of
synchronization that you can avoid with independent access or PMPIO.
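The shape of that throttling was roughly the following. This is only a
sketch, not the actual H5Part code; K and do_my_write() stand in for the
real batch width and for whatever independent write each rank performs:

/* Gist of the token-passing throttle: at most K ranks are in their write
 * phase at once. Rank r waits for a token from rank r-K, writes, then
 * passes the token to rank r+K. A sketch only, not the actual H5Part code. */
#include <mpi.h>

static void do_my_write(int rank)
{
    (void)rank;  /* the independent write to the shared file goes here */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int K = 2000;   /* batch width: maximum number of concurrent writers */
    int token = 1;

    /* Ranks beyond the first batch wait for their turn. */
    if (rank >= K)
        MPI_Recv(&token, 1, MPI_INT, rank - K, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    do_my_write(rank);

    /* Let the next rank in line start writing. */
    if (rank + K < nprocs)
        MPI_Send(&token, 1, MPI_INT, rank + K, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

The net effect is simply that no more than K ranks are inside their write
phase at any moment.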
Mark

On Tue, Apr 12, 2011 at 3:11 PM, Leigh Orf <[email protected]> wrote:
>
> On Tue, Apr 12, 2011 at 11:57 AM, Mark Miller <[email protected]> wrote:
>>
>> On Tue, 2011-04-12 at 10:39 -0700, Leigh Orf wrote:
>>
>> > Understand that I just discovered the ability to do buffered I/O with
>> > hdf5. I wasn't aware of the core serial driver until Friday!
>>
>> Yeah, there are a lot of interesting dark corners of the HDF5 library
>> that are useful to know about. The core driver is definitely one of
>> them. That has saved my behind a few times when we've been in a bind on
>> performance.
>
> Indeed. Write operations go pretty fast when there is no actual disk
> access!
>
>> > I am going to look carefully at your code. At first glance, it
>> > appears to be a similar approach to what I have tried, but in my case
>> > I created new MPI communicators which spanned any number of cores
>> > (though the number has to divide evenly into the full problem, unlike
>> > with your approach). In my case, each subcommunicator would use pHDF5
>> > collective calls to concurrently write to its own file, and I could
>> > choose the number of files. I still had lousy performance with all my
>> > choices of number of files.
>> >
>> > It is not entirely clear to me that you are doing true collective
>> > parallel HDF5 (where I have had problems but have been led to believe
>>
>> That's right. There is NOTHING I/O-wise that is parallel. That code is
>> designed to work with SERIAL compiled HDF5. The only parallel parts are
>> the file management to orchestrate parallel I/O to multiple files
>> concurrently. It is the 'Poor Man's' approach to parallel I/O. It is
>> described a bit in the pmpio.h header file and more here...
>>
>> http://visitbugs.ornl.gov/projects/hpc-hdf5/wiki/Poor_Man's_vs_Rich_Mans'_Parallel_IO
>>
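To make that pattern concrete: stripped of the real pmpio.h machinery, the
baton idea comes down to something like the sketch below. It assumes a
serial HDF5 build plus plain MPI for the hand-off; the number of files, the
file and group names, and the payload are made up for illustration.

/* Gist of the "Poor Man's" baton pattern: ranks are grouped onto a small
 * number of files; within each group, ranks take turns opening the file
 * with *serial* HDF5, writing their own group, and handing the baton on.
 * Illustrative sketch only, not the real pmpio.h interface. */
#include <mpi.h>
#include <hdf5.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int nfiles = 8;              /* number of files, not number of ranks */
    int group = rank % nfiles;         /* which file this rank writes to */
    MPI_Comm filecomm;                 /* one communicator per file */
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &filecomm);
    int frank, fsize;
    MPI_Comm_rank(filecomm, &frank);
    MPI_Comm_size(filecomm, &fsize);

    char fname[64], gname[64];
    snprintf(fname, sizeof fname, "part_%03d.h5", group);
    snprintf(gname, sizeof gname, "rank_%06d", rank);

    /* Wait for the baton from the previous rank in this file's group. */
    int baton = 0;
    if (frank > 0)
        MPI_Recv(&baton, 1, MPI_INT, frank - 1, 0, filecomm, MPI_STATUS_IGNORE);

    /* First rank creates the file; the rest append to it. */
    hid_t file = (frank == 0)
        ? H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT)
        : H5Fopen(fname, H5F_ACC_RDWR, H5P_DEFAULT);

    /* Write this rank's chunk under its own group. */
    hid_t grp = H5Gcreate2(file, gname, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t n = 100;
    double data[100];
    for (hsize_t i = 0; i < n; i++) data[i] = (double)rank;
    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t dset = H5Dcreate2(grp, "field", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);
    H5Dclose(dset); H5Sclose(space); H5Gclose(grp); H5Fclose(file);

    /* Hand the baton to the next rank in this file's group. */
    if (frank + 1 < fsize)
        MPI_Send(&baton, 1, MPI_INT, frank + 1, 0, filecomm);

    MPI_Comm_free(&filecomm);
    MPI_Finalize();
    return 0;
}

Only one rank per file is ever inside the open/write/close section, which is
why a serial HDF5 build is sufficient.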
> Yes, I am familiar with the concept and have found that link before. It
> seems we're at a point with HPC where there are still unanswered
> questions about the fastest and most practical way to get data from CPU
> space to disk space in a way that also keeps post-processing sufficiently
> simple.
>
> I see more clearly now what you're doing. You store each core's chunk of
> the data as an hdf5 group in a file, and only one core accesses a file at
> a time. So each core gets its baton, opens, writes, closes, and hands
> off. I am pretty sure there is very little overhead for opens and closes
> when a single core is doing them (as opposed to thousands of concurrent
> opens, for instance), but perhaps it will be non-negligible when
> multiplied by 100,000?
>
> Your files contain multiple groups representing spatial things, while my
> latest attempt is multiple groups representing spatial information at a
> given location. I see no reason why you couldn't have both multiple times
> (one group hierarchy) and core contributions (another one) in a single
> file, all designated by hdf5 groups.
>
> Your approach is one of the few remaining I have yet to try. I'm not sure
> it will give better I/O performance than one file per core, and that is
> my main concern right now. I am kind of used to using actual unix
> directories in the way that you are using hdf5 groups, to some degree, to
> reduce the number of files in a given directory.
>
> I notice you are a VisIt developer as well (I think we've corresponded
> before). Is there a VisIt plugin using hdf5 for your PMPIO format? This
> could be an additional motivator for me to try your approach, as I will
> be using VisIt for visualization on huge runs soon. I've twice developed
> VisIt plugins (hdf4, then 5), but only for file-per-core, and my C++
> skills are somewhat lacking.
>
> Leigh
>
>> > it is a path to happiness) as you do not call h5pset_dxpl_mpio and
>> > set the H5FD_MPIO_COLLECTIVE flag. You also do not construct a
>> > property list and pass it to h5dwrite, instructing each I/O core to
>> > write its own piece of an hdf5 file using offset arrays,
>> > h5sselect_hyperslab calls, etc., which is what the examples I have
>> > found led me to. It seems you are effectively doing serial hdf5 in
>> > parallel, which is what I am leaning towards at this point. Your
>> > approach is more elegant than mine, but I am (a) stuck with fortran
>> > and (b) not a programmer by training, although C is my preferred
>> > language for I/O. I'm not sure I could call your code from fortran
>> > easily without going through contortions (again forgive me, I am a
>> > weather guy who pretends he is a programmer).
>> >
>> > I fully embraced parallel hdf5 because I thought it could give me all
>> > the flexibility I needed to essentially tune
>>
>> So, I find the all-collective-all-the-time API for parallel HDF5 to be
>> way too 'inflexible' to handle sophisticated I/O patterns where data
>> type, size, shape, and even existence vary substantially from processor
>> to processor. For bread-and-butter data-parallel apps where essentially
>> the same few data structures (distributed arrays) are distributed across
>> processors, it works OK. But none of the simulation apps I support have
>> that kind of (simple) I/O pattern, nor even approximate it, especially
>> for plot outputs.
>
> I think pHDF5 is very neat and useful in some situations. My own
> experience is that with a modest number of cores (around 1k) performance
> is adequate, but for whatever reason bumping it up another order of
> magnitude leads to badness.
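For comparison with the baton sketch above, the collective pattern Leigh
describes, with a dataset transfer property list set to
H5FD_MPIO_COLLECTIVE and a per-rank hyperslab selection, looks roughly like
the sketch below. It assumes a parallel HDF5 build; the dataset name and
sizes are illustrative only.

/* Sketch of a collective pHDF5 write: each rank writes its own hyperslab
 * of a shared 1D dataset. Names and sizes are illustrative. */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const hsize_t nlocal = 1024;                 /* elements per rank */
    hsize_t dims = nlocal * (hsize_t)nprocs;     /* global dataset size */
    double buf[1024];
    for (hsize_t i = 0; i < nlocal; i++) buf[i] = (double)rank;

    /* Open one shared file with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Global dataspace and this rank's hyperslab within it. */
    hid_t filespace = H5Screate_simple(1, &dims, NULL);
    hid_t dset = H5Dcreate2(file, "field", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hsize_t start = nlocal * (hsize_t)rank, count = nlocal;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    hid_t memspace = H5Screate_simple(1, &count, NULL);

    /* Transfer property list requesting collective I/O. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}

Every rank participates in the create and write calls here, which is the
all-collective behavior discussed above.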
>
>> --
>> Mark C. Miller, Lawrence Livermore National Laboratory
>> ================!!LLNL BUSINESS ONLY!!================
>> [email protected]  urgent: [email protected]
>> T:8-6 (925)-423-5901  M/W/Th:7-12,2-7 (530)-753-8511
>
> --
> Leigh Orf
> Associate Professor of Atmospheric Science
> Department of Geology and Meteorology
> Central Michigan University
> Currently on sabbatical at the National Center for Atmospheric Research
> in Boulder, CO
> NCAR office phone: (303) 497-8200

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
