On Tue, Dec 14, 2010 at 5:42 PM, Quincey Koziol <[email protected]> wrote:
> Hi Leigh,
>
> [snipped for brevity]
>
>> Quincey,
>>
>> Probably a combination of both; namely, an ideal situation would be a
>> group of MPI ranks collectively writing one compressed HDF5 file. On Blue
>> Waters, a 100k-core run with 32 cores/MCM could therefore result in, say,
>> around 3000 files, which is not unreasonable.
>>
>> Maybe I'm thinking about this too simply, but couldn't you compress the
>> data on each MPI rank, save it in a buffer, calculate the space required,
>> and then write it? I don't know enough about the internal workings of
>> HDF5 to know whether that would fit into the HDF5 model. In our
>> particular application on Blue Waters, memory is cheap, so there is lots
>> of space in memory for buffering data.
>
> What you say above is basically what happens, except that space in the
> file needs to be allocated for each block of compressed data. Since each
> block is not the same size, the HDF5 library can't pre-allocate the space
> or algorithmically determine how much to reserve for each process. In the
> case of collective I/O, it's at least theoretically possible for all the
> processes to communicate and work it out, but I'm not certain it's going
> to be solvable for independent I/O, unless we reserve some processes to
> either allocate space (like a "free space server") or buffer the I/O, etc.
>
> Quincey

Could you make this work by forcing each core to use a specific chunking
arrangement? For instance, you could make each core's subdomain dimensions
the same as the chunk dimensions, which actually works out pretty well in my
application, at least in the horizontal: I typically have nxchunk = nx,
nychunk = ny, and nzchunk set to something like 20 or so. But now that I
think about it, even if that were the case, you don't know the size of the
compressed chunks until you've compressed them, and you'd still need to
communicate those sizes amongst the cores writing to an individual file.

I don't know enough about HDF5 to understand how the preallocation process
works. It sounds like you allocate a bunch of zeroes (or something) on disk
first and then do I/O straight to that space on disk? If that is the case,
then I can see how splitting compression amongst MPI ranks necessitates some
kind of collective communication. Personally, I am perfectly happy with a
bit of overhead that forces all cores to share the compressed block sizes
amongst themselves before writing, if it means we can do compression. Right
now I see my choices as (1) compression, but one file per MPI rank and
therefore lots of files, or (2) no compression and fewer files, perhaps
compressing later with h5repack, called in parallel as a post-processing
step with one h5repack per MPI rank (yuck!).
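For concreteness, here is a rough sketch in C of the chunk-per-rank layout
described above: one gzip-compressed chunk per rank in a single shared file,
with the chunk dimensions equal to each rank's subdomain. The file name,
dataset name ("qc"), subdomain sizes, and deflate level are made up for
illustration, and since (as discussed above) the library can't yet allocate
space for variable-size compressed chunks written in parallel, this shows
the intended usage pattern rather than something that works today.

/*
 * Sketch only: one gzip-compressed chunk per MPI rank in a shared file.
 * Dataset name, sizes, and deflate level are hypothetical.
 */
#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const hsize_t nx = 64, ny = 64, nz = 20;              /* per-rank subdomain */
    hsize_t dims[3]  = { nz, ny, nx * (hsize_t)nprocs };  /* ranks tiled in x   */
    hsize_t chunk[3] = { nz, ny, nx };                    /* one chunk per rank */

    /* One shared file, opened through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("fields.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Chunked + gzip dataset; chunk size matches each rank's subdomain. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_deflate(dcpl, 4);

    hid_t filespace = H5Screate_simple(3, dims, NULL);
    hid_t dset = H5Dcreate2(file, "qc", H5T_NATIVE_FLOAT, filespace,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Each rank selects exactly its own chunk as a hyperslab. */
    hsize_t start[3] = { 0, 0, nx * (hsize_t)rank };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, chunk, NULL);
    hid_t memspace = H5Screate_simple(3, chunk, NULL);

    float *buf = calloc(nx * ny * nz, sizeof(float));     /* rank-local data */
    /* ... fill buf from the model's arrays ... */

    /* Collective write: every rank lands on its own chunk. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, buf);

    free(buf);
    H5Dclose(dset); H5Sclose(filespace); H5Sclose(memspace);
    H5Pclose(dcpl); H5Pclose(dxpl); H5Pclose(fapl); H5Fclose(file);
    MPI_Finalize();
    return 0;
}

And a similarly rough sketch of the size exchange mentioned above, assuming
each rank has already compressed its chunk and knows only its own compressed
byte count: an exclusive prefix sum over those counts gives every rank the
offset at which its chunk could land in a contiguous region of the file.
This is only the collective-communication idea, not anything HDF5 exposes.

#include <mpi.h>

/* my_nbytes: size of this rank's compressed chunk, in bytes.
 * Returns the byte offset of this rank's chunk within the region. */
MPI_Offset chunk_offset(MPI_Offset my_nbytes, MPI_Comm comm)
{
    MPI_Offset my_offset = 0;
    /* Exclusive scan: rank r receives the sum of counts from ranks 0..r-1. */
    MPI_Exscan(&my_nbytes, &my_offset, 1, MPI_OFFSET, MPI_SUM, comm);
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        my_offset = 0;   /* MPI_Exscan leaves rank 0's output undefined */
    return my_offset;
}

The h5repack fallback in option (2) would presumably look something like
"h5repack -f GZIP=4 uncompressed.h5 compressed.h5", one invocation per
output file, launched in parallel.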
> [email protected] > http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org > > -- Leigh Orf Associate Professor of Atmospheric Science Department of Geology and Meteorology Central Michigan University Currently on sabbatical at the National Center for Atmospheric Research in Boulder, CO NCAR office phone: (303) 497-8200
