I guess I neglected to add... My comments assume a common use case: that datasets are stored CONTIGUOUSLY (not blocked). The HDF5 lib doesn't support filters on CONTIG datasets presently. I'd like to see that change.
For a lot of HPC applications using Poor Man's Parallel I/O, datasets are written in their entirety in a single H5Dwrite call and DO NOT require any special 'arranging' in the file to optimize for possible, future partial reads (e.g. blocking) either. They are read back in their entirety in a single H5Dread call. Nonetheless, even if datasets are blocked, the allocation problem (and my pseudo-solution below) just needs to work on a block-by-block basis; a rough sketch of the bookkeeping follows the quoted mail below.

Mark

On Tue, 2010-12-14 at 08:45 -0800, Mark Miller wrote:
> On Tue, 2010-12-14 at 05:40 -0800, Quincey Koziol wrote:
>
> > The primary problem is the space allocation that has to happen when
> > data is compressed. This is particularly a problem when performing
> > independent I/O, since the other processes aren't involved, but
> > [eventually] need to know about space that was allocated. Collective
> > I/O is easier, but still will require changes to HDF5, etc. Are you
> > wanting to use collective or independent I/O for your dataset writing?
> >
>
> I've had to deal with the space allocation issue for different reasons
> using custom compression filters (Peter Lindstrom's FPZIP and HZIP for
> structured meshes of hexes or tets and variables thereon).
>
> I think the HDF5 lib could 'solve' the allocation problem using an approach
> I took. However, you do have to 'get comfortable' with the idea that you
> might not utilize space in the file 100% optimally.
>
> Here is how it would work. Define a target compression ratio, R:1, that
> *must* be achieved for a given dataset. If the dataset is N bytes
> uncompressed, it will be NO MORE than N/R bytes compressed. Allocate N/R
> bytes in the file for this dataset. If you succeed in compressing by at
> least a ratio of R, you're golden. If not, fail the write and return an
> 'unable to compress to target ratio' error. The caller can decide to try
> again with a different target ratio (which will probably require some
> collective communication, as all procs will need to know the new size).
>
> If you succeed and compress by MORE than a ratio of R, you waste some
> space in the file. So what. Disk is cheap!
>
> Sometimes you can take a small sample of the dataset (say the first M
> bytes, or some bytes from the beginning, middle and end), compress it
> yourself to get an approximate idea of how 'compressible' it might be,
> and then set R based on that quick approximation. In addition, if HDF5
> returned information on how 'well' it was doing relative to compression
> targets (something like 'beat the target ratio by 10%' or 'missed the
> target ratio by 3%'), you could adjust the target ratio as necessary.
>
> Mark
>
-- 
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
[email protected]      urgent: [email protected]
T:8-6 (925)-423-5901     M/W/Th:7-12,2-7 (530)-753-8511
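
P.S. Here is a minimal, stand-alone sketch of the target-ratio bookkeeping
described in the quoted mail. It is NOT HDF5 library code: zlib's compress2()
stands in for whatever filter would actually sit in the pipeline, and the
names sample_ratio() and write_with_target_ratio() are made up for
illustration. For blocked datasets the same check would simply run per block.

/* Sketch of the target-ratio allocation idea; zlib is a stand-in filter. */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/* Compress a small sample of the data to get a rough estimate of the
 * achievable ratio. The caller would derate this (e.g. multiply by 0.8)
 * before using it as the target R. */
static double sample_ratio(const unsigned char *buf, size_t nbytes, size_t sample)
{
    if (sample > nbytes) sample = nbytes;
    uLongf clen = compressBound((uLong)sample);
    unsigned char *tmp = malloc(clen);
    if (!tmp) return 1.0;
    int rc = compress2(tmp, &clen, buf, (uLong)sample, Z_DEFAULT_COMPRESSION);
    free(tmp);
    if (rc != Z_OK || clen == 0) return 1.0;
    return (double)sample / (double)clen;
}

/* Reserve N/R bytes up front, then fail the write if the filter does not
 * hit the target ratio R. Beating R just wastes a little file space. */
static int write_with_target_ratio(const unsigned char *buf, size_t nbytes,
                                   double R, unsigned char **out, size_t *outlen)
{
    size_t budget = (size_t)((double)nbytes / R);   /* space reserved in file */
    uLongf clen = compressBound((uLong)nbytes);
    unsigned char *tmp = malloc(clen);
    if (!tmp) return -1;
    if (compress2(tmp, &clen, buf, (uLong)nbytes, Z_DEFAULT_COMPRESSION) != Z_OK) {
        free(tmp);
        return -1;
    }
    if ((size_t)clen > budget) {                    /* missed the target ratio */
        fprintf(stderr, "unable to compress to target ratio %.2f:1 (got %.2f:1)\n",
                R, (double)nbytes / (double)clen);
        free(tmp);
        return -1;  /* caller may retry (collectively) with a smaller R */
    }
    *out = tmp;      /* fits in the preallocated region; any leftover    */
    *outlen = clen;  /* 'budget' bytes simply go unused in the file      */
    return 0;
}

Compile with -lz. The retry-with-smaller-R path is where the collective
communication mentioned above would happen, since every proc needs to agree
on the new allocation size.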
