Hi Leigh,

OK, I understand.
Yes, in my world, timesteps near zero compress very well, as all the data
is generally just initial conditions: a lot of zeros. So target compression
ratios for early time in a simulation might be as high as 10:1. But as the
simulation evolves, the data gets more 'noisy' and we'd reduce that to 3:1
or 2:1. If the dynamic range of possible compression is 2-4:1, then you
could at least get within a factor of 2 of optimal. If the dynamic range is
more like 2-10:1, then I agree you'd be giving up too much.

Mark

On Wed, 2010-12-15 at 11:18, Leigh Orf wrote:
> Mark,
>
> Perhaps - however there will be a huge range of compression ratios in
> our simulations. In many cases we literally have all the same floating
> point values for a bunch of the variables in a given file. In other
> cases it's much less, getting only 2:1 or 4:1 compression ratios for
> scale+offset+gzip. So with that kind of range, I'm not sure it would
> be worth the effort. I'll mull it over some more, though; there may be
> a way to make it worth the effort.
>
> Leigh
>
> On Wed, Dec 15, 2010 at 12:13 PM, Mark Miller <[email protected]> wrote:
> Hi Leigh,
>
> I guess I am still interested to know whether an approach where
> specifying a minimum target compression ratio, and then allowing HDF5
> to (possibly over-)allocate assuming a maximum compressed size, would
> work for you?
>
> Mark
>
> On Wed, 2010-12-15 at 10:59, Leigh Orf wrote:
>
> > On Tue, Dec 14, 2010 at 5:42 PM, Quincey Koziol <[email protected]> wrote:
> > Hi Leigh,
> >
> > [snipped for brevity]
> >
> > > Quincey,
> > >
> > > Probably a combination of both; namely, an ideal situation
> > > would be a group of MPI ranks collectively writing one
> > > compressed HDF5 file. On Blue Waters, a 100k-core run with 32
> > > cores/MCM could therefore result in, say, around 3000 files,
> > > which is not unreasonable.
> > >
> > > Maybe I'm thinking about this too simply, but couldn't you
> > > compress the data on each MPI rank, save it in a buffer,
> > > calculate the space required, and then write it? I don't know
> > > enough about the internal workings of HDF5 to know whether
> > > that would fit in the HDF5 model. In our particular
> > > application on Blue Waters, memory is cheap, so there is
> > > lots of space in memory for buffering data.
> >
> > What you say above is basically what happens, except that
> > space in the file needs to be allocated for each block of
> > compressed data. Since each block is not the same size, the
> > HDF5 library can't pre-allocate the space or algorithmically
> > determine how much to reserve for each process. In the case
> > of collective I/O, at least it's theoretically possible for
> > all the processes to communicate and work it out, but I'm not
> > certain it's going to be solvable for independent I/O, unless
> > we reserve some processes to either allocate space (like a
> > "free space server") or buffer the I/O, etc.
> >
> > Could you make this work by forcing each core to have some specific
> > chunking arrangement? For instance, you could have each core's
> > dimensions simply be the same dimensions as each chunk, which
> > actually works out pretty well in my application, at least in the
> > horizontal. I typically have nxchunk=nx, nychunk=ny, and nzchunk set
> > to something like 20 or so.
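For what it's worth, the chunking arrangement you describe just above maps
pretty directly onto a dataset creation property list in the
one-file-per-MPI-rank case, where compressed writes already work. A minimal
sketch in C follows; the dataset name "/dbz", the per-rank extents, the
vertical chunk depth of 20, and the use of HDF5's built-in scale-offset
filter as a stand-in for the scale+offset step are assumptions for
illustration only, not anything from your actual model.

#include "hdf5.h"

/* Sketch: one compressed 3D float dataset per rank.  The chunk shape
 * matches the rank's subdomain in the horizontal (ny, nx) and uses a
 * depth of up to 20 levels in the vertical, as described above. */
hid_t create_compressed_dataset(hid_t file_id,
                                hsize_t nz, hsize_t ny, hsize_t nx)
{
    hsize_t dims[3]  = { nz, ny, nx };             /* slowest-varying first */
    hsize_t chunk[3] = { nz < 20 ? nz : 20, ny, nx };

    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);

    H5Pset_chunk(dcpl, 3, chunk);

    /* Scale-offset (D-scaling, 3 decimal digits) followed by gzip level 4:
     * one way to express the scale+offset+gzip pipeline mentioned above. */
    H5Pset_scaleoffset(dcpl, H5Z_SO_FLOAT_DSCALE, 3);
    H5Pset_deflate(dcpl, 4);

    hid_t dset = H5Dcreate2(file_id, "/dbz", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;   /* caller writes with H5Dwrite and closes with H5Dclose */
}

One call like this per rank, into that rank's own file, is essentially the
"compressed, but one file per MPI rank" option below; the open question in
this thread is making the same filters work when many ranks share one file.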
> > But - now that I think about it, even if that were the case, you
> > don't know the size of the compressed chunks until you've compressed
> > them, and you'd still need to communicate those sizes amongst the
> > cores writing to an individual file.
> >
> > I don't know enough about HDF5 to understand how the preallocation
> > process works. It sounds like you are allocating a bunch of zeroes
> > (or something) on disk first, and then doing I/O straight to that
> > space on disk? If that is the case, then I can see how this
> > necessitates some kind of collective communication if you are
> > splitting up compression amongst MPI ranks.
> >
> > Personally, I am perfectly happy with a bit of overhead which forces
> > all cores to share amongst themselves what the compressed block size
> > is before writing, if it means we can do compression. Right now I see
> > my choices as being (1) compressed data, but one file per MPI rank
> > (lots of files), or (2) no compression and fewer files, but perhaps
> > compressing later using h5repack, calling it in parallel, one
> > h5repack per MPI rank, as a post-processing step (yuck!).
> >
> > I'm glad you're working on this; personally, I think this is
> > important stuff for really huge simulations. In talking to other
> > folks who will be using Blue Waters, compression is not much of an
> > issue for many of them because of the nature of their data. Cloud
> > data especially tends to compress very well. It would be a shame to
> > fill terabytes of disk space with zeroes! I am sure we can still
> > carry out our research objectives without compression, but the sheer
> > amount of data we will be producing is staggering even with
> > compression.
> >
> > Leigh
> >
> > Quincey
>
> --
> Leigh Orf
> Associate Professor of Atmospheric Science
> Department of Geology and Meteorology
> Central Michigan University
> Currently on sabbatical at the National Center for Atmospheric
> Research in Boulder, CO
> NCAR office phone: (303) 497-8200

--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
[email protected]  urgent: [email protected]
T:8-6 (925)-423-5901  M/W/Th:7-12,2-7 (530)-753-8511

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
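To make the "communicate the size of the compressed chunks" step concrete,
here is a rough sketch of the coordination that collective I/O would allow,
written with plain MPI calls outside of HDF5: each rank compresses its chunk
locally, an exclusive prefix sum of the compressed sizes gives each rank its
write offset, and a reduction gives the total so the space can be reserved
once. The function, its arguments, and the use of MPI_Exscan are
illustrative assumptions only, not how the HDF5 library allocates space
internally.

#include <mpi.h>
#include <stdint.h>

/* Sketch: given the byte count of this rank's already-compressed chunk,
 * compute where that chunk would land in a shared file.  'base' is the
 * file offset where the chunk region starts; 'total_bytes' returns the
 * sum over all ranks so the space can be reserved in one step. */
MPI_Offset chunk_write_offset(MPI_Comm comm, MPI_Offset base,
                              uint64_t my_compressed_bytes,
                              uint64_t *total_bytes)
{
    int rank;
    uint64_t before_me = 0;   /* sum of compressed sizes on lower ranks */

    MPI_Comm_rank(comm, &rank);

    /* Exclusive prefix sum: rank r receives the sum over ranks 0..r-1. */
    MPI_Exscan(&my_compressed_bytes, &before_me, 1,
               MPI_UINT64_T, MPI_SUM, comm);
    if (rank == 0)
        before_me = 0;        /* MPI_Exscan leaves rank 0's result undefined */

    /* Every rank also learns the grand total compressed size. */
    MPI_Allreduce(&my_compressed_bytes, total_bytes, 1,
                  MPI_UINT64_T, MPI_SUM, comm);

    return base + (MPI_Offset)before_me;
}

With the total in hand, one rank (or a dedicated "free space server" of the
kind Quincey mentions) could extend the file once, and every rank could then
write its compressed block independently at its own offset; the cost is the
small amount of collective overhead Leigh says he is happy to pay.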
