On Mon, 2011-12-19 at 12:05 -0800, John Knutson wrote:
> Paul Anton Letnes wrote:
> > Compression might screw with that, though. An idea is to use rsync
> > compression instead, and leave the hdf5 files uncompressed. From man
> > rsync:
> >     -z, --compress    compress file data during the transfer
> >
> > Cheers
> > Paul
>
> I'm a bit reluctant to turn off compression due to the storage
> requirements of the data in question (it's an enormous amount of data
> that happens to compress very well). Still, aren't the chunks
> compressed individually? I was under the impression that when
> compressing, each chunk was individually compressed.
Yes, chunks are compressed individually. One good reason for that is so that partial readback doesn't wind up requiring decompression of the entire dataset. But each compressed chunk can wind up taking a variable amount of space in the file: some chunks compress really well and others don't.

> As such, the only
> things that should be changing are those chunks that have had new data
> added, and the table of contents (I forget the term the devs use). How
> much is actually changed probably depends a lot on the pre-allocation of
> data.

Might there be other things that could contribute to a cascade of differences? For instance, what about the order of HDF5 writes to the file? If you overwrite and/or extend an existing dataset, what about the impact of 'garbage collection'? I mean, if you overwrite a portion of a dataset with new data, and the new chunks don't compress into the same spaces the old chunks fit, then I think you can get some rearrangement of chunks in the file, and possibly dead space that can't be reclaimed. Are there timestamps on these things too?

> Now, I'm using the gzip *filter*, in conjunction with the shuffle
> filter. There's no indication of an "rsyncable" option here, or in the
> (admittedly dated) gzip binaries I have installed.

I was not aware of that option for the gzip application (tool). And I am certain the HDF5 library does not have a 'property' to affect that in its dataset creation property lists. If zlib has a way to produce rsyncable compression via its C interface, you could try writing your own HDF5 filter to use in place of HDF5's built-in gzip filter. I honestly don't know what effect chunking would have on that, though.

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
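For anyone following along, the per-chunk compression behavior described above can be sketched outside of HDF5 entirely. This is an illustrative toy in Python's stdlib zlib, not the HDF5 API: each "chunk" is compressed independently, so reading one chunk back never touches the others, and compressed chunks end up with variable sizes in storage.

```python
import random
import zlib

# Sketch of HDF5-style chunked storage (illustrative only, NOT the
# HDF5 API): each chunk is compressed independently, so partial
# readback never requires decompressing the whole dataset.
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(2048))   # compresses poorly
smooth = b"\x00" * 2048                                     # compresses very well
data = noisy + smooth

chunk_size = 1024
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
compressed = [zlib.compress(c, 6) for c in chunks]

# Compressed chunks take variable amounts of space in the file.
sizes = [len(c) for c in compressed]
assert sizes[0] > sizes[3]          # noisy chunk is far larger than the zero chunk

# Partial readback: decompress only the one chunk you need.
assert zlib.decompress(compressed[3]) == chunks[3]
```

This is also why an overwrite can cascade: if a rewritten chunk no longer fits its old slot, the chunk gets relocated within the file.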

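On the rsyncable question: the trick gzip's --rsyncable patch uses is to periodically reset the compressor's state so that a local change to the input doesn't ripple through the rest of the compressed stream. zlib exposes a comparable mechanism through Z_FULL_FLUSH. Below is a hedged sketch using Python's zlib (gzip --rsyncable actually chooses flush points with a rolling hash over the input; fixed-size boundaries are used here purely for illustration):

```python
import zlib

def compress_with_flushes(data, block=1024):
    """Compress data, doing a Z_FULL_FLUSH at fixed block boundaries.

    A full flush resets the compressor's state, so identical input after
    a flush point compresses to identical bytes no matter what preceded it.
    """
    co = zlib.compressobj(6)
    segs = [co.compress(data[i:i + block]) + co.flush(zlib.Z_FULL_FLUSH)
            for i in range(0, len(data), block)]
    segs.append(co.flush())  # stream trailer (includes the adler32 checksum)
    return segs

a = b"0123456789" * 1000
b = b"X" + a[1:]            # single-byte change near the start

segs_a = compress_with_flushes(a)
segs_b = compress_with_flushes(b)

# Only the first segment (and the checksum trailer) differ; everything
# in between is byte-identical, which is what lets rsync resynchronize.
assert segs_a[0] != segs_b[0]
assert segs_a[1:-1] == segs_b[1:-1]

# Sanity check: the flushed stream still decompresses normally.
assert zlib.decompress(b"".join(segs_a)) == a
```

If zlib's C interface (deflate() with Z_FULL_FLUSH) behaves the same way, a custom HDF5 filter along these lines seems plausible, though as noted above it is unclear how it would interact with HDF5's own chunking, since each chunk already gets an independent compression stream.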