On Mon, 2011-12-19 at 12:05 -0800, John Knutson wrote:
> Paul Anton Letnes wrote:
> > Compression might screw with that, though. An idea is to use rsync
> > compression instead, and leave the hdf5 files uncompressed. From man
> > rsync:
> >     -z, --compress    compress file data during the transfer
> >
> > Cheers
> > Paul
>
> I'm a bit reluctant to turn off compression due to the storage
> requirements of the data in question (it's an enormous amount of data
> that happens to compress very well). Still, aren't the chunks
> compressed individually? I was under the impression that when
> compressing, each chunk was individually compressed.
Yes, chunks are compressed individually. One good reason for that is so that partial readback doesn't wind up requiring decompression of the entire dataset. But each compressed chunk can wind up taking a variable amount of space in the file: some chunks compress really well and others don't.

> As such, the only
> things that should be changing are those chunks that have had new data
> added, and the table of contents (I forget the term the devs use). How
> much is actually changed probably depends a lot on the pre-allocation of
> data.

Might there be other things that could contribute to a cascade of differences? For instance, what about the order of HDF5 writes to the file? If you overwrite and/or extend an existing dataset, what about the impact of 'garbage collection'? I mean, if you overwrite a portion of a dataset with new data, and the new chunks don't compress into the same spaces the old chunks fit, then I think you can get some rearrangement of chunks in the file, and possibly dead space that can't be reclaimed. Are there timestamps on these things too?

> Now, I'm using the gzip *filter*, in conjunction with the shuffle
> filter. There's no indication of an "rsyncable" option here, or in the
> (admittedly dated) gzip binaries I have installed.

I was not aware of that option for the gzip application (tool). And I am certain the HDF5 library does not have a 'property' to affect that in its dataset creation property lists. If zlib has a way to produce rsyncable compression via its C interface, you could try writing your own HDF5 filter to use in place of HDF5's built-in gzip filter. I honestly don't know what effect chunking would have on that, though.

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
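For anyone following along, the per-chunk compression behavior described above can be sketched outside of HDF5 entirely. This is an illustrative toy in Python's stdlib zlib, not the HDF5 API: each "chunk" is compressed independently, so reading one chunk back never touches the others, and compressed chunks end up with variable sizes in storage.

```python
import random
import zlib

# Sketch of HDF5-style chunked storage (illustrative only, NOT the
# HDF5 API): each chunk is compressed independently, so partial
# readback never requires decompressing the whole dataset.
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(2048))   # compresses poorly
smooth = b"\x00" * 2048                                     # compresses very well
data = noisy + smooth

chunk_size = 1024
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
compressed = [zlib.compress(c, 6) for c in chunks]

# Compressed chunks take variable amounts of space in the file.
sizes = [len(c) for c in compressed]
assert sizes[0] > sizes[3]          # noisy chunk is far larger than the zero chunk

# Partial readback: decompress only the one chunk you need.
assert zlib.decompress(compressed[3]) == chunks[3]
```

This is also why an overwrite can cascade: if a rewritten chunk no longer fits its old slot, the chunk gets relocated within the file.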

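On the rsyncable question: the trick gzip's --rsyncable patch uses is to periodically reset the compressor's state so that a local change to the input doesn't ripple through the rest of the compressed stream. zlib exposes a comparable mechanism through Z_FULL_FLUSH. Below is a hedged sketch using Python's zlib (gzip --rsyncable actually chooses flush points with a rolling hash over the input; fixed-size boundaries are used here purely for illustration):

```python
import zlib

def compress_with_flushes(data, block=1024):
    """Compress data, doing a Z_FULL_FLUSH at fixed block boundaries.

    A full flush resets the compressor's state, so identical input after
    a flush point compresses to identical bytes no matter what preceded it.
    """
    co = zlib.compressobj(6)
    segs = [co.compress(data[i:i + block]) + co.flush(zlib.Z_FULL_FLUSH)
            for i in range(0, len(data), block)]
    segs.append(co.flush())  # stream trailer (includes the adler32 checksum)
    return segs

a = b"0123456789" * 1000
b = b"X" + a[1:]            # single-byte change near the start

segs_a = compress_with_flushes(a)
segs_b = compress_with_flushes(b)

# Only the first segment (and the checksum trailer) differ; everything
# in between is byte-identical, which is what lets rsync resynchronize.
assert segs_a[0] != segs_b[0]
assert segs_a[1:-1] == segs_b[1:-1]

# Sanity check: the flushed stream still decompresses normally.
assert zlib.decompress(b"".join(segs_a)) == a
```

If zlib's C interface (deflate() with Z_FULL_FLUSH) behaves the same way, a custom HDF5 filter along these lines seems plausible, though as noted above it is unclear how it would interact with HDF5's own chunking, since each chunk already gets an independent compression stream.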