Hi Marcial,

On Wed, Oct 5, 2011 at 10:22 AM, marcialieec <[email protected]> wrote:
> Hi!
>
> I'm using HDF5 to create an intermediate storage stage for an ESA project.
> Currently our HDF5 file structure is the following:
>
> /Dset1 (with a compound datatype)
> /Dset2 (vlen datatype)
> /Dset3 (vlen datatype)
>
> Most of the data is in datasets 2 and 3.
>
> We are interested in using the HDF5 chunking and compression features.
> However, although we have activated GZIP compression, the file hardly gets
> compressed at all. I've been searching and found some posts saying that
> compression on vlen datasets only acts on the "references" of the vlen data.
>
> My questions are:
>
> Is this still the actual situation? Is there a plan to have this
> "solved/modified" in the near future?
> What are our other options if vlen datasets cannot be compressed? Assuming
> our data has a variable-length nature, of course.
>
> Thanks!
> Marcial
Yes, this is still the case. I'm pretty sure it won't be fixed any time soon, since it involves major library changes, though it's something we'd like to do.

You have two other options:

1) If your data is only a little variable, create a regular chunked and compressed dataset that is guaranteed to fit any reasonable record, and store each record with some trailing empty space. Compression usually handles the trailing space efficiently, even if it's quite large. The downside is that guessing a good size can be difficult, and you'll need a scheme for handling records that exceed the pre-guessed fixed size. If you're in a situation where you can pre-scan the input, you can always compute the fixed sizes to use. (There's a rough sketch of this in the P.S. below.)

2) Store your data concatenated in a 1D dataset and use a second dataset of start/end indexes to locate the data for each element. This is basically how the HDF5 variable-length datatype works internally; you're just implementing it at a higher level with the public API. This will probably produce a larger file than the first scheme and give slower access, but your mileage may vary, so you'll want to try both on realistic data. It does have the advantage of handling records of any size. (A sketch of this scheme also follows below.)

I've also kicked around the idea of a hybrid scheme: a fixed-size dataset whose first bytes are interpreted as indexes into a secondary concatenated dataset when a magic byte, a bit in a bitmap index, etc. is set. That would be harder to implement in a nice way using the public API, though.

Cheers,
Dana
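P.S. In case it helps, here's a rough sketch of scheme 1, written with h5py for brevity (the same idea translates directly to the C API). The dataset names, the guessed maximum length, and the sample records are all made up for illustration:

    import numpy as np
    import h5py

    MAX_LEN = 1024  # guessed upper bound on record length (an assumption)
    records = [np.arange(10), np.arange(500), np.arange(37)]  # made-up data

    with h5py.File("padded.h5", "w") as f:
        # Fixed-size, chunked, gzip-compressed dataset; unused trailing space
        # stays at the fill value and compresses away almost entirely.
        dset = f.create_dataset("Dset2_fixed",
                                shape=(len(records), MAX_LEN),
                                dtype="int32",
                                chunks=(1, MAX_LEN),
                                compression="gzip",
                                fillvalue=0)
        # Keep the true length of each record so readers can trim the padding.
        lengths = f.create_dataset("Dset2_lengths",
                                   shape=(len(records),), dtype="int64")
        for i, rec in enumerate(records):
            dset[i, :len(rec)] = rec
            lengths[i] = len(rec)

A reader then recovers record i with dset[i, :lengths[i]].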
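And a sketch of scheme 2, again with h5py and made-up names/data. One long 1D dataset holds the concatenated records, and a second dataset holds (start, end) offsets into it:

    import numpy as np
    import h5py

    records = [np.arange(10), np.arange(500), np.arange(37)]  # made-up data

    # Concatenate every record into one 1D dataset and keep (start, end)
    # offsets in a second dataset -- essentially what the vlen type does,
    # but done by hand so both datasets can be chunked and gzip-compressed.
    lengths = np.array([len(r) for r in records])
    ends = np.cumsum(lengths)
    starts = ends - lengths
    data = np.concatenate(records).astype("int32")

    with h5py.File("concat.h5", "w") as f:
        f.create_dataset("Dset2_data", data=data,
                         chunks=True, compression="gzip")
        f.create_dataset("Dset2_index", data=np.stack([starts, ends], axis=1),
                         chunks=True, compression="gzip")

    # Reading a record back (here record 1) is just a slice:
    with h5py.File("concat.h5", "r") as f:
        s, e = f["Dset2_index"][1]
        rec = f["Dset2_data"][s:e]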
