HDF5 supports pluggable compression filters, so this seems like something that should be handled within the HDF5 library itself. The fastest filter seems to be Blosc, which was written by the PyTables author. Blosc is not shipped with HDF5 by default, but if we included it in the BinDeps builds for HDF5, it would make a nice default compressed format.
On Tuesday, September 2, 2014 11:30:39 AM UTC-4, Douglas Bates wrote:
>
> Now that the JLD format can handle DataFrame objects I would like to
> switch from storing data sets in .RData format to .jld format. Datasets
> stored in .RData format are compressed after they are written. The default
> compression is gzip. Bzip2 and xz compression are also available. The
> compression can make a substantial difference in the file size because the
> data values are often highly repetitive.
>
> JLD is different in scope in that .jld files can be queried using external
> programs like h5ls and the files can have new data added or existing data
> edited or removed. The .RData format is an archival format. Once the file
> is written it cannot be modified in place.
>
> Given these differences I can appreciate that JLD files are not
> compressed. Nevertheless I think it would be useful to adopt a convention
> in the JLD module for accessing data from files with a .jld.xz or .jld.7z
> extension. It could be as simple as uncompressing the files in a temporary
> directory, reading then removing, or it could be more sophisticated. I
> notice that my versions of libjulia.so on an Ubuntu 64-bit system are
> linked against both libz.so and liblzma.so
>
> $ ldd /usr/lib/x86_64-linux-gnu/julia/libjulia.so
> linux-vdso.so.1 => (0x00007fff5214f000)
> libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f62932ee000)
> libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f62930d5000)
> libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6292dce000)
> librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f6292bc6000)
> libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f62929a8000)
> libunwind.so.8 => /usr/lib/x86_64-linux-gnu/libunwind.so.8 (0x00007f629278c000)
> libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6292488000)
> libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6292272000)
> libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6291eab000)
> /lib64/ld-linux-x86-64.so.2 (0x00007f62944b3000)
> liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f6291c89000)
>
> AFAIK the user-level interface to gzip requires the GZip package. Unless
> I have missed something (always a possibility) there is no user-level
> interface to liblzma in Julia. If the library is going to be linked
> anyway, would it make sense to provide a user-level interface in Julia?
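
For reference, the "uncompress in a temporary directory, read, then remove" convention from the quoted message could look roughly like the sketch below. It is written in Python with only the standard library, purely as an illustration; it is not JLD code. The names `with_decompressed` and `demo` are hypothetical, the `reader` callback stands in for whatever would actually load a .jld file, and the demo payload is arbitrary repetitive bytes rather than a real JLD file.

```python
import lzma
import os
import shutil
import tempfile


def with_decompressed(path, reader):
    """Call reader(plain_path) on an uncompressed copy of path, then clean up.

    Hypothetical helper: files without a ".xz" suffix are passed through as-is.
    """
    if not path.endswith(".xz"):
        return reader(path)
    # Strip ".xz" so the temporary copy keeps its original name (e.g. "x.jld").
    plain_name = os.path.basename(path)[: -len(".xz")]
    # The temporary directory and its contents are removed when the block exits.
    with tempfile.TemporaryDirectory() as tmpdir:
        plain_path = os.path.join(tmpdir, plain_name)
        with lzma.open(path, "rb") as src, open(plain_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return reader(plain_path)


def demo():
    """Round-trip some repetitive bytes through an .xz file."""
    payload = b"highly repetitive data " * 1000  # stand-in, not real JLD content

    def read_bytes(p):
        with open(p, "rb") as f:
            return f.read()

    with tempfile.TemporaryDirectory() as workdir:
        archive = os.path.join(workdir, "example.jld.xz")
        with lzma.open(archive, "wb") as f:
            f.write(payload)
        restored = with_decompressed(archive, read_bytes)
        smaller = os.path.getsize(archive) < len(payload)
    return restored == payload, smaller
```

The obvious cost of this approach is that the whole file is decompressed on every read and cannot be modified in place, which is the trade-off that a filter inside the HDF5 library would avoid.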
