That looks pretty sweet. It seems to avoid a lot of the pitfalls of naively compressing data files while still getting the benefits. It would be great to support that in JLD, maybe even turned on by default.
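To make the chunking idea concrete, here is a minimal sketch in Python, chosen only for brevity. It is not Blosc, HDF5, or JLD code: the standard-library `zlib` stands in for the compressor, and the names `compress_chunks`, `read_chunk`, `read_all` and the 4096-byte chunk size are hypothetical. Each chunk is compressed on its own, so a single chunk can be read back without decompressing the rest of the data.

```python
import zlib

# Hypothetical chunk size in bytes. Per the Blosc description quoted in this
# thread, the real library sizes chunks so the compressed chunk fits in L1 cache.
CHUNK_SIZE = 4096


def compress_chunks(data, chunk_size=CHUNK_SIZE):
    """Compress each fixed-size chunk of `data` independently."""
    return [
        zlib.compress(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]


def read_chunk(chunks, index):
    """Decompress one chunk without touching the others."""
    return zlib.decompress(chunks[index])


def read_all(chunks):
    """Decompress every chunk and reassemble the original bytes."""
    return b"".join(zlib.decompress(c) for c in chunks)


# Highly repetitive data, similar in spirit to the data sets discussed here.
data = b"treatment,control," * 5000
chunks = compress_chunks(data)
compressed_size = sum(len(c) for c in chunks)
print(len(data), "bytes raw,", compressed_size, "bytes compressed")
```

The trade-off is that small chunks give cheap partial reads but a somewhat worse compression ratio than compressing the whole buffer at once, which is the kind of compromise that would need measuring.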
On Tue, Sep 2, 2014 at 1:35 PM, Kevin Squire <[email protected]> wrote:
> Just to hype blosc a little more, see
>
> http://www.blosc.org/blosc-in-depth.html
>
> The main feature is that data is chunked so that the compressed chunk size
> fits into L1 cache, and is then decompressed and used there. There are a
> few more buzzwords (multithreading, simd) in the link above. Worth
> exploring where this might be useful in Julia.
>
> Cheers,
> Kevin
>
>
> On Tuesday, September 2, 2014, Tim Holy <[email protected]> wrote:
>
>> HDF5/JLD does support compression:
>>
>> https://github.com/timholy/HDF5.jl/blob/master/doc/hdf5.md#reading-and-writing-data
>>
>> But it's not turned on by default. Matlab uses compression by default, and
>> I've found it's a huge bottleneck in terms of performance
>> (http://www.mathworks.com/matlabcentral/fileexchange/39721-save-mat-files-more-quickly).
>> But perhaps there's a good middle ground. It would take someone
>> doing a little experimentation to see what the compromises are.
>>
>> --Tim
>>
>> On Tuesday, September 02, 2014 08:30:39 AM Douglas Bates wrote:
>> > Now that the JLD format can handle DataFrame objects I would like to switch
>> > from storing data sets in .RData format to .jld format. Datasets stored in
>> > .RData format are compressed after they are written. The default
>> > compression is gzip. Bzip2 and xz compression are also available. The
>> > compression can make a substantial difference in the file size because the
>> > data values are often highly repetitive.
>> >
>> > JLD is different in scope in that .jld files can be queried using external
>> > programs like h5ls and the files can have new data added or existing data
>> > edited or removed. The .RData format is an archival format. Once the file
>> > is written it cannot be modified in place.
>> >
>> > Given these differences I can appreciate that JLD files are not compressed.
>> > Nevertheless I think it would be useful to adopt a convention in the JLD
>> > module for accessing data from files with a .jld.xz or .jld.7z extension.
>> > It could be as simple as uncompressing the files in a temporary directory,
>> > reading then removing, or it could be more sophisticated. I notice that my
>> > versions of libjulia.so on an Ubuntu 64-bit system are linked against both
>> > libz.so and liblzma.so
>> >
>> > $ ldd /usr/lib/x86_64-linux-gnu/julia/libjulia.so
>> > linux-vdso.so.1 => (0x00007fff5214f000)
>> > libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f62932ee000)
>> > libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f62930d5000)
>> > libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6292dce000)
>> > librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f6292bc6000)
>> > libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f62929a8000)
>> > libunwind.so.8 => /usr/lib/x86_64-linux-gnu/libunwind.so.8 (0x00007f629278c000)
>> > libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6292488000)
>> > libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6292272000)
>> > libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6291eab000)
>> > /lib64/ld-linux-x86-64.so.2 (0x00007f62944b3000)
>> > liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f6291c89000)
>> >
>> >
>> > AFAIK the user-level interface to gzip requires the GZip package. Unless I
>> > have missed something (always a possibility) there is no user-level
>> > interface to liblzma in Julia. If the library is going to be linked
>> > anyway, would it make sense to provide a user-level interface in Julia?
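The simple variant Douglas describes (uncompress into a temporary directory, read, then remove) can be sketched roughly as follows. This is Python rather than Julia, using only the standard library, whose `lzma` module wraps liblzma. The function name `read_xz_via_tempdir`, the `reader` callback, and the file name `example.jld.xz` are all hypothetical, and the payload is placeholder bytes, not a real JLD file.

```python
import lzma
import os
import shutil
import tempfile


def read_xz_via_tempdir(path, reader):
    """Decompress the .xz file at `path` into a temporary directory, hand the
    decompressed file's path to `reader`, and return whatever it returns."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Strip the trailing ".xz" to recover the inner file name.
        inner_name = os.path.basename(path)
        if inner_name.endswith(".xz"):
            inner_name = inner_name[:-3]
        inner_path = os.path.join(tmpdir, inner_name)

        # Stream the decompressed bytes to disk instead of loading them all.
        with lzma.open(path, "rb") as src, open(inner_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

        # The temporary directory and its contents are removed when the
        # `with` block exits, after `reader` has finished.
        return reader(inner_path)


def read_bytes(path):
    """Stand-in for a real file reader (for example, a JLD loader)."""
    with open(path, "rb") as f:
        return f.read()


# Demo: create a small .xz archive, read it back, then clean up.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "example.jld.xz")
payload = b"placeholder bytes, not a real JLD file"
with lzma.open(archive, "wb") as f:
    f.write(payload)

result = read_xz_via_tempdir(archive, read_bytes)
shutil.rmtree(workdir)
```

One consequence of this approach is that the file is effectively read-only: edits made to the temporary copy are lost unless it is recompressed, so it gives up the in-place modification that distinguishes .jld from .RData.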
