It would be best to incorporate Blosc into the HDF5 package. A separate Julia package would be useful if you wanted to apply the same sort of compression to Julia binary blobs, such as serialized Julia values in an IOBuffer.
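
To make the binary-blob case concrete, here is a minimal sketch of what such a blob looks like. It is only an illustration, not code from any existing package: the `value` dictionary is a made-up example, and I am assuming a current Julia where `serialize` lives in the Serialization standard library. The byte vector `blob` is what a compressor would take as input; no compression call is shown because no such binding is assumed here.

```julia
using Serialization  # assumption: Julia 1.x, where serialize/deserialize are in this stdlib

# A made-up value to serialize (hypothetical example data).
value = Dict("x" => collect(1:1000), "label" => "example")

# Serialize the value into an in-memory buffer.
io = IOBuffer()
serialize(io, value)

# Extract the raw bytes: this Vector{UInt8} is the "binary blob"
# that a compression routine would operate on.
blob = take!(io)

# Round trip: deserialize from a fresh buffer wrapping the same bytes.
restored = deserialize(IOBuffer(blob))
```

A compressor wrapped by a standalone package would only need to map one `Vector{UInt8}` to another, which is why this use case does not depend on HDF5 at all.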

On Tuesday, September 2, 2014 3:47:33 PM UTC-4, Douglas Bates wrote:
>
> Would it be reasonable to create a Blosc package or it is best to
> incorporate it directly into the HDF5 package?  If a separate package is
> reasonable I could start on it, as I was the one who suggested this in the
> first place.
>
> On Tuesday, September 2, 2014 2:43:15 PM UTC-5, Tim Holy wrote:
>>
>> All these testimonials do make it sound promising. Even three-fold
>> compression is a pretty big deal.
>>
>> One disadvantage to compression is that it makes mmap impossible. But,
>> since HDF5 supports hyperslabs, that's not as big a deal as it would
>> have been.
>>
>> --Tim
>>
>> On Tuesday, September 02, 2014 12:11:55 PM Jake Bolewski wrote:
>> > I've used Blosc in the past with great success. Oftentimes it is
>> > faster than the uncompressed version if IO is the bottleneck. The
>> > compression ratios are not great but that is really not the point.
>> >
>> > On Tuesday, September 2, 2014 2:09:20 PM UTC-4, Stefan Karpinski wrote:
>> > > That looks pretty sweet. It seems to avoid a lot of the pitfalls of
>> > > naively compressing data files while still getting the benefits. It
>> > > would be great to support that in JLD, maybe even turned on by
>> > > default.
>> > >
>> > > On Tue, Sep 2, 2014 at 1:35 PM, Kevin Squire wrote:
>> > >> Just to hype blosc a little more, see
>> > >>
>> > >> http://www.blosc.org/blosc-in-depth.html
>> > >>
>> > >> The main feature is that data is chunked so that the compressed
>> > >> chunk size fits into L1 cache, and is then decompressed and used
>> > >> there. There are a few more buzzwords (multithreading, simd) in
>> > >> the link above. Worth exploring where this might be useful in
>> > >> Julia.
>> > >>
>> > >> Cheers,
>> > >>
>> > >> Kevin
>> > >>
>> > >> On Tuesday, September 2, 2014, Tim Holy wrote:
>> > >>> HDF5/JLD does support compression:
>> > >>>
>> > >>> https://github.com/timholy/HDF5.jl/blob/master/doc/hdf5.md#reading-and-writing-data
>> > >>>
>> > >>> But it's not turned on by default. Matlab uses compression by
>> > >>> default, and I've found it's a huge bottleneck in terms of
>> > >>> performance
>> > >>> (http://www.mathworks.com/matlabcentral/fileexchange/39721-save-mat-files-more-quickly).
>> > >>> But perhaps there's a good middle ground. It would take someone
>> > >>> doing a little experimentation to see what the compromises are.
>> > >>>
>> > >>> --Tim
>> > >>>
>> > >>> On Tuesday, September 02, 2014 08:30:39 AM Douglas Bates wrote:
>> > >>> > Now that the JLD format can handle DataFrame objects I would
>> > >>> > like to switch from storing data sets in .RData format to .jld
>> > >>> > format. Datasets stored in .RData format are compressed after
>> > >>> > they are written. The default compression is gzip. Bzip2 and
>> > >>> > xz compression are also available. The compression can make a
>> > >>> > substantial difference in the file size because the data values
>> > >>> > are often highly repetitive.
>> > >>> >
>> > >>> > JLD is different in scope in that .jld files can be queried
>> > >>> > using external programs like h5ls and the files can have new
>> > >>> > data added or existing data edited or removed. The .RData
>> > >>> > format is an archival format. Once the file is written it
>> > >>> > cannot be modified in place.
>> > >>> >
>> > >>> > Given these differences I can appreciate that JLD files are not
>> > >>> > compressed. Nevertheless I think it would be useful to adopt a
>> > >>> > convention in the JLD module for accessing data from files with
>> > >>> > a .jld.xz or .jld.7z extension. It could be as simple as
>> > >>> > uncompressing the files in a temporary directory, reading then
>> > >>> > removing, or it could be more sophisticated. I notice that my
>> > >>> > versions of libjulia.so on an Ubuntu 64-bit system are linked
>> > >>> > against both libz.so and liblzma.so
>> > >>> >
>> > >>> > $ ldd /usr/lib/x86_64-linux-gnu/julia/libjulia.so
>> > >>> > linux-vdso.so.1 => (0x00007fff5214f000)
>> > >>> > libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f62932ee000)
>> > >>> > libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f62930d5000)
>> > >>> > libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6292dce000)
>> > >>> > librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f6292bc6000)
>> > >>> > libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f62929a8000)
>> > >>> > libunwind.so.8 => /usr/lib/x86_64-linux-gnu/libunwind.so.8 (0x00007f629278c000)
>> > >>> > libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6292488000)
>> > >>> > libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6292272000)
>> > >>> > libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6291eab000)
>> > >>> > /lib64/ld-linux-x86-64.so.2 (0x00007f62944b3000)
>> > >>> > liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f6291c89000)
>> > >>> >
>> > >>> > AFAIK the user-level interface to gzip requires the GZip
>> > >>> > package. Unless I have missed something (always a possibility)
>> > >>> > there is no user-level interface to liblzma in Julia. If the
>> > >>> > library is going to be linked anyway, would it make sense to
>> > >>> > provide a user-level interface in Julia?
