I think it would be very much in line with our general ethos that the default thing we do is the fastest possible thing, and it seems like Blosc is that.

On Tue, Sep 2, 2014 at 3:11 PM, Jake Bolewski <[email protected]> wrote:

> I've used Blosc in the past with great success. Oftentimes it is faster
> than the uncompressed version if IO is the bottleneck. The compression
> ratios are not great but that is really not the point.
>
> On Tuesday, September 2, 2014 2:09:20 PM UTC-4, Stefan Karpinski wrote:
>
>> That looks pretty sweet. It seems to avoid a lot of the pitfalls of
>> naively compressing data files while still getting the benefits. It would
>> be great to support that in JLD, maybe even turned on by default.
>>
>> On Tue, Sep 2, 2014 at 1:35 PM, Kevin Squire <[email protected]> wrote:
>>
>>> Just to hype blosc a little more, see
>>>
>>> http://www.blosc.org/blosc-in-depth.html
>>>
>>> The main feature is that data is chunked so that the compressed chunk
>>> size fits into L1 cache, and is then decompressed and used there. There
>>> are a few more buzzwords (multithreading, simd) in the link above. Worth
>>> exploring where this might be useful in Julia.
>>>
>>> Cheers,
>>> Kevin
>>>
>>> On Tuesday, September 2, 2014, Tim Holy <[email protected]> wrote:
>>>
>>>> HDF5/JLD does support compression:
>>>> https://github.com/timholy/HDF5.jl/blob/master/doc/hdf5.md#reading-and-writing-data
>>>>
>>>> But it's not turned on by default. Matlab uses compression by default,
>>>> and I've found it's a huge bottleneck in terms of performance
>>>> (http://www.mathworks.com/matlabcentral/fileexchange/39721-save-mat-files-more-quickly).
>>>> But perhaps there's a good middle ground. It would take someone
>>>> doing a little experimentation to see what the compromises are.
>>>>
>>>> --Tim
>>>>
>>>> On Tuesday, September 02, 2014 08:30:39 AM Douglas Bates wrote:
>>>> > Now that the JLD format can handle DataFrame objects I would like to
>>>> > switch from storing data sets in .RData format to .jld format.
>>>> > Datasets stored in .RData format are compressed after they are
>>>> > written. The default compression is gzip. Bzip2 and xz compression
>>>> > are also available. The compression can make a substantial
>>>> > difference in the file size because the data values are often
>>>> > highly repetitive.
>>>> >
>>>> > JLD is different in scope in that .jld files can be queried using
>>>> > external programs like h5ls and the files can have new data added or
>>>> > existing data edited or removed. The .RData format is an archival
>>>> > format. Once the file is written it cannot be modified in place.
>>>> >
>>>> > Given these differences I can appreciate that JLD files are not
>>>> > compressed. Nevertheless I think it would be useful to adopt a
>>>> > convention in the JLD module for accessing data from files with a
>>>> > .jld.xz or .jld.7z extension. It could be as simple as uncompressing
>>>> > the files in a temporary directory, reading then removing, or it
>>>> > could be more sophisticated. I notice that my versions of
>>>> > libjulia.so on an Ubuntu 64-bit system are linked against both
>>>> > libz.so and liblzma.so
>>>> >
>>>> > $ ldd /usr/lib/x86_64-linux-gnu/julia/libjulia.so
>>>> >     linux-vdso.so.1 => (0x00007fff5214f000)
>>>> >     libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f62932ee000)
>>>> >     libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f62930d5000)
>>>> >     libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f6292dce000)
>>>> >     librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f6292bc6000)
>>>> >     libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f62929a8000)
>>>> >     libunwind.so.8 => /usr/lib/x86_64-linux-gnu/libunwind.so.8 (0x00007f629278c000)
>>>> >     libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f6292488000)
>>>> >     libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f6292272000)
>>>> >     libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f6291eab000)
>>>> >     /lib64/ld-linux-x86-64.so.2 (0x00007f62944b3000)
>>>> >     liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f6291c89000)
>>>> >
>>>> > AFAIK the user-level interface to gzip requires the GZip package.
>>>> > Unless I have missed something (always a possibility) there is no
>>>> > user-level interface to liblzma in Julia. If the library is going to
>>>> > be linked anyway, would it make sense to provide a user-level
>>>> > interface in Julia?
