On Friday 15 May 2009 17:40:15, Francesc Alted wrote:
> On Friday 15 May 2009 15:40:16, David Fokkema wrote:
> > Hi list,
> >
> > I don't get this (using PyTables 2.1.1):
> >
> > In [1]: import tables
> >
> > In [2]: data = tables.openFile('data_new.h5', 'w')
> >
> > In [3]: data.createVLArray(data.root, 'nosee', tables.Int32Atom())
> > Out[3]:
> > /nosee (VLArray(0,)) ''
> >   atom = Int32Atom(shape=(), dflt=0)
> >   byteorder = 'little'
> >   nrows = 0
> >   flavor = 'numpy'
> >
> > In [4]: data.createVLArray(data.root, 'see', tables.Int32Atom(),
> >    ...: filters=tables.Filters(complevel=1))
> > Out[4]:
> > /see (VLArray(0,), shuffle, zlib(1)) ''
> >   atom = Int32Atom(shape=(), dflt=0)
> >   byteorder = 'little'
> >   nrows = 0
> >   flavor = 'numpy'
> >
> > In [5]: a = 1000000 * [200]
> >
> > In [6]: for i in range(50):
> >    ...:     data.root.see.append(a)
> >    ...:
> >
> > In [7]: data.flush()
> >
> > And looking at the file:
> >
> > 191M 2009-05-15 15:37 data_new.h5
> >
> > Also, writing to the uncompressed array adds another 191 MB to the file.
> > So, I really see no compression at all. I also tried zlib(9). Why are my
> > arrays not compressed? The repetitive values seem like a perfect
> > candidate for compression.
>
> Yes, I can reproduce this. Well, at least it seems that PyTables is
> setting the filters correctly. For the 'see' dataset, h5ls -v reports:
>
>   Chunks:    {2048} 32768 bytes
>   Storage:   800 logical bytes, 391 allocated bytes, 204.60% utilization
>   Filter-0:  shuffle-2 OPT {16}
>   Filter-1:  deflate-1 OPT {1}
>   Type:      variable length of native int
>
> which clearly demonstrates that the filters are correctly installed in the
> HDF5 pipeline :-\
>
> This definitely seems to be an HDF5 issue. To tell the truth, I've never
> seen good compression ratios on VLArrays (although I never thought that
> compression was completely absent!).
>
> I'll try to report this to the hdf-forum list and get back to you.
Done. George Lewandowski answered this:

"""
For VL data, the dataset itself contains a struct which points to the
actual data, which is stored elsewhere. When you apply compression, the
"pointer" structs are compressed, but the data itself is not affected.
"""

So we should expect gains only from compressing the pointer structure of
the variable-length dataset, not the data itself. This effect can be seen
when keeping smaller rows (on the order of tens of elements, not millions
as in your example). For instance, this:

In [57]: data = tb.openFile('/tmp/vldata2.h5', 'w')

In [58]: data.createVLArray(data.root, 'see', tb.Int32Atom(),
    ...: filters=tb.Filters(complevel=1))
Out[58]:
/see (VLArray(0,), shuffle, zlib(1)) ''
  atom = Int32Atom(shape=(), dflt=0)
  byteorder = 'little'
  nrows = 0
  flavor = 'numpy'

In [59]: d = tb.numpy.arange(10)

In [60]: for i in range(5000):
   ....:     data.root.see.append(d)
   ....:

In [63]: data.close()

creates a file of 301,104 bytes, while without compression the size grows
to 397,360 bytes. Here, the 'real' data takes only 200,000 bytes (5000 rows
of 10 int32 values each); it is the pointer structure that has been
reduced, from around 190,000 bytes to just 100,000 bytes, which is roughly
a 2x compression ratio.

Apparently there is no provision in HDF5 for compressing the actual data in
variable-length arrays. However, if this is a must for you, you can always
compress the data manually before writing it to disk and decompress it
after reading it back (see the sketch below).

Hope that helps,

--
Francesc Alted

"One would expect people to feel threatened by the 'giant brains or
machines that think'. In fact, the frightening computer becomes less
frightening if it is used only to simulate a familiar noncomputer."
-- Edsger W. Dijkstra
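The filter check Francesc runs above with h5ls -v can also be done without
leaving Python: every PyTables leaf exposes a 'filters' attribute. A minimal
sketch, assuming the PyTables 2.x API used in this thread and the
data_new.h5 file from the first example:

    import tables

    # Open the file from the first example and inspect the filters
    # attached to each dataset; this mirrors what h5ls -v reports.
    data = tables.openFile('data_new.h5', 'r')
    print data.root.see.filters    # complevel=1, complib='zlib', shuffle on
    print data.root.nosee.filters  # complevel=0: no compression requested
    data.close()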
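And here is a minimal sketch of the manual-compression workaround Francesc
suggests, again assuming the PyTables 2.x API; the file name, the 'packed'
node, and the UInt8Atom layout are illustrative choices, not from the
original thread. The payload is zlib-compressed before being appended to a
VLArray of bytes, and decompressed after reading:

    import zlib
    import numpy
    import tables

    # Store zlib-compressed byte strings in a VLArray of uint8; the HDF5
    # filter pipeline never touches the variable-length data itself, so
    # the compression has to happen before the append.
    data = tables.openFile('/tmp/vldata_manual.h5', 'w')
    packed = data.createVLArray(data.root, 'packed', tables.UInt8Atom())

    a = numpy.array(1000000 * [200], dtype=numpy.int32)
    compressed = zlib.compress(a.tostring(), 1)   # compress before writing
    packed.append(numpy.fromstring(compressed, dtype=numpy.uint8))

    # Reading back: decompress and restore the original dtype.
    raw = packed[0].tostring()
    restored = numpy.fromstring(zlib.decompress(raw), dtype=numpy.int32)
    assert (restored == a).all()

    data.close()

With highly repetitive rows like the 1000000 * [200] example above, each
4 MB row should shrink to a few kilobytes on disk.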