On Fri, 2009-05-15 at 19:47 +0200, Francesc Alted wrote:
> On Friday 15 May 2009 17:40:15, Francesc Alted wrote:
> > On Friday 15 May 2009 15:40:16, David Fokkema wrote:
> > > Hi list,
> > >
> > > I don't get this (using PyTables 2.1.1):
> > >
> > > In [1]: import tables
> > >
> > > In [2]: data = tables.openFile('data_new.h5', 'w')
> > >
> > > In [3]: data.createVLArray(data.root, 'nosee', tables.Int32Atom())
> > > Out[3]:
> > > /nosee (VLArray(0,)) ''
> > >   atom = Int32Atom(shape=(), dflt=0)
> > >   byteorder = 'little'
> > >   nrows = 0
> > >   flavor = 'numpy'
> > >
> > > In [4]: data.createVLArray(data.root, 'see', tables.Int32Atom(),
> > >                            filters=tables.Filters(complevel=1))
> > > Out[4]:
> > > /see (VLArray(0,), shuffle, zlib(1)) ''
> > >   atom = Int32Atom(shape=(), dflt=0)
> > >   byteorder = 'little'
> > >   nrows = 0
> > >   flavor = 'numpy'
> > >
> > > In [5]: a = 1000000 * [200]
> > >
> > > In [6]: for i in range(50):
> > >    ...:     data.root.see.append(a)
> > >    ...:
> > >
> > > In [7]: data.flush()
> > >
> > > And looking at the file:
> > >
> > > 191M 2009-05-15 15:37 data_new.h5
> > >
> > > Writing the same data to the uncompressed array also adds another
> > > 191 MB to the file, so I really see no compression at all. I also
> > > tried zlib(9). Why are my arrays not compressed? The repetitive
> > > values seem like a perfect candidate for compression.
> >
> > Yes, I can reproduce this. Well, at least it seems that PyTables is
> > setting the filters correctly. For the 'see' dataset, h5ls -v reports:
> >
> >     Chunks:    {2048} 32768 bytes
> >     Storage:   800 logical bytes, 391 allocated bytes, 204.60% utilization
> >     Filter-0:  shuffle-2 OPT {16}
> >     Filter-1:  deflate-1 OPT {1}
> >     Type:      variable length of
> >                    native int
> >
> > which clearly demonstrates that the filters are correctly installed in
> > the HDF5 pipeline :-\
> >
> > This definitely seems to be an HDF5 issue. To tell the truth, I've
> > never seen good compression rates for VLArrays (although I never
> > thought that compression was completely nonexistent!).
> >
> > I'll try to report this to the hdf-forum list and get back to you.
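(For the archive: the same check can be made from PyTables itself via a
leaf's .filters attribute, without dropping to h5ls. A minimal sketch,
assuming the data_new.h5 file from the session above; this snippet is not
part of the original exchange:)

import tables

# Reopen the file and inspect the filters PyTables recorded for each node.
data = tables.openFile('data_new.h5', 'r')
print data.root.see.filters    # something like Filters(complevel=1, complib='zlib', shuffle=True, ...)
print data.root.nosee.filters  # no filters: complevel=0
data.close()

This agrees with the h5ls -v output above: the filter pipeline really is
attached, so the missing compression is not a PyTables configuration problem.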
Wow, thanks!

> Done. So, George Lewandowski answered this:
>
> """
> For VL data, the dataset itself contains a struct which points to the
> actual data, which is stored elsewhere. When you apply compression, the
> "pointer" structs are compressed, but the data itself is not affected.
> """
>
> So, we should expect gains only from compressing the pointer structure of
> the variable-length dataset, and not the data itself. This effect can be
> seen when keeping smaller rows (on the order of tens of elements, not
> millions, as in your example). For instance, this:
>
> In [57]: data = tb.openFile('/tmp/vldata2.h5', 'w')
>
> In [58]: data.createVLArray(data.root, 'see', tb.Int32Atom(),
>                             filters=tb.Filters(complevel=1))
> Out[58]:
> /see (VLArray(0,), shuffle, zlib(1)) ''
>   atom = Int32Atom(shape=(), dflt=0)
>   byteorder = 'little'
>   nrows = 0
>   flavor = 'numpy'
>
> In [59]: d = tb.numpy.arange(10)
>
> In [60]: for i in range(5000):
>    ....:     data.root.see.append(d)
>    ....:
>
> In [63]: data.close()
>
> creates a file of 301,104 bytes, while without compression the size grows
> to 397,360 bytes. Here, the 'real' data takes only 200,000 bytes; it is
> the pointer structure that has been reduced, from around 190,000 bytes to
> just 100,000 bytes, which is a 2x compression rate (more or less).
>
> Apparently there is no provision in HDF5 for compressing the actual data
> in variable-length arrays. However, if this is a must for you, you can
> always compress the data manually before writing it to disk and
> decompress it after reading it back.

Hmmm... that's a shame. Is there really no provision for it, or is it just
hard to set up? I'll have to think this over, then. I do need compression,
because I'm basically storing parts of a terabyte dataset on my Eee PC,
which I'm very happy with because of its weight and easy-to-travel-with
design, but which is a bit underpowered for real-world data analysis. That
may rule out compression because of CPU cycles, now that I think about
it, :-/

Well, I'll try compressing, serializing and storing as a string.

Thanks,
David
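(A minimal sketch of the manual route Francesc suggests, for anyone finding
this thread later. It assumes zlib and stores the compressed bytes in a
VLArray of UInt8Atom; the file and node names are illustrative, not from
the thread:)

import zlib
import numpy
import tables

# Since HDF5's filters never reach the VL data itself (see above), we do
# the compression ourselves and store opaque bytes per row.
data = tables.openFile('/tmp/vlpacked.h5', 'w')
vl = data.createVLArray(data.root, 'packed', tables.UInt8Atom())

a = numpy.array(1000000 * [200], dtype=numpy.int32)
packed = numpy.fromstring(zlib.compress(a.tostring(), 1), dtype=numpy.uint8)
vl.append(packed)   # a few KB on disk instead of ~4 MB for this repetitive row

# Reading back: decompress and reapply the original dtype.
row = vl[0]
restored = numpy.fromstring(zlib.decompress(row.tostring()), dtype=numpy.int32)
assert (restored == a).all()
data.close()

The trade-off David mentions is real: every append and read now spends CPU
cycles in zlib, and since the stored atom is just bytes, the original dtype
has to be reapplied at read time (the fromstring call above).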