On Fri, 2009-05-15 at 19:47 +0200, Francesc Alted wrote:
> On Friday 15 May 2009 17:40:15, Francesc Alted wrote:
> > On Friday 15 May 2009 15:40:16, David Fokkema wrote:
> > > Hi list,
> > >
> > > I don't get this (using pytables 2.1.1):
> > >
> > > In [1]: import tables
> > >
> > > In [2]: data = tables.openFile('data_new.h5', 'w')
> > >
> > > In [3]: data.createVLArray(data.root, 'nosee', tables.Int32Atom())
> > > Out[3]:
> > > /nosee (VLArray(0,)) ''
> > >   atom = Int32Atom(shape=(), dflt=0)
> > >   byteorder = 'little'
> > >   nrows = 0
> > >   flavor = 'numpy'
> > >
> > > In [4]: data.createVLArray(data.root, 'see', tables.Int32Atom(),
> > > filters=tables.Filters(complevel=1))
> > > Out[4]:
> > > /see (VLArray(0,), shuffle, zlib(1)) ''
> > >   atom = Int32Atom(shape=(), dflt=0)
> > >   byteorder = 'little'
> > >   nrows = 0
> > >   flavor = 'numpy'
> > >
> > > In [5]: a = 1000000 * [200]
> > >
> > > In [6]: for i in range(50):
> > >    ...:     data.root.see.append(a)
> > >    ...:
> > >    ...:
> > >
> > > In [7]: data.flush()
> > >
> > > And looking at the file:
> > >
> > > 191M 2009-05-15 15:37 data_new.h5
> > >
> > > Writing to the uncompressed array also adds another 191 MB to the file,
> > > so I really see no compression at all. I also tried zlib(9). Why are my
> > > arrays not compressed? The repetitive values seem like a perfect
> > > candidate for compression.
> >
> > Yes, I can reproduce this.  Well, at least it seems that PyTables is
> > setting the filters correctly.  For the 'see' dataset h5ls -v is reporting:
> >
> >     Chunks:    {2048} 32768 bytes
> >     Storage:   800 logical bytes, 391 allocated bytes, 204.60% utilization
> >     Filter-0:  shuffle-2 OPT {16}
> >     Filter-1:  deflate-1 OPT {1}
> >     Type:      variable length of
> >                    native int
> >
> > which clearly demonstrates that the filters are correctly installed in the
> > HDF5 pipeline :-\
> >
> > This definitely seems to be an HDF5 issue.  To tell the truth, I've never
> > seen good compression ratios with VLArrays (although I never thought that
> > compression was completely nonexistent!).
> >
> > I'll try to report this to the hdf-forum list and get back to you.

Wow, thanks!

> 
> Done.  So, George Lewandowski answered this:
> 
> """
> For VL data, the dataset itself contains a struct which points to the actual 
> data, which is stored elsewhere.  When you apply compression, the "pointer" 
> structs are compressed, but the data itself is not affected.   
> """
> 
> So, we should expect gains only from compressing the pointer structure of the
> variable length dataset, not the data itself.  This effect can be seen when
> the rows are small (on the order of tens of elements, not millions, as in
> your example).  For instance, this:
> 
> In [57]: data = tb.openFile('/tmp/vldata2.h5', 'w')
> 
> In [58]: data.createVLArray(data.root, 'see', tb.Int32Atom(), 
> filters=tb.Filters(complevel=1))
> Out[58]:
> /see (VLArray(0,), shuffle, zlib(1)) ''                         
>   atom = Int32Atom(shape=(), dflt=0)                            
>   byteorder = 'little'                                          
>   nrows = 0                                                     
>   flavor = 'numpy'                                              
> 
> In [59]: d = tb.numpy.arange(10)
> 
> In [60]: for i in range(5000):
>    ....:     data.root.see.append(d)
>    ....:
> 
> In [63]: data.close()
> 
> creates a file of 301,104 bytes, while without compression the size grows to
> 397,360 bytes.  Here, the 'real' data only takes 200,000 bytes; it is the
> pointer structure that has been reduced from around 190,000 bytes to just
> 100,000 bytes, which is roughly a 2x compression ratio.
> 
> Apparently there is no provision in HDF5 for compressing the actual data in
> variable length arrays.  However, if this is a must for you, you can always
> compress the data manually before writing it to disk and decompress it after
> reading it back.
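
If I understand you correctly, that would look something like this (a
completely untested sketch; the file and node names are just placeholders,
and stuffing the zlib-compressed bytes into a UInt8Atom VLArray is only one
way to do it):

import zlib

import numpy
import tables

fileh = tables.openFile('/tmp/vldata_manual.h5', 'w')  # placeholder name

# The compressed payload is just a stream of bytes, so a plain UInt8 atom
# (no HDF5 filters needed) is enough to hold it.
cdata = fileh.createVLArray(fileh.root, 'cdata', tables.UInt8Atom())

a = numpy.array(1000000 * [200], dtype=numpy.int32)

# Compress the raw bytes ourselves before handing them to the VLArray...
packed = zlib.compress(a.tostring(), 1)
cdata.append(numpy.fromstring(packed, dtype=numpy.uint8))

# ... and undo both steps after reading the row back.
raw = zlib.decompress(cdata[0].tostring())
b = numpy.fromstring(raw, dtype=numpy.int32)

fileh.close()

For data as repetitive as my example above, each compressed row should end up
a tiny fraction of the 4 MB of raw bytes, if I'm not mistaken.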

Hmmm... that's a shame. Is there really no provision for it, or is it
just hard to set up? I'll have to think this over, then. I do need
compression, because I'm basically storing parts of a terabyte dataset
on my Eee PC, which I'm very happy with because of its weight and
easy-to-travel-with design, but which is a bit underpowered for
real-world data analysis. Now that I think about it, that may rule out
compression because of the CPU cycles, :-/ Well, I'll try compressing,
serializing and storing the data as a string.
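
Something along these lines, maybe (again untested, and I'm assuming a
VLStringAtom row is happy to hold an arbitrary binary string; the names are
made up):

import zlib
import cPickle

import tables

fileh = tables.openFile('/tmp/vldata_pickled.h5', 'w')  # placeholder name

# Each row holds one pickled-then-compressed Python object as a byte string.
blobs = fileh.createVLArray(fileh.root, 'blobs', tables.VLStringAtom())

event = {'run': 1, 'adc': 1000 * [200]}  # made-up stand-in for my real data
blobs.append(zlib.compress(cPickle.dumps(event, 2)))

# Reading it back is the same dance in reverse.
event2 = cPickle.loads(zlib.decompress(blobs[0]))

fileh.close()

That way zlib gets to see the real data instead of just the HDF5 pointer
structs.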

Thanks,

David

