On Tuesday 09 November 2010 18:45:52 David E. Sallis wrote:
> Francesc, sorry this took so long, but I'm back.  I have upgraded to
> PyTables 2.2, HDF 1.8.5-patch1, Numpy 1.5.0, Numexpr 1.4.1, and
> Cython 0.13.  I'm still running Python 2.6.5 (actually Stackless
> Python) under Linux RedHat 5.
> 
> I have attached a script, 'bloat.py', which (at least on my systems)
> can reproduce the problem.  The script creates an HDF-5 file with a
> single Table containing 10000 rows of three columns.
> 
> Usage is 'python bloat.py create' to create the file, and 'python
> bloat.py update' to perform updates on the file after creating it. 
> After each run the script prints out the size of the file after the
> operation is complete.
> 
> I have a clue to impart to you, to assist in figuring out what's
> going on. While writing this test script I played around with
> turning compression on and off; that is, conducting a series of runs
> with a Filter defined for the file and Table, and a series of runs
> without the compression filter.   What I am seeing is that, with no
> compression filter defined, the file is significantly larger (which
> is to be expected), but there is no file size increase with
> subsequent updates.  When compression is used, the file size
> increases with each update operation.
[clip]

After having a look at your script, yes, I think this is the expected 
behaviour.  In order to explain this you need to know how HDF5 stores 
its data internally.  For chunked datasets (the Table object is an 
example of this), the I/O is done in terms of complete chunks.  Each 
chunk is then passed to the filters (if any) for compression (or other 
operations).
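
Just for reference, this is roughly how a chunked, compressed Table gets 
created in PyTables 2.2.  It is only a sketch: the column layout and node 
names below are guesses at what your bloat.py does, not a copy of it.

import tables

# Hypothetical record layout -- adjust to match bloat.py.
class Record(tables.IsDescription):
    key   = tables.Int32Col()
    value = tables.Float64Col()
    name  = tables.StringCol(16)

f = tables.openFile('bloat.h5', mode='w')
# Every chunk of this table is passed through the zlib filter on I/O.
filters = tables.Filters(complevel=1, complib='zlib')
table = f.createTable('/', 'data', Record, "test table",
                      filters=filters, expectedrows=10000)

row = table.row
for i in xrange(10000):
    row['key'] = i
    row['value'] = 0.0              # very regular data -> compresses well
    row['name'] = 'row-%d' % i
    row.append()
table.flush()
f.close()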

In this case, when you are creating the table and using compression, the 
chunks are compressed very well, and take very little space on disk.  
But, when you are *updating* the existing data, you are introducing more 
entropy and compression does not work as efficiently.  As a consequence, 
the resulting chunks are larger than the original ones on-disk, and 
hence they need to be saved somewhere else (normally at the end of the 
file).  HDF5 cannot presently remove (or reuse) the old chunks in an 
easy way, so it has to allocate new space for the resulting chunks.  The 
only way to reclaim the space taken by the 'old' chunks is to 'repack' 
the HDF5 file (as you have already noticed).
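
For illustration, an update pass plus the repack step could look like the 
sketch below (the 'value' column and the node name are the hypothetical 
ones from the previous snippet):

import random
import tables

f = tables.openFile('bloat.h5', mode='r+')
table = f.root.data

# Overwriting rows with less regular data makes the recompressed chunks
# bigger, so HDF5 appends them at the end of the file instead of reusing
# the space of the old ones.
for row in table.iterrows():
    row['value'] = random.random()
    row.update()
table.flush()
f.close()

# The space of the superseded chunks is only given back by repacking:
#
#   $ ptrepack bloat.h5 bloat-packed.h5
#   $ mv bloat-packed.h5 bloat.h5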

Of course, when you are not using compression, updates can be done in 
place using the 'old' chunks, as they take exactly the same space on-disk.  
This is why uncompressed tables do not show the 'strange' behaviour.
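
By the way, you can check from Python which filters (if any) are active 
on a table and what its chunk shape is, which tells you whether you are 
in the 'grows on update' case.  A small sketch, using the same 
hypothetical node name as above:

import tables

f = tables.openFile('bloat.h5', mode='r')
table = f.root.data
print 'filters:   ', table.filters      # shows complevel/complib, if any
print 'chunkshape:', table.chunkshape   # rows per chunk
f.close()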

Hope this helps,

-- 
Francesc Alted
