On Tuesday 09 November 2010 18:45:52, David E. Sallis wrote:
> Francesc, sorry this took so long, but I'm back. I have upgraded to
> PyTables 2.2, HDF5 1.8.5-patch1, NumPy 1.5.0, Numexpr 1.4.1, and
> Cython 0.13. I'm still running Python 2.6.5 (actually Stackless
> Python) under Red Hat Linux 5.
>
> I have attached a script, 'bloat.py', which (at least on my systems)
> can reproduce the problem. The script creates an HDF5 file with a
> single Table containing 10000 rows of three columns.
>
> Usage is 'python bloat.py create' to create the file, and 'python
> bloat.py update' to perform updates on the file after creating it.
> After each run the script prints out the size of the file after the
> operation is complete.
>
> I have a clue to impart to you, to assist in figuring out what's
> going on. While writing this test script I played around with
> turning compression on and off; that is, conducting a series of runs
> with a Filter defined for the file and Table, and a series of runs
> without the compression filter. What I am seeing is that, with no
> compression filter defined, the file is significantly larger (which
> is to be expected), but there is no file size increase with
> subsequent updates. When compression is used, the file size
> increases with each update operation.
[clip]
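The 'bloat.py' attachment itself is not reproduced in the archive. A minimal sketch of a script along the lines described above, written against the PyTables 2.2 camelCase API, might look like the following; the file name, column layout, and filter settings are illustrative assumptions, not the original attachment.

import os
import random
import sys

import tables

FILENAME = 'bloat.h5'                                   # assumed file name
NROWS = 10000
FILTERS = tables.Filters(complevel=5, complib='zlib')   # remove to test without compression


class Record(tables.IsDescription):
    # Three-column row layout, as in the description above (names assumed).
    key = tables.Int32Col()
    value = tables.Float64Col()
    label = tables.StringCol(16)


def create():
    # Build the file and fill the table with highly compressible data.
    fileh = tables.openFile(FILENAME, mode='w', filters=FILTERS)
    table = fileh.createTable('/', 'data', Record, 'test table')
    row = table.row
    for i in xrange(NROWS):
        row['key'] = i
        row['value'] = 0.0
        row['label'] = 'initial'
        row.append()
    table.flush()
    fileh.close()


def update():
    # Rewrite every row in place with higher-entropy values.
    fileh = tables.openFile(FILENAME, mode='a')
    table = fileh.root.data
    for row in table.iterrows():
        row['value'] = random.random()
        row.update()
    table.flush()
    fileh.close()


if __name__ == '__main__':
    {'create': create, 'update': update}[sys.argv[1]]()
    print "file size after '%s': %d bytes" % (sys.argv[1], os.path.getsize(FILENAME))

Running 'python bloat.py create' once and then 'python bloat.py update' repeatedly should show the file growing on each update when the Filters line is active, and staying the same size when it is removed.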
After having a look at your script: yes, I think this is the expected behaviour. To explain it, you need to know how HDF5 stores its data internally. For chunked datasets (the Table object is an example of this), I/O is done in terms of complete chunks, and each chunk is passed through the filters (if any) for compression or other operations. When you create the table with compression enabled, the chunks compress very well and take very little space on disk. But when you are *updating* the existing data, you are introducing more entropy, and compression no longer works as efficiently. As a consequence, the rewritten chunks are larger on disk than the original ones, so they have to be stored somewhere else (normally at the end of the file). HDF5 currently cannot easily remove (or reuse) the old chunks, and has to allocate new space for the rewritten ones. The only way to reclaim the space taken by the 'old' chunks is to 'repack' the HDF5 file (as you have already noticed).

Of course, when you are not using compression, an update can safely reuse the 'old' chunks, because the rewritten chunks take exactly the same space on disk. This is why uncompressed tables do not show the 'strange' behaviour.

Hope this helps,

--
Francesc Alted
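For completeness, a minimal sketch of the 'repack' step mentioned above, assuming the same hypothetical 'bloat.h5' file as in the earlier sketch. File.copyFile() rewrites every node into a fresh file, so only the live chunks are carried over and the space held by superseded chunks is left behind.

import os

import tables

# Copy the (bloated) file node by node into a fresh file; the abandoned
# chunks are not copied, so the new file shrinks back to its expected size.
fileh = tables.openFile('bloat.h5', mode='r')
fileh.copyFile('bloat-repacked.h5', overwrite=True)
fileh.close()

print 'before repack: %d bytes' % os.path.getsize('bloat.h5')
print 'after repack:  %d bytes' % os.path.getsize('bloat-repacked.h5')

The same result can be obtained from the command line with the ptrepack utility that ships with PyTables, or with HDF5's own h5repack.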