David E. Sallis said the following on 9/23/2010 8:36 AM:
> Francesc Alted said the following on 9/23/2010 2:39 AM:
>> On Wednesday 22 September 2010 22:04:53, David E. Sallis wrote:
>>> I have a table in an HDF5 file consisting of 9 columns and just over
>>> 6000 rows, and an application which performs updates on these table
>>> rows. The application runs hourly and performs updates to the table
>>> during each run. No new table rows are added during a run. I
>>> perform updates to the table by using row.update() inside a
>>> table.where() iterator loop.
>>>
>>> I have noticed that after each application run the size of the file
>>> increases significantly, and over time the file size balloons from
>>> just over 21 MB to well over 750 MB, with no new data being added,
>>> just updated.
>>>
>>> h5repack() run on this file will restore it to its original size with
>>> no loss of data.
>>>
>>> My questions are:
>>>
>>> 1) What causes the file size increase and
>>> 2) is there anything I can do to prevent it?
>>>
>>> I am using PyTables 2.1.1, HDF5 1.8.3, Python 2.6 under Linux RedHat
>>> 5.
>>
>> Hmm, sounds like a bug somewhere. Could you try with a recent version
>> of HDF5? If that does not help, could you please send me a small script
>> reproducing that?
>>
> Francesc, thanks. I'll upgrade to HDF 1.8.5 and PyTables 2.2 and get back to
> you.
Francesc, sorry this took so long, but I'm back. I have upgraded to PyTables
2.2, HDF5 1.8.5-patch1, NumPy 1.5.0, Numexpr 1.4.1, and Cython 0.13. I'm still
running Python 2.6.5 (actually Stackless Python) under Linux RedHat 5.
I have attached a script, 'bloat.py', which (at least on my systems) reproduces
the problem. The script creates an HDF5 file with a single Table containing
10,000 rows and three columns.
Usage is 'python bloat.py create' to create the file, and 'python bloat.py
update' to perform updates on it. After each operation the script prints out
the resulting size of the file.
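For reference, bloat.py drives its updates with an iterrows() loop, whereas the
application described above uses row.update() inside a table.where() query. That
pattern looks roughly like the sketch below (the '< 100' condition is purely
illustrative):

import tables

fp = tables.openFile('bloat.hdf', 'a')
tbl = fp.getNode('/Index')
# Update matching rows in place via a where() query iterator.
for row in tbl.where('COL1 < 100'):
    row['COL3'] = 'WHERE_UPDATE_%d' % row['COL1']
    row.update()
tbl.flush()
fp.close()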
I have a clue to impart to you, to assist in figuring out what's going on.
While writing this test script I played around with turning compression on and
off; that is, conducting a series of runs with a compression Filters instance
defined for the file and Table, and a series of runs without one. What I am seeing
is that, with no compression filter defined, the file is significantly larger
(which is to be expected), but there is no file size increase with subsequent
updates. When compression is used, the file size increases with each update
operation.
For example, here is a transcript of a session with compression enabled:
$ python bloat.py create
bloat.hdf initial size is 14206 bytes.
$ python bloat.py update
bloat.hdf size after update is 14881 bytes.
$ python bloat.py update
bloat.hdf size after update is 15508 bytes.
$ python bloat.py update
bloat.hdf size after update is 16328 bytes.
$ python bloat.py update
bloat.hdf size after update is 17000 bytes.
$ python bloat.py update
bloat.hdf size after update is 19766 bytes.
$ python bloat.py update
bloat.hdf size after update is 21550 bytes.
$ python bloat.py update
bloat.hdf size after update is 23300 bytes.
$ python bloat.py update
bloat.hdf size after update is 23959 bytes.
$ python bloat.py update
bloat.hdf size after update is 26094 bytes.
$ python bloat.py update
bloat.hdf size after update is 27426 bytes.
$ python bloat.py update
bloat.hdf size after update is 29170 bytes.
With NO compression:
$ python bloat.py create
bloat.hdf initial size is 920944 bytes.
$ python bloat.py update
bloat.hdf size after update is 920944 bytes.
$ python bloat.py update
bloat.hdf size after update is 920944 bytes.
$ python bloat.py update
bloat.hdf size after update is 920944 bytes.
$ python bloat.py update
bloat.hdf size after update is 920944 bytes.
$ python bloat.py update
bloat.hdf size after update is 920944 bytes.
$ python bloat.py update
bloat.hdf size after update is 920944 bytes.
$ python bloat.py update
bloat.hdf size after update is 920944 bytes.
$ python bloat.py update
bloat.hdf size after update is 920944 bytes.
$ python bloat.py update
bloat.hdf size after update is 920944 bytes.
$ python bloat.py update
bloat.hdf size after update is 920944 bytes.
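For reference, the h5repack run mentioned earlier, which brings the file back down
to its original size with no loss of data, is simply (destination filename
arbitrary):
$ h5repack bloat.hdf bloat_repacked.hdf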
As always, thank you for your kind attention to this matter.
--David
--
David E. Sallis, Senior Principal Engineer, Software
General Dynamics Information Technology
NOAA Coastal Data Development Center
Stennis Space Center, Mississippi
228.688.3805
david.sal...@gdit.com
david.sal...@noaa.gov
--------------------------------------------
"Better Living Through Software Engineering"
--------------------------------------------
#------------------------------------------------------------------------------------------
#
# bloat.py
#
# Description:  Python script to demonstrate HDF file bloat when updating rows in a table.
# Usage: bloat.py create -- Creates and initializes an HDF-5 file.
# bloat.py update -- Performs updates on the HDF-5 file.
#
#                After each run the script will print out the size of the file in bytes.
#
# The main program drives the addTableRow() method (during file creation) and
# the updateTableRow() method (during file updates).
#
#------------------------------------------------------------------------------------------
import os, sys, tables
from tables import *
class TableDescriptor(IsDescription):
    COL1 = Int32Col()
    COL2 = Float32Col()
    COL3 = StringCol(80)
#------------------------------------------------------------------------------------------
def setRowValues(row, info):
    """
    Description:  Set values in a Row.
    Input:        info: <Dictionary> The data to be inserted. Keys correspond
                        to Row column names.
                  row:  <PyTables.Row> A PyTables Row object.
    Output:       None.
    Return Value: <boolean> True if successful, False if not.
    """
    retval = False
    for k in info:
        #
        # Set row values individually.  Invalid column names will throw
        # KeyErrors, which will just be ignored.  As long as at least one
        # data item was successfully added, this method will return True.
        #
        try:
            row[k] = info[k]
            retval = True
        except KeyError:
            pass
    return retval
#------------------------------------------------------------------------------------------
def updateTableRow(row, info):
    """
    Description:  Update a row in a PyTables Table.
    Input:        info: <Dictionary> The data to be updated. Keys correspond
                        to Row column names.
                  row:  <PyTables.Row> A PyTables Row object.
    Output:       None.
    Return Value: <boolean> True if successful, False if not.
    """
    retval = False
    if setRowValues(row, info):
        row.update()
        retval = True
    return retval
#------------------------------------------------------------------------------------------
def addTableRow(table, info):
    """
    Description:  Add a row to a PyTables Table.
    Input:        info:  <Dictionary> The data to be inserted. Keys correspond
                         to Row column names.
                  table: <PyTables.Table> A PyTables Table object.
    Output:       None.
    Return Value: <boolean> True if successful, False if not.
    """
    retval = False
    row = table.row
    if setRowValues(row, info):
        row.append()
        retval = True
    return retval
#------------------------------------------------------------------------------------------
if __name__ == '__main__':
    if len(sys.argv) < 2:
        print 'Usage: %s <create|update>' % sys.argv[0]
    else:
        action = sys.argv[1].lower()
        compressor = tables.Filters(complevel=5, complib='zlib')
        #
        # UNCOMMENT THE LINE BELOW TO DISABLE THE COMPRESSION FILTER.
        #
        #compressor = None
        #
        # Create the HDF file and populate with data.
        #
        if action == 'create':
            fp = tables.openFile('bloat.hdf', 'w', filters=compressor)
            tbl = fp.createTable('/',
                                 'Index',
                                 TableDescriptor,
                                 expectedrows=10000,
                                 filters=compressor)
            for i in xrange(10000):
                info = {'COL1': i,
                        'COL2': float(i),
                        'COL3': 'CREATE_%d' % i}
                addTableRow(tbl, info)
            #
            # Flush & close.
            #
            tbl.flush()
            fp.flush()
            fp.close()
            #
            # Print out file size.
            #
            s = os.stat('bloat.hdf')
            print 'bloat.hdf initial size is %d bytes.' % s.st_size
        #
        # Open the HDF file, iterate through the table rows and perform updates.
        #
        elif action == 'update':
            fp = tables.openFile('bloat.hdf', 'a')
            tbl = fp.getNode('/Index')
            #
            # Iterate through table rows, modifying the data slightly.
            #
            for row in tbl.iterrows():
                info = {'COL1': row['COL1'] + 2,
                        'COL2': row['COL2'] + 2.0,
                        'COL3': 'UPDATE_%d' % (row['COL1'] + 2)}
                updateTableRow(row, info)
            #
            # Flush & close.
            #
            tbl.flush()
            fp.flush()
            fp.close()
            #
            # Print out file size.
            #
            s = os.stat('bloat.hdf')
            print 'bloat.hdf size after update is %d bytes.' % s.st_size
        else:
            print 'Unknown action %s' % `action`