Hello Anthony,

Thank you for replying with suggestions.

In response to your suggestions, I am *not reading the data from a file in
the first step, but from a database*. I did try your first suggestion,
table.append(list of tuples), which took slightly longer than the
execution time I got with the original code. Can you please guide me on how
to chunk the data (which I got from the database and stored as a list of
tuples in Python)?


Hi Shyam,

The pattern that you are using to write to a table is basically one for
writing Python data to HDF5.  However, your data is already in a machine /
HDF5 native format.  Thus what you are doing here is an excessive amount of
work:  read data from file -> convert to Python data structures -> convert
back to HDF5 data structures -> write to file.

When reading from a table you get back a numpy structured array (look them
up on the numpy website).

Then instead of using rows to write back the data, just use Table.append()
[1] which lets you pass in a bunch of rows simultaneously.  (Note that your
data in this case is too large to fit into memory, so you may have to split
it up into chunks or use the new iterators which are in the development
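
A minimal sketch of the chunking idea for rows that come from a database
rather than from another table. The helper names (iter_chunks,
append_in_chunks) and the RecordingTable stand-in are hypothetical and are
not part of PyTables; the only Table methods assumed are append() and
flush(). Each row is assumed to be a tuple whose fields follow the table's
column order, which can differ from the order in the description class
(check table.colnames).

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def append_in_chunks(table, rows, chunk_size=100000):
    """Append rows to table one chunk at a time and return the row count.

    `table` is assumed to be a PyTables Table, but any object that offers
    append() and flush() will work.
    """
    total = 0
    for chunk in iter_chunks(rows, chunk_size):
        table.append(chunk)  # one call per chunk instead of one per row
        total += len(chunk)
    table.flush()
    return total


class RecordingTable:
    """Hypothetical stand-in for a Table so the sketch runs without a file."""

    def __init__(self):
        self.chunk_sizes = []

    def append(self, rows):
        self.chunk_sizes.append(len(rows))

    def flush(self):
        pass


demo = RecordingTable()
written = append_in_chunks(demo, [("n", "a", "e", "p")] * 7, chunk_size=3)
print(written, demo.chunk_sizes)  # 7 [3, 3, 1]
```

Because iter_chunks accepts any iterable, it may also be possible to pass
the database cursor itself (if the driver supports iteration) instead of a
fully materialized list, so that only one chunk is held in memory at a time.
The chunk size of 100000 is an arbitrary starting point and is worth timing
against a few alternatives.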

Additionally, if all you are doing is copying a table wholesale, you should
use the Table.copy(). [2]  Or if you only want to copy some subset based on
a conditional you provide, use whereAppend() [3].

Finally, if you want to do math or evaluate expressions on one table to
create a new table, use the Expr class [4].

All of these will be waaaaay faster than what you are doing right now.

Be Well

4. http://pytables.github.io/usersguide/libref/expr_class.html

On Thu, Apr 11, 2013 at 12:23 PM, Shyam Parimal Katti <spk...@nyu.edu> wrote:

> Hello,
> I am writing a lot of data (close to 122 GB) to an HDF5 file using PyTables.
> The execution time for writing the query result to the file is close to 10
> hours, which includes querying the database and then writing to the file.
> When I timed the entire execution, I found that it takes as much time to
> get the data from the database as it takes to write to the HDF5 file. Here
> is a small snippet (P.S.: the execution time noted below is not for the
> 122 GB data, but for a small subset close to 10 GB):
> import tables as tb
> from datetime import datetime
>
> class ContactClass(tb.IsDescription):
>     name = tb.StringCol(4200)
>     address = tb.StringCol(4200)
>     emailAddr = tb.StringCol(180)
>     phone = tb.StringCol(256)
> h5File = tb.openFile(<file name>, mode="a", title="Contacts")
> t = h5File.createTable(h5File.root, 'ContactClass', ContactClass,
>                        filters=tb.Filters(5, 'blosc'), expectedrows=77806938)
> resultSet= get data from database
> currRow= t.row
> print("Before appending data: %s" % str(datetime.now()))
> for attribute in resultSet:
>     # each result row is a tuple: (name, address, emailAddr, phone)
>     currRow['name'] = attribute[0]
>     currRow['address'] = attribute[1]
>     currRow['emailAddr'] = attribute[2]
>     currRow['phone'] = attribute[3]
>     currRow.append()
> print("After done appending: %s" % str(datetime.now()))
> t.flush()
> print("After done flushing: %s" % str(datetime.now()))
> ... gives me:
> Before appending data: 2013-04-11 10:42:39.903713
> After done appending: 2013-04-11 11:04:10.002712
> After done flushing: 2013-04-11 11:05:50.059893
> It seems like append() takes a lot of time. Any suggestions on how to
> improve this?
> Thanks,
> Shyam
Pytables-users mailing list
