Hi Shyam,
The pattern that you are using to write to a table is basically the one for
writing Python data to HDF5. However, your data is already in a machine /
HDF5-native format, so what you are doing here is an excessive amount of
work: read data from file -> convert to Python data structures -> convert
back to HDF5 data structures -> write to file.
When reading from a table you get back a numpy structured array (look them
up on the numpy website).
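For example, something like this (the file and node names are just
placeholders matching your snippet, using the same 2.x-style API):

import tables as tb

h5f = tb.openFile("contacts.h5", mode="r")
t = h5f.root.ContactClass

# Table.read() hands back a numpy structured array directly --
# no per-row Python objects involved.
arr = t.read(start=0, stop=100000)
print(arr.dtype)              # the table description as a numpy dtype
print(arr['emailAddr'][:5])   # column access on the structured array

h5f.close()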
Then, instead of using rows to write back the data, just use Table.append()
[1], which lets you pass in a bunch of rows simultaneously. (Note: your
data in this case is too large to fit into memory, so you may have to split
it up into chunks or use the new iterators which are in the development
branch.)
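A rough sketch of that chunked read/append pattern (the chunk size and the
file names are made up, and I am assuming you can pass the source table's
description straight to createTable, so treat this as a starting point):

import tables as tb

src_file = tb.openFile("contacts.h5", mode="r")
dst_file = tb.openFile("contacts_copy.h5", mode="w")
src = src_file.root.ContactClass
dst = dst_file.createTable(dst_file.root, 'ContactClass', src.description,
                           filters=tb.Filters(5, 'blosc'),
                           expectedrows=src.nrows)

chunk = 1000000   # tune to whatever fits comfortably in memory
for start in range(0, src.nrows, chunk):
    # read() returns a structured array; append() accepts it as-is
    dst.append(src.read(start=start, stop=start + chunk))

dst.flush()
src_file.close()
dst_file.close()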
Additionally, if all you are doing is copying a table wholesale, you should
use Table.copy() [2]. Or, if you only want to copy some subset based on a
conditional you provide, use whereAppend() [3].
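For instance (the node names and the condition string here are only
examples, not anything from your schema):

import tables as tb

h5f = tb.openFile("contacts.h5", mode="a")
src = h5f.root.ContactClass

# (a) wholesale copy in one call
src.copy(h5f.root, 'ContactClassCopy', filters=tb.Filters(5, 'blosc'))

# (b) copy only the rows matching a condition into another table
#     (again assuming the source description can be reused directly)
dst = h5f.createTable(h5f.root, 'FilteredContacts', src.description,
                      filters=tb.Filters(5, 'blosc'))
src.whereAppend(dst, 'emailAddr == "someone@example.com"')

h5f.flush()
h5f.close()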
Finally, if you want to do math or evaluate expressions on one table to
create a new table, use the Expr class [4].
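Since your table only holds strings there is nothing to compute on it, so
the numeric columns (x, y) and the output array below are purely
hypothetical -- this is just to show the shape of the API:

import tables as tb

h5f = tb.openFile("numbers.h5", mode="a")
src = h5f.root.Measurements    # assumed table with Float64 columns x and y
out = h5f.createCArray(h5f.root, 'result', tb.Float64Atom(), (src.nrows,))

# The expression is evaluated chunk by chunk via numexpr, so the full
# columns never have to fit in memory at once.
expr = tb.Expr('2 * x + y', uservars={'x': src.cols.x, 'y': src.cols.y})
expr.setOutput(out)
expr.eval()

h5f.close()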
All of these will be waaaaay faster than what you are doing right now.
Be Well
Anthony
1. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.append
2. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.copy
3. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.whereAppend
4. http://pytables.github.io/usersguide/libref/expr_class.html
On Thu, Apr 11, 2013 at 11:23 AM, Shyam Parimal Katti <spk...@nyu.edu> wrote:
> Hello,
>
> I am writing a lot of data (close to 122 GB) to an HDF5 file using PyTables.
> The execution time for writing the query result to the file is close to 10
> hours, which includes querying the database and then writing to the file.
> When I timed the entire execution, I found that it takes as much time to
> get the data from the database as it takes to write to the HDF5 file. Here
> is a small snippet (P.S.: the execution times noted below are not for the
> 122 GB of data, but for a small subset close to 10 GB):
>
> import tables as tb
>
> class ContactClass(tb.IsDescription):
>     name = tb.StringCol(4200)
>     address = tb.StringCol(4200)
>     emailAddr = tb.StringCol(180)
>     phone = tb.StringCol(256)
>
> h5File = tb.openFile(<file name>, mode="a", title="Contacts")
> t = h5File.createTable(h5File.root, 'ContactClass', ContactClass,
>                        filters=tb.Filters(5, 'blosc'), expectedrows=77806938)
>
> resultSet = get data from database
> currRow = t.row
> print("Before appending data: %s" % str(datetime.now()))
> for (attributes ...) in resultSet:
>     currRow['name'] = attributes[0]
>     currRow['address'] = attributes[1]
>     currRow['emailAddr'] = attributes[2]
>     currRow['phone'] = attributes[3]
>     currRow.append()
> print("After done appending: %s" % str(datetime.now()))
> t.flush()
> print("After done flushing: %s" % str(datetime.now()))
>
> ... which gives me:
>
> Before appending data: 2013-04-11 10:42:39.903713
> After done appending: 2013-04-11 11:04:10.002712
> After done flushing: 2013-04-11 11:05:50.059893
>
> It seems like append() takes a lot of time. Any suggestions on how to
> improve this?
>
> Thanks,
> Shyam
>
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users