Hello Anthony,

Thank you for your suggestions. When I mentioned that I am reading the
data from a database, I meant a DB2 database, not an HDF5 database/file.


I followed your suggestions, and the code now looks as follows:


def createHDF5File():
    h5File = tables.openFile(<file name>, mode="a")
    table = h5File.createTable(h5File.root, "Contact", Contact,
                               "Contact", expectedrows=7000000)
    .....


def getDataAndWriteHDF5(table):
    databaseConn = pyodbc.connect(<connection string>, <password>)
    cursor = databaseConn.cursor()
    cursor.execute("SQL Query")
    resultSet = np.fromiter((tuple(row) for row in cursor),
                            dtype=[('name', np.str_, 180),
                                   ('address', np.str_, 4200),
                                   ('email', np.str_, 180),
                                   ('phone', np.str_, 256)])
    table.append(resultSet)



Error message: MemoryError: cannot allocate array memory.


I am setting the `expectedrows` parameter when creating the table in the
HDF5 file, and yet I encounter the error above.
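
One direction I am considering, so that the whole result set is never
materialized at once, is to fetch the rows in batches with
cursor.fetchmany() and append each batch separately. A rough, untested
sketch (the batch size is just a guess):

def getDataAndWriteHDF5(table, cursor, batchSize=100000):
    cursor.execute("SQL Query")
    dtype = [('name', np.str_, 180), ('address', np.str_, 4200),
             ('email', np.str_, 180), ('phone', np.str_, 256)]
    while True:
        rows = cursor.fetchmany(batchSize)
        if not rows:
            break
        # only batchSize rows are ever held in memory at once
        batch = np.fromiter((tuple(r) for r in rows), dtype=dtype,
                            count=len(rows))
        table.append(batch)
    table.flush()

Would this be a reasonable way to chunk it? Looking forward to suggestions.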




> Hello Anthony,
>
> Thank you for replying back with suggestions.
>
> In response to your suggestions, I am *not reading the data from a file
> in the first step, but from a database*.
>

Hello Shyam,

Not to put too fine a point on it, but HDF5 databases are files.  And
reading from any kind of file incurs the same disk read overhead.


>  I did try out your 1st suggestion of doing a table.append(list of
> tuples), which took a little longer than the execution time I got with the
> original code. Can you please guide me on how to chunk the data (that I got
> from the database and stored as a list of tuples in Python)?
>

Ahh, so you should not be using a list of tuples.  These are Pythonic types,
and the conversion between HDF5 types and Python types is what is slowing you
down.  You should be passing a numpy structured array into append().  Numpy
types are very similar to (and often exactly the same as) HDF5 types.  For
large, continuous, structured data you want to avoid the Python interpreter
as much as possible.  Use Python here as glue code to compose a series
of fast operations using the APIs exposed by numpy, pytables, etc.
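
For example, something along these lines (a minimal sketch; the dtype,
sizes, and values here are just illustrative):

import numpy as np

# build one structured array for a whole batch of rows
arr = np.empty(10000, dtype=[('name', 'S180'), ('phone', 'S256')])
arr['name'] = 'some name'
arr['phone'] = '555-0100'

# hand the whole batch to PyTables in one shot, with no per-row Python loop
table.append(arr)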

Be Well
Anthony



On Thu, Apr 11, 2013 at 6:16 PM, Shyam Parimal Katti <spk...@nyu.edu> wrote:

> Hello Anthony,
>
> Thank you for replying back with suggestions.
>
> In response to your suggestions, I am *not reading the data from a file
> in the first step, but from a database*. I did try out your 1st
> suggestion of doing a table.append(list of tuples), which took a little
> longer than the execution time I got with the original code. Can you please
> guide me on how to chunk the data (that I got from the database and stored
> as a list of tuples in Python)?
>
>
> Thanks,
> Shyam
>
> Hi Shyam,
>
> The pattern that you are using to write to a table is basically one for
> writing Python data to HDF5.  However, your data is already in a machine /
> HDF5 native format.  Thus what you are doing here is an excessive amount of
> work:  read data from file -> convert to Python data structures -> convert
> back to HDF5 data structures -> write to file.
>
> When reading from a table you get back a numpy structured array (look them
> up on the numpy website).
>
> Then instead of using rows to write back the data, just use Table.append()
> [1], which lets you pass in a bunch of rows simultaneously.  (Note that your
> data in this case is too large to fit into memory, so you may have to split
> it up into chunks or use the new iterators which are in the development
> branch.)
>
> Additionally, if all you are doing is copying a table wholesale, you should
> use Table.copy() [2].  Or if you only want to copy some subset based on
> a condition you provide, use whereAppend() [3].
>
> Finally, if you want to do math or evaluate expressions on one table to
> create a new table, use the Expr class [4].
>
> All of these will be waaaaay faster than what you are doing right now.
>
> Be Well
> Anthony
>
> 1. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.append
> 2. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.copy
> 3. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.whereAppend
> 4. http://pytables.github.io/usersguide/libref/expr_class.html
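>
> For instance (a quick sketch; the table names and the condition are just
> illustrative):
>
> # wholesale copy of a table under a new name
> newTable = t.copy(h5File.root, 'ContactCopy')
>
> # or append only the rows that match a condition to another table
> t.whereAppend(newTable, 'phone == "555-0100"')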
>
>
>
>
> On Thu, Apr 11, 2013 at 12:23 PM, Shyam Parimal Katti <spk...@nyu.edu> wrote:
>
>> Hello,
>>
>> I am writing a lot of data (close to 122 GB) to an HDF5 file using
>> PyTables. The execution time for writing the query result to the file is
>> close to 10 hours, which includes querying the database and then writing to
>> the file. When I timed the entire execution, I found that it takes as much
>> time to get the data from the database as it takes to write to the HDF5
>> file. Here is a small snippet (P.S.: the execution time noted below is not
>> for the 122 GB of data, but for a small subset close to 10 GB):
>>
>> class ContactClass(tb.IsDescription):
>>     name= tb.StringCol(4200)
>>     address= tb.StringCol(4200)
>>     emailAddr= tb.StringCol(180)
>>     phone= tb.StringCol(256)
>>
>> h5File= tb.openFile(<file name>, mode="a", title="Contacts")
>> t= h5File.createTable(h5File.root, 'ContactClass', ContactClass,
>> filters=tb.Filters(5, 'blosc'), expectedrows=77806938)
>>
>> resultSet= <get data from database>
>> currRow= t.row
>> print("Before appending data: %s" % str(datetime.now()))
>> for attributes in resultSet:
>>      currRow['name']= attributes[0]
>>      currRow['address']= attributes[1]
>>      currRow['emailAddr']= attributes[2]
>>      currRow['phone']= attributes[3]
>>      currRow.append()
>> print("After done appending: %s" % str(datetime.now()))
>> t.flush()
>> print("After done flushing: %s" % str(datetime.now()))
>>
>> .. gives me:
>> Before appending data: 2013-04-11 10:42:39.903713
>> After done appending: 2013-04-11 11:04:10.002712
>> After done flushing: 2013-04-11 11:05:50.059893
>>
>> It seems like append() takes a lot of time. Any suggestions on how to
>> improve this?
>>
>> Thanks,
>> Shyam
>>
>>
>