Hello Shyam,

Can you please post the full traceback?  In any event, I am fairly certain
that this error is coming from the np.fromiter step.  The problem here is
that you are trying to read the entire result of your SQL query into a
single numpy array in memory, and that result is simply too large to fit in
RAM.  Therefore, you are going to need to read and write in chunks,
something like the following:

import numpy as np
import pyodbc

def getDataAndWriteHDF5(table):
    databaseConn = pyodbc.connect(<connection string>, <password>)
    cursor = databaseConn.cursor()
    cursor.execute("SQL Query")
    dt = np.dtype([('name', np.str_, 180), ('address', np.str_, 4200),
                   ('email', np.str_, 180), ('phone', np.str_, 256)])
    citer = iter(cursor)
    chunksize = 4096  # This is just a guess, other values might work better
    crange = range(chunksize)
    while True:
        # Pull at most `chunksize` rows off the cursor and convert them
        # directly into a structured array, then append that chunk.
        resultSet = np.fromiter((tuple(row) for i, row in zip(crange, citer)),
                                dtype=dt)
        table.append(resultSet)
        # The last chunk will be short (possibly empty); stop after it.
        if len(resultSet) < chunksize:
            break

You may want to tweak some things, but that is the basic strategy.
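
For reference, here is roughly how the table side might be set up so that it
lines up with the dtype above (the file name is a placeholder, the title,
filters, and expectedrows are just taken from your earlier snippets, and the
StringCol widths mirror the dtype):

import tables

class Contact(tables.IsDescription):
    # widths chosen to match the numpy dtype used in getDataAndWriteHDF5()
    name = tables.StringCol(180)
    address = tables.StringCol(4200)
    email = tables.StringCol(180)
    phone = tables.StringCol(256)

h5File = tables.openFile(<file name>, mode="a", title="Contacts")
table = h5File.createTable(h5File.root, "Contact", Contact, "Contact",
                           filters=tables.Filters(5, 'blosc'),
                           expectedrows=7000000)
getDataAndWriteHDF5(table)
table.flush()
h5File.close()

The main thing to keep consistent is that the StringCol widths match the
string widths in the numpy dtype, so that append() can write each chunk
without any per-row conversion.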

Be Well
Anthony


On Mon, Apr 15, 2013 at 10:16 PM, Shyam Parimal Katti <spk...@nyu.edu> wrote:

> Hello Anthony,
>
>
> Thank you for your suggestions. When I mentioned that I am reading the data
> from a database, I meant a DB2 database, not an HDF5 database/file.
>
>
> I followed your suggestions, so the code looks as follows:
>
>
> def createHDF5File():
>     h5File = tables.openFile(<file name>, mode="a")
>     table = h5File.createTable(h5File.root, "Contact", Contact, "Contact",
>                                expectedrows=7000000)
>     .....
>
> def getDataAndWriteHDF5(table):
>     databaseConn = pyodbc.connect(<connection string>, <password>)
>     cursor = databaseConn.cursor()
>     cursor.execute("SQL Query")
>     resultSet = np.fromiter((tuple(row) for row in cursor),
>                             dtype=[('name', numpy.str_, 180),
>                                    ('address', numpy.str_, 4200),
>                                    ('email', numpy.str_, 180),
>                                    ('phone', numpy.str_, 256)])
>     table.append(resultSet)
>
>
>
> Error message: MemoryError: cannot allocate array memory.
>
>
>
> I am setting the `expectedrows` parameter when creating the table in the
> HDF5 file, and yet I still encounter the error above. Looking forward to
> suggestions.
>
>
>
>
>
> > Hello Anthony,
> >
> > Thank you for replying back with suggestions.
> >
> > In response to your suggestions, I am *not reading the data from a file
> > in the first step, but from a database*.
> >
>
> Hello Shyam,
>
> Not to put too fine a point on it: HDF5 databases are files.  And reading
> from any kind of file incurs the same disk read overhead.
>
>
> >  I did try out your 1st suggestion of doing a table.append(list of
> > tuples), which took a little longer than the execution time I got with the
> > original code. Can you please guide me on how to chunk the data (that I
> > got from the database and stored as a list of tuples in Python)?
> >
>
> Ahh, so you should not be using a list of tuples.  These are Pythonic types,
> and the conversion between HDF5 types and Python types is what is slowing
> you down.  You should be passing a numpy structured array into append().
> Numpy types are very similar to (and often exactly the same as) HDF5 types.
> For large, continuous, structured data you want to avoid the Python
> interpreter as much as possible.  Use Python here as the glue code to compose
> a series of fast operations using the APIs exposed by numpy, pytables, etc.
>
> Be Well
> Anthony
>
>
>
> On Thu, Apr 11, 2013 at 6:16 PM, Shyam Parimal Katti <spk...@nyu.edu> wrote:
>
>> Hello Anthony,
>>
>> Thank you for replying back with suggestions.
>>
>> In response to your suggestions, I am *not reading the data from a file
>> in the first step, but from a database*. I did try out your 1st
>> suggestion of doing a table.append(list of tuples), which took a little
>> longer than the execution time I got with the original code. Can you
>> please guide me on how to chunk the data (that I got from the database
>> and stored as a list of tuples in Python)?
>>
>>
>> Thanks,
>> Shyam
>>
>>
>> Hi Shyam,
>>
>> The pattern that you are using to write to a table is basically one for
>> writing Python data to HDF5.  However, your data is already in a machine /
>> HDF5 native format.  Thus what you are doing here is an excessive amount of
>> work:  read data from file -> convert to Python data structures -> convert
>> back to HDF5 data structures -> write to file.
>>
>> When reading from a table you get back a numpy structured array (look them
>> up on the numpy website).
>>
>> Then instead of using rows to write back the data, just use Table.append()
>> [1] which lets you pass in a bunch of rows simultaneously.  (Note that your
>> data in this case is too large to fit into memory, so you may have to split
>> it up into chunks or use the new iterators which are in the development
>> branch.)
>>
>> Additionally, if all you are doing is copying a table wholesale, you should
>> use Table.copy() [2].  Or if you only want to copy some subset based on
>> a conditional you provide, use whereAppend() [3].
>>
>> Finally, if you want to do math or evaluate expressions on one table to
>> create a new table, use the Expr class [4].
>>
>> All of these will be waaaaay faster than what you are doing right now.
>>
>> Be Well
>> Anthony
>>
>> 1. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.append
>> 2. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.copy
>> 3. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.whereAppend
>> 4. http://pytables.github.io/usersguide/libref/expr_class.html
>>
>>
>>
>>
>> On Thu, Apr 11, 2013 at 12:23 PM, Shyam Parimal Katti <spk...@nyu.edu> wrote:
>>
>>> Hello,
>>>
>>> I am writing a lot of data (close to 122 GB) to an HDF5 file using
>>> PyTables. The execution time for writing the query result to the file is
>>> close to 10 hours, which includes querying the database and then writing to
>>> the file. When I timed the entire execution, I found that it takes as much
>>> time to get the data from the database as it takes to write to the HDF5
>>> file. Here is a small snippet (P.S.: the execution time noted below is not
>>> for the 122 GB of data, but for a small subset close to 10 GB):
>>>
>>> import tables as tb
>>> from datetime import datetime
>>>
>>> class ContactClass(tb.IsDescription):
>>>     name = tb.StringCol(4200)
>>>     address = tb.StringCol(4200)
>>>     emailAddr = tb.StringCol(180)
>>>     phone = tb.StringCol(256)
>>>
>>> h5File = tb.openFile(<file name>, mode="a", title="Contacts")
>>> t = h5File.createTable(h5File.root, 'ContactClass', ContactClass,
>>>                        filters=tb.Filters(5, 'blosc'),
>>>                        expectedrows=77806938)
>>>
>>> resultSet = get data from database
>>> currRow = t.row
>>> print("Before appending data: %s" % str(datetime.now()))
>>> for attributes in resultSet:
>>>     currRow['name'] = attributes[0]
>>>     currRow['address'] = attributes[1]
>>>     currRow['emailAddr'] = attributes[2]
>>>     currRow['phone'] = attributes[3]
>>>     currRow.append()
>>> print("After done appending: %s" % str(datetime.now()))
>>> t.flush()
>>> print("After done flushing: %s" % str(datetime.now()))
>>>
>>> ... gives me:
>>> Before appending data: 2013-04-11 10:42:39.903713
>>> After done appending: 2013-04-11 11:04:10.002712
>>> After done flushing: 2013-04-11 11:05:50.059893
>>>
>>> It seems like append() takes a lot of time. Any suggestions on how to
>>> improve this?
>>>
>>> Thanks,
>>> Shyam
>>>
>>>
>>
>
>
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
