On Fri, Mar 22, 2013 at 7:11 AM, Julio Trevisan <juliotrevi...@gmail.com> wrote:

> Hi,
>
> I just joined this list, I am using PyTables for my project and it works
> great and fast.
>
> I am just trying to optimize some parts of the program and I noticed that
> zipping the tuples to get one tuple per column takes much longer than
> reading the data itself. The thing is that readWhere() returns one tuple
> per row, whereas I need one tuple per column, so I have to use the zip()
> function to achieve this. Is there a way to skip this zip() operation?
> Please see below:
>
>
>     def quote_GetData(self, period, name, dt1, dt2):
>         """Returns timedata.Quotes object.
>
>         Arguments:
>           period -- value from within infogetter.QuotePeriod
>           name -- quote symbol
>           dt1, dt2 -- datetime.datetime or timestamp values
>
>         """
>         t = time.time()
>         node = self.quote_GetNode(period, name)
>         ts1 = misc.datetime2timestamp(dt1)
>         ts2 = misc.datetime2timestamp(dt2)
>
>         L = node.readWhere(
>                    "(timestamp/1000 >= %f) & (timestamp/1000 <= %f)" %
>                    (ts1/1000, ts2/1000))
>         rowNum = len(L)
>         Q = timedata.Quotes()
>         print "%s: took %f seconds to do everything else" % (name, time.time() - t)
>
>         t = time.time()
>         if rowNum > 0:
>             (Q.timestamp, Q.open, Q.close, Q.high, Q.low, Q.volume,
>              Q.numTrades) = zip(*L)
>         print "%s: took %f seconds to ZIP" % (name, time.time()-t)
>         return Q
>
> *And the printout:*
> BOVESPA.VISTA.PETR4: took 0.068788 seconds to do everything else
> BOVESPA.VISTA.PETR4: took 0.379910 seconds to ZIP
>

Hi Julio,

The problem here isn't zip (packing and unpacking are generally
fast operations -- they happen *all* the time in Python). Nor is the
problem specifically with PyTables. Rather, this is an issue with how you
are using numpy structured arrays (look them up). Basically, this is slow
because zip(*L) builds one tuple per column in which every element must be
boxed into a Python object of the corresponding type. For example, upcasting
every 32-bit integer to a Python int is very expensive!
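A minimal sketch of what that costs (the array and field names here are
made up for illustration):

    import numpy as np

    # stand-in for the structured array that readWhere() returns
    L = np.zeros(100000, dtype=[('a', 'i4'), ('b', 'i4'), ('c', 'i4'), ('d', 'i4')])

    # zip(*L) walks every row and boxes each int32 field into its own
    # Python-level scalar object: rows * columns allocations
    cols = list(zip(*L))

    # field indexing hands back the whole column as a numpy array instead
    a = L['a']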

What you *should* be doing is keeping the columns as numpy arrays, which
keeps the memory layout compact, contiguous, and fast, and if done right
does not require a copy (which you are doing now).
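A quick way to convince yourself that no copy happens (again, names are
illustrative):

    import numpy as np

    x = np.zeros(5, dtype=[('a', 'f8'), ('b', 'f8')])
    a = x['a']                 # a view onto x's buffer, not a copy
    a[0] = 42.0                # writing through the view...
    assert x['a'][0] == 42.0   # ...is visible in the original array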

The value of L here is a structured array. So say I have some
other structured array x with 4 fields; the right way to do this is to pull
out each field individually by indexing:

a, b, c, d = x['a'], x['b'], x['c'], x['d']

or more generally (for all fields):

a, b, c, d = map(lambda f: x[f], x.dtype.names)

or for some list of fields:

a, c, b = map(lambda f: x[f], ['a', 'c', 'b'])

Timing both your original method and the new one gives:

In [47]: timeit a, b, c, d = zip(*x)
1000 loops, best of 3: 1.3 ms per loop

In [48]: timeit a, b, c, d = map(lambda f: x[f], x.dtype.names)
100000 loops, best of 3: 2.3 µs per loop

So the method I propose is 500x-1000x faster. Using numpy
idiomatically is very important!
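Applied to quote_GetData above, the whole zip block could be replaced with
direct field access, something like this (assuming the table's column names
match the attribute names, as the unpacking order suggests):

    if rowNum > 0:
        # each attribute stays a numpy array -- no per-element boxing, no copy
        Q.timestamp = L['timestamp']
        Q.open = L['open']
        Q.close = L['close']
        Q.high = L['high']
        Q.low = L['low']
        Q.volume = L['volume']
        Q.numTrades = L['numTrades']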

Be Well
Anthony

