On Friday 26 November 2010 16:15:54, Gerrit Holl wrote:
> Hi,
> 
> I'm trying to find the most efficient way to read multiple fields
> from my table. I noticed that readWhere is really slow:
> 
> In [158]: %timeit A = array([row[field] for row in
> t.root.collocations.where(cond)], dtype='int64')
> 1 loops, best of 3: 649 ms per loop
> 
> In [159]: %timeit B = t.root.collocations.readWhere(cond,
> field=field)
> 1 loops, best of 3: 7.99 s per loop
> 
> In [160]: (A==B).all()
> Out[160]: True
> 
> Why is readWhere so slow?

That's a good question.  After looking at the code: where() needs to 
traverse the complete table in order to get the results.  On the other 
hand, readWhere() first does a complete table traversal (in order to get 
the coordinates that fulfill the condition), and then tells HDF5 to 
retrieve only those coordinates (via an H5Sselect_elements() call) out of 
the table.  This is why, in general, the former option is faster.

The moral here seems to be that the selection features integrated in 
PyTables are far more efficient than those of HDF5 (this is kind of 
expected, as PyTables is using Numexpr and other techniques behind the 
scenes).
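
Just to illustrate the point, here is a rough sketch of what readWhere() 
is doing under the hood, reusing the table, cond and field names from 
your timings (an approximation, not the actual implementation):

# Roughly equivalent to readWhere(cond, field=field): a full traversal
# to collect the matching coordinates, followed by an HDF5 element
# selection that fetches only those rows.
coords = t.root.collocations.getWhereList(cond)
B = t.root.collocations.readCoordinates(coords, field=field)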

> Now, I really have multiple fields. Neither readWhere, nor
> row.__getitem__, nor the void returned by row.fetch_all_fields
> supports fetching more than one field as a recarray does. Is this by
> design?

By design, or rather, because of a lack of resources.  Anyway, I agree 
that multiple-field fetching would be a nice feature to add.  Would you 
like to file a ticket for this?

> What is the preferred way to fetch multiple fields? I would
> like to get a ndarray with the requested fields, of course
> maintaining the dtype. If I write:
> 
> In [218]: %timeit A = [[row[f] for f in fields] for row in
> t.root.collocations.where(cond)]
> 1 loops, best of 3: 1.19 s per loop
> 
> I lose the dtype
> 
> I could do the slower and more memory-intensive
> 
> In [217]: %timeit A = [row.fetch_all_fields() for row in
> t.root.collocations.where(cond)]
> 1 loops, best of 3: 1.56 s per loop
> 
> but then I really need
> 
> In [235]: %timeit A = [row.fetch_all_fields() for row in
> t.root.collocations.where(cond)]; A = array(A,
> dtype=A[0].dtype)[fields]
> 1 loops, best of 3: 2.08 s per loop
> 
> which is faster and less ugly than
> 
> In [351]: %timeit A = [tuple(row[f] for f in fields) for row in
> t.root.collocations.where(cond)]; A = array(A,
> dtype=dtype(zip(fields, (t.root.collocations.dtype[f] for f in
> fields))))
> 1 loops, best of 3: 3.07 s per loop
> 
> but probably needs more memory.
> 
> What is the best way to go here?

Well, I think you summarized the state of things here very well, so it 
is up to you ;-)
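
Just as an idea (an untested sketch, assuming NumPy's fromiter accepts a 
structured dtype, which it should for plain scalar fields, and reusing 
your t, fields and cond names), something along these lines should keep 
the dtype without building an intermediate list of full records:

import numpy as np

tbl = t.root.collocations
# Structured dtype restricted to the wanted fields only.
sub_dtype = np.dtype([(f, tbl.dtype[f]) for f in fields])
# Fill directly from the where() iterator; each row becomes one record.
A = np.fromiter((tuple(row[f] for f in fields) for row in tbl.where(cond)),
                dtype=sub_dtype)

That should be close in speed to your first list-comprehension variant 
while keeping the field dtypes.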

> Note that my toy example table has only 1.5 million rows, but in
> production use it will be closer to several hundred million rows.

Hmmm, for these amounts I'd say that indexing some columns with PyTables 
Pro would help you a lot.  But I'd bet you already suspected that.
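
For example (a minimal sketch; the column name 'field1' is made up, use 
whichever columns appear in your condition):

# Index the column(s) used in the condition; where() queries on that
# column can then use the index instead of scanning the whole table.
t.root.collocations.cols.field1.createIndex()
t.flush()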

-- 
Francesc Alted
