On 4/18/12 12:33 PM, Alvaro Tejero Cantero wrote:
> A single array with 312 000 000 int16 values.
>
> Two (uncompressed) ways to store it:
>
> * Array
>
>>>> wa02[:10]
> array([306, 345, 353, 335, 345, 345, 356, 341, 338, 357], dtype=int16)
>
> * Table wtab02 (single column, named 'val')
>>>> wtab02[:10]
> array([(306,), (345,), (353,), (335,), (345,), (345,), (356,), (341,),
>         (338,), (357,)],
>        dtype=[('val', '<i2')])
>
> Read times: 120 ms and 220 ms, respectively.
>
>>>> timeit big=np.nonzero(wa02[:]>1)
> 1 loops, best of 3: 1.66 s per loop
>
>>>> timeit bigtab=wtab02.getWhereList('val>1')
> 1 loops, best of 3: 119 s per loop

Yes, this is expected.  The reason one method is much faster than the 
other is precisely that one is designed to operate out-of-core, while 
the other operates completely in memory, and this has a cost.  But that 
does not mean that out-of-core necessarily has to be slower.  Look at 
this:

In [107]: da
Out[107]:
/da (Array(10000000,)) ''
   atom := Int16Atom(shape=(), dflt=0)
   maindim := 0
   flavor := 'numpy'
   byteorder := 'little'
   chunkshape := None

In [108]: dra
Out[108]:
/dra (Table(10000000,), shuffle, blosc(5)) ''
   description := {
   "a": Int16Col(shape=(), dflt=0, pos=0)}
   byteorder := 'little'
   chunkshape := (65536,)

In [127]: time r = np.argwhere(da[:] == 1)
CPU times: user 0.08 s, sys: 0.02 s, total: 0.10 s
Wall time: 0.10 s

In [111]: time l = dra.getWhereList('a == 1')
CPU times: user 0.10 s, sys: 0.01 s, total: 0.11 s
Wall time: 0.11 s
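
For reference, here is a minimal sketch of how two such datasets could 
be created (the file name and the random data are my assumptions, not 
part of the original session; the calls are the standard PyTables 2.x 
API):

import numpy as np
import tables

f = tables.openFile('query_bench.h5', mode='w')
data = np.random.randint(0, 32000, 10000000).astype(np.int16)

# /da: a plain, uncompressed Array.
da = f.createArray(f.root, 'da', data)

# /dra: a single-column Table compressed with Blosc level 5 + shuffle.
filters = tables.Filters(complevel=5, complib='blosc', shuffle=True)
dra = f.createTable(f.root, 'dra', {'a': tables.Int16Col()},
                    filters=filters)
rec = np.empty(len(data), dtype=[('a', '<i2')])
rec['a'] = data   # copy the values into the single column
dra.append(rec)
f.flush()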

So, tables' getWhereList() performance is pretty close to NumPy's, even 
though the former is using compression.  This is a great achievement.  
The reason I'm getting very different results from you is this:

In [119]: len(l)
Out[119]: 153

That is, the selectivity of the query is extremely high (153 matches 
out of 10 million elements), which is exactly the scenario where 
queries are designed to shine.  If you use indexing, you can get even 
more speed:

In [131]: dra.cols.a.createCSIndex()
Out[131]: 10000000

In [132]: time l = dra.getWhereList('a == 1')
CPU times: user 0.02 s, sys: 0.01 s, total: 0.03 s
Wall time: 0.02 s

In your case, the selectivity is low (you are possibly asking for 
almost 50% of the initial dataset, perhaps less or perhaps more 
depending on your data pattern), which makes data object creation (one 
object per loop iteration) the big overhead in PyTables:

In [134]: time r = np.argwhere(da[:] > 1)
CPU times: user 1.03 s, sys: 0.03 s, total: 1.06 s
Wall time: 1.12 s

In [135]: time l = dra.getWhereList('a > 1')
CPU times: user 5.62 s, sys: 0.16 s, total: 5.78 s
Wall time: 5.89 s

Now getWhereList() is more than 5x slower.  Removing the index helps a 
bit here:

In [136]: dra.cols.a.removeIndex()

In [137]: time l = dra.getWhereList('a > 1')
CPU times: user 5.10 s, sys: 0.12 s, total: 5.22 s
Wall time: 5.30 s

But if the internal query machinery in PyTables is the same, why does 
it take longer?  The short answer is object creation (and some data 
copying).  getWhereList() can be expressed like this:

In [165]: time l = np.array([r.nrow for r in dra.where('a > 1')])
CPU times: user 5.54 s, sys: 0.09 s, total: 5.63 s
Wall time: 5.71 s

Now, if we count only the time to get the coordinates:

In [159]: time s = [r.nrow for r in dra.where('a > 1')]
CPU times: user 3.86 s, sys: 0.08 s, total: 3.95 s
Wall time: 4.02 s

This time is a bit long, but that is due to the .nrow implementation (a 
Cython property of the Row class; I wonder if this could be accelerated 
somewhat).  In general, the Row iterator can be much faster, for 
example when fetching values:

In [161]: time s = [r['a'] for r in dra.where('a > 1')]
CPU times: user 1.57 s, sys: 0.07 s, total: 1.63 s
Wall time: 1.61 s

and you can see that this is barely more than the time it takes to 
build a plain Python list:

In [139]: time l = [r for r in xrange(len(l))]
CPU times: user 1.44 s, sys: 0.11 s, total: 1.55 s
Wall time: 1.53 s
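
As an aside, if what you ultimately need are the matching values rather 
than their coordinates, Table.readWhere() reads the matching rows into 
a NumPy array in a single pass, skipping the per-Row Python objects 
(just a sketch, I have not timed it here):

# readWhere() returns a structured array of the matching rows;
# picking the 'a' field yields the values without Row overhead.
vals = dra.readWhere('a > 1')['a']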

So, the 'slow' times you are seeing are a consequence of the extra data 
object creation and the internal data copies (for building the final 
NumPy array).  NumPy is much faster because all of this is done in 
pure C.

But again, this does not change the fact that queries in PyTables are 
actually fast, and potentially much faster than NumPy for highly 
selective queries combined with indexing.
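
To make the high-selectivity workflow concrete, the pattern would look 
something like this (a sketch based on the session above):

dra.cols.a.createCSIndex()            # build a completely sorted index once
coords = dra.getWhereList('a == 1')   # index-assisted, very fast
rows = dra.readCoordinates(coords)    # fetch only the matching rows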

Hope this helps,

-- 
Francesc Alted

