On 4/18/12 12:33 PM, Alvaro Tejero Cantero wrote:
> A single array with 312 000 000 int16 values.
>
> Two (uncompressed) ways to store it:
>
> * Array wa02
>
> >>> wa02[:10]
> array([306, 345, 353, 335, 345, 345, 356, 341, 338, 357], dtype=int16)
>
> * Table wtab02 (single column, named 'val')
>
> >>> wtab02[:10]
> array([(306,), (345,), (353,), (335,), (345,), (345,), (356,), (341,),
>        (338,), (357,)],
>       dtype=[('val', '<i2')])
>
> Read times are 120 ms and 220 ms respectively.
>
> >>> timeit big = np.nonzero(wa02[:] > 1)
> 1 loops, best of 3: 1.66 s per loop
>
> >>> timeit bigtab = wtab02.getWhereList('val>1')
> 1 loops, best of 3: 119 s per loop
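As an aside, the two layouts quoted above can be mimicked in pure NumPy (a minimal sketch; the values come from the quoted output, while the `vals` and `tab` names are made up):

```python
import numpy as np

# Plain int16 array, like wa02.
vals = np.array([306, 345, 353, 335, 345, 345, 356, 341, 338, 357],
                dtype=np.int16)

# The same bytes viewed as a one-column structured array, which is
# the dtype that a single-column Table ('val') reads back as.
tab = vals.view(np.dtype([('val', '<i2')]))

print(vals[:3])   # plain scalars
print(tab[:3])    # one-field records
```

The two objects share the same underlying buffer; only the dtype differs.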
Yes, this is expected. One method is much faster than the other precisely because one is designed to operate out-of-core, while the other operates completely in memory, and that difference has a cost. But that does not mean that out-of-core necessarily has to be slower. Look at this:

In [107]: da
Out[107]:
/da (Array(10000000,)) ''
  atom := Int16Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

In [108]: dra
Out[108]:
/dra (Table(10000000,), shuffle, blosc(5)) ''
  description := {
  "a": Int16Col(shape=(), dflt=0, pos=0)}
  byteorder := 'little'
  chunkshape := (65536,)

In [127]: time r = np.argwhere(da[:] == 1)
CPU times: user 0.08 s, sys: 0.02 s, total: 0.10 s
Wall time: 0.10 s

In [111]: time l = dra.getWhereList('a == 1')
CPU times: user 0.10 s, sys: 0.01 s, total: 0.11 s
Wall time: 0.11 s

So, Table's getWhereList() performance is pretty close to NumPy's, even though the former is using compression. This is a great achievement. The reason I am getting very different results than you is this:

In [119]: len(l)
Out[119]: 153

That is, the selectivity of the query is extremely high (153 matches out of 10 million elements), which is the scenario where queries are designed to shine. If you use indexing, you can get even more speed:

In [131]: dra.cols.a.createCSIndex()
Out[131]: 10000000

In [132]: time l = dra.getWhereList('a == 1')
CPU times: user 0.02 s, sys: 0.01 s, total: 0.03 s
Wall time: 0.02 s

In your case the selectivity is low (you are asking for possibly almost 50% of the initial dataset, perhaps less or perhaps more, depending on your data pattern), so the data object creation in PyTables (one object per iteration of the loop) becomes the big overhead:

In [134]: time r = np.argwhere(da[:] > 1)
CPU times: user 1.03 s, sys: 0.03 s, total: 1.06 s
Wall time: 1.12 s

In [135]: time l = dra.getWhereList('a > 1')
CPU times: user 5.62 s, sys: 0.16 s, total: 5.78 s
Wall time: 5.89 s

Now getWhereList() is more than 5x slower.
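The selectivity effect can be reproduced in pure NumPy (a hedged sketch with synthetic data; `data`, `few` and `many` are made-up names, and the array is much smaller than the 10-million-row sessions above):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.integers(0, 1000, size=1_000_000).astype(np.int16)

# High selectivity: only values equal to 1 match (~0.1% of rows),
# so the result array is tiny and cheap to build.
few = np.argwhere(data == 1)

# Low selectivity: roughly half of the rows match, so a large result
# array must be allocated and filled.
many = np.argwhere(data > 500)

print(len(few), len(many))
```

The matching rows themselves, not the scan, dominate the low-selectivity case: the cost of materializing the result grows with the number of hits, which is why the '> 1' queries above behave so differently from '== 1'.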
Removing the index helps a bit here:

In [136]: dra.cols.a.removeIndex()

In [137]: time l = dra.getWhereList('a > 1')
CPU times: user 5.10 s, sys: 0.12 s, total: 5.22 s
Wall time: 5.30 s

But if the internal query machinery in PyTables is the same, why does it take longer? The short answer is object creation (and some data copying). getWhereList() can be expressed like this:

In [165]: time l = np.array([r.nrow for r in dra.where('a > 1')])
CPU times: user 5.54 s, sys: 0.09 s, total: 5.63 s
Wall time: 5.71 s

Now, if we count only the time to get the coordinates:

In [159]: time s = [r.nrow for r in dra.where('a > 1')]
CPU times: user 3.86 s, sys: 0.08 s, total: 3.95 s
Wall time: 4.02 s

This time is a bit long, but that is due to the .nrow implementation (a Cython property of the Row class; I wonder if this could be accelerated somewhat). In general, the Row iterator can be much faster, for example when fetching values:

In [161]: time s = [r['a'] for r in dra.where('a > 1')]
CPU times: user 1.57 s, sys: 0.07 s, total: 1.63 s
Wall time: 1.61 s

and you can see that this is essentially the time that pure list creation takes:

In [139]: time l = [r for r in xrange(len(l))]
CPU times: user 1.44 s, sys: 0.11 s, total: 1.55 s
Wall time: 1.53 s

So the 'slow' times that you are seeing are a consequence of the per-row data object creation and the internal data copies (for building the final NumPy array). NumPy is much faster because that whole process happens in pure C. But again, this does not change the fact that queries in PyTables are actually fast --and potentially much faster than NumPy for high-selectivity queries combined with indexing.

Hope this helps,

-- Francesc Alted
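The object-creation overhead described above can be illustrated without PyTables at all: a Python-level loop that creates one object per matching row (roughly what iterating over Row and collecting .nrow has to do) versus the same selection done entirely in C (a minimal sketch; the data and names are made up):

```python
import time

import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 4, size=1_000_000).astype(np.int16)

# Vectorized: the scan and the result construction happen in C.
t0 = time.perf_counter()
coords_c = np.nonzero(data > 1)[0]
t_vec = time.perf_counter() - t0

# Python-level loop: one int object and one list append per match,
# then a final copy into a NumPy array.
t0 = time.perf_counter()
coords_py = np.array([i for i, v in enumerate(data) if v > 1])
t_loop = time.perf_counter() - t0

assert np.array_equal(coords_c, coords_py)
print(f"vectorized: {t_vec:.4f}s  python loop: {t_loop:.4f}s")
```

Both produce identical coordinates; only the per-row Python object traffic separates their run times, which mirrors the getWhereList() vs. np.argwhere() gap in the sessions above.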
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users