A single array with 312,000,000 int16 values.

Two (uncompressed) ways to store it:

* Array wa02

>>> wa02[:10]
array([306, 345, 353, 335, 345, 345, 356, 341, 338, 357], dtype=int16)

* Table wtab02 (single column, named 'val')

>>> wtab02[:10]
array([(306,), (345,), (353,), (335,), (345,), (345,), (356,), (341,),
       (338,), (357,)],
      dtype=[('val', '<i2')])
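(For reproducibility, a minimal sketch of how the two containers might be
created; the file name and the stand-in data are my assumptions, written
against the PyTables 2.x API that matches the getWhereList call below.)

import numpy as np
import tables

N = 312000000
data = np.random.randint(300, 360, size=N).astype(np.int16)  # stand-in values

f = tables.openFile('wdata.h5', mode='w')

# Plain uncompressed Array:
wa02 = f.createArray(f.root, 'wa02', data)

# Uncompressed single-column Table; a NumPy dtype works as the description:
dt = np.dtype([('val', np.int16)])
wtab02 = f.createTable(f.root, 'wtab02', dt)
wtab02.append(data.view(dt))
f.flush()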

Read times: 120 ms and 220 ms respectively.
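(Timed in IPython along these lines; the exact expressions are my shorthand
for materializing each container into memory:)

>>> timeit a = wa02[:]
>>> timeit t = wtab02[:]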

>>> timeit big=np.nonzero(wa02[:]>1)
1 loops, best of 3: 1.66 s per loop

>>> timeit bigtab=wtab02.getWhereList('val>1')
1 loops, best of 3: 119 s per loop

With a completely sorted index (CSI) on 'val' and Blosc level-9 compression:
1 loops, best of 3: 149 s per loop

Specifying expectedrows=312000000 (so that the chunklen goes from 32K to 132K):
1 loops, best of 3: 119 s per loop
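(For completeness, a sketch of how the indexed/compressed variant could be
built; Filters and createCSIndex are PyTables 2.x calls, the node name
wtab02c is mine:)

filters = tables.Filters(complevel=9, complib='blosc')
wtab02c = f.createTable(f.root, 'wtab02c', dt,
                        filters=filters,
                        expectedrows=312000000)  # lets PyTables pick larger chunks
wtab02c.append(data.view(dt))
wtab02c.cols.val.createCSIndex()  # completely sorted index (CSI) on 'val'
f.flush()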

(I wanted to compare getting a boolean mask as well, but it seems that
Tables don't have a .wheretrue() like the carrays in Francesc's carray
package(?). For reference, computing just the mask takes 344 ms.)
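(That mask being just the element-wise comparison; the exact expression is
my reconstruction:)

>>> timeit mask = wa02[:] > 1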

---

Question: is the difference in speed due to in-core vs. out-of-core processing?

If so, and if the largest unit of data fits in memory (even allowing for
loading a few columns to operate on together), is the corollary 'stay in
memory at all costs'?

With this exercise, I was trying to find out what the best structure is
for holding raw data (just one column in this case), and whether indexing
could help with queries.

-á.
