A single array with 312 000 000 int16 values. Two (uncompressed) ways to store it:
* Array

    >>> wa02[:10]
    array([306, 345, 353, 335, 345, 345, 356, 341, 338, 357], dtype=int16)

* Table wtab02 (single column, named 'val')

    >>> wtab02[:10]
    array([(306,), (345,), (353,), (335,), (345,), (345,), (356,), (341,),
           (338,), (357,)], dtype=[('val', '<i2')])

Full read times: 120 ms and 220 ms respectively.

Querying:

    >>> timeit big = np.nonzero(wa02[:] > 1)
    1 loops, best of 3: 1.66 s per loop

    >>> timeit bigtab = wtab02.getWhereList('val>1')
    1 loops, best of 3: 119 s per loop

With a completely sorted index (CSI) on 'val' and blosc9 compression:

    1 loops, best of 3: 149 s per loop

Specifying expectedrows=312000000 at table creation (so that the chunklen
goes from 32K to 132K):

    1 loops, best of 3: 119 s per loop

(I wanted to compare getting a boolean mask as well, but it seems that Tables
don't have a .wheretrue() like the carrays in Francesc's carray package (?).
For reference, computing just the mask takes 344 ms.)

---

Question: is the difference in speed due to in-core vs. out-of-core
processing? If so, and if the largest unit of data fits in memory (even
allowing for loading a few columns to operate on together), is the corollary
'stay in memory at all costs'?

With this exercise I was trying to find out which structure is best for
holding raw data (just one column in this case), and whether indexing could
help with queries.

-á.
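P.S. In case anyone wants to reproduce the comparison, here is a minimal
sketch, assuming the PyTables 2.x camelCase API used above; the file name,
the synthetic data, and the option to shrink N are my placeholders, not part
of the timings reported:

    import numpy as np
    import tables as tb  # PyTables 2.x, camelCase API

    N = 312 * 10**6  # shrink this a lot to try it on a laptop
    data = np.random.randint(0, 400, size=N).astype(np.int16)  # placeholder data

    h5 = tb.openFile('demo.h5', mode='w')  # hypothetical file name

    # 1) Plain Array: one contiguous int16 dataset
    wa02 = h5.createArray('/', 'wa02', data)

    # 2) Single-column Table over the same values; expectedrows lets
    #    PyTables pick a larger chunklen, as in the 149 s -> 119 s step
    desc = np.dtype([('val', np.int16)])
    wtab02 = h5.createTable('/', 'wtab02', desc, expectedrows=N)
    wtab02.append(data.view(desc))
    wtab02.flush()

    # Optional: completely sorted index on 'val', Blosc at level 9
    wtab02.cols.val.createCSIndex(
        filters=tb.Filters(complib='blosc', complevel=9))

    # In-core query: one big read, then the test runs on the in-memory array
    big = np.nonzero(wa02[:] > 1)

    # Out-of-core query: PyTables evaluates the condition chunk by chunk
    bigtab = wtab02.getWhereList('val > 1')

    h5.close()

The closest thing I found to a mask-style query is reading the column first,
e.g. mask = wtab02.cols.val[:] > 1, but that is of course in-core again, not
a real out-of-core .wheretrue().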