On Monday 06 December 2010 22:00:29 Wai Yip Tung wrote:
> Thank you for the quick response and Christopher's explanation on the
> design background.
>
> All my tables fit in-memory. I want to explore the data interactively
> and relational database is does not provide me a lot of value.
>
> I was rolling my own library before I come to numpy. Then I find
> numpy's universal function awesome and really fit what I want to do.
> Now I just need to find out what to add row which is easy in Python.
> It is OK if it rebuild an array when I add a column, which should
> happen infrequently. But if adding row build a new array, this will
> lead to O(n^2) complexity. In anycase, I will explore the
> recfunctions.

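To make the O(n^2) concern above concrete, here is a minimal sketch in plain NumPy. It is not taken from carray; the function names and the `copied` counter are made up for illustration. It assumes rows are appended one at a time with np.concatenate, which allocates a new array and copies all existing rows on every call:

```python
import numpy as np

# Two-field structured dtype, same layout as "i4,i8".
dtype = np.dtype([("f0", "i4"), ("f1", "i8")])


def append_rows_by_concatenate(n):
    """Grow a structured array one row at a time.

    Every np.concatenate call builds a new array, so all rows that
    already exist are copied again on each append.
    """
    arr = np.empty(0, dtype=dtype)
    copied = 0  # hand-kept count of existing rows copied (illustration only)
    for i in range(n):
        row = np.array([(i, i * i)], dtype=dtype)
        copied += len(arr)
        arr = np.concatenate((arr, row))
    return arr, copied


def append_rows_via_list(n):
    """Collect rows in a Python list, then build the array once."""
    rows = [(i, i * i) for i in range(n)]
    return np.array(rows, dtype=dtype)


a, copied = append_rows_by_concatenate(1000)
b = append_rows_via_list(1000)

# Both approaches produce the same data.
assert np.array_equal(a["f0"], b["f0"])
assert np.array_equal(a["f1"], b["f1"])
print(copied)  # n*(n-1)/2 row copies, i.e. 499500 for n = 1000
```

The row-by-row version copies n*(n-1)/2 existing rows in total, which is where the quadratic cost comes from. Collecting the rows in a list and converting once avoids the repeated copies.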
If you want a container with a better complexity for adding columns than
O(n^2), you may want to have a look at the ctable object in the carray
package:

https://github.com/FrancescAlted/carray

carray is about providing compressed, in-memory data containers for both
homogeneous (arrays) and heterogeneous data (structured arrays).

Here is an example of use:

>>> import numpy as np
>>> import carray as ca
>>> NR = 1000*1000
>>> r = np.fromiter(((i,i*i) for i in xrange(NR)), dtype="i4,i8")
>>> new_field = np.arange(NR, dtype='f8')**3
>>> rc = ca.ctable(r)
>>> rc
ctable((1000000,), [('f0', '<i4'), ('f1', '<i8')])
  nbytes: 11.44 MB; cbytes: 1.71 MB; ratio: 6.70
[(0, 0), (1, 1), (2, 4), ..., (999997, 999994000009), (999998, 999996000004), (999999, 999998000001)]
>>> time rc.addcol(new_field, "f2")
CPU times: user 0.03 s, sys: 0.00 s, total: 0.03 s
Wall time: 0.03 s

That is, only 30 ms for appending a column. This is basically the time
to copy (and compress) the data, i.e. O(n). If you append an already
compressed column, the cost of adding it is O(1):

>>> r = np.fromiter(((i,i*i) for i in xrange(NR)), dtype="i4,i8")
>>> rc = ca.ctable(r)
>>> cnew_field = ca.carray(np.arange(NR, dtype='f8')**3)
>>> time rc.addcol(cnew_field, "f2")
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s

On the other hand, using plain structured arrays is considerably more
costly:

>>> import numpy.lib.recfunctions as nprf
>>> time r2 = nprf.rec_append_fields(r, 'f2', new_field, 'f8')
CPU times: user 0.34 s, sys: 0.02 s, total: 0.36 s
Wall time: 0.36 s

Appending data at the end of ctable objects is also very fast:

>>> timeit rc.append(row)
100000 loops, best of 3: 13.1 µs per loop

Compare this with an append to a structured array:

>>> timeit np.concatenate((r2, row))
100 loops, best of 3: 6.84 ms per loop

Unfortunately, ctables do not support the full range of operations that
structured arrays do; a ctable object is rather meant to be used as an
efficient, compressed container for structures in memory:

>>> r2[2]
(2, 4, 8.0)
>>> rc[2]
(2, 4, 8.0)
>>> r2['f1']
array([0, 1, 4, ..., 1, 1, 1])
>>> rc['f1']
carray((1452223,), int64)  nbytes: 11.08 MB; cbytes: 1.62 MB; ratio: 6.85
  cparams := cparams(clevel=5, shuffle=True)
[0, 1, 4, ..., 1, 1, 1]

But still, you can do fun things like complex queries:

>>> [r for r in rc.getif("(f0<10)&(f2>4)", ["__nrow__", "f1"])]
[(2, 4), (3, 9), (4, 16), (5, 25), (6, 36), (7, 49), (8, 64), (9, 81), (1041112, 1)]

The queries are also very fast (both Numexpr and Blosc are used under
the hood):

>>> timeit [r for r in rc.getif("(f0<10)&(f2>4)")]
10 loops, best of 3: 58.6 ms per loop
>>> timeit r2[(r2['f0']<10)&(r2['f2']>4)]
10 loops, best of 3: 28 ms per loop

So, queries on ctables are only about 2x slower than on plain structured
arrays. Of course, the secret goal is to make these sorts of queries
actually faster than with structured arrays :)

I still need to finish the docs, but I plan to release carray 0.3 later
this week.

Cheers,

-- 
Francesc Alted
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
