Hi, Inspired by the great rewrite of numpy.linalg in 1.8, I've spent the last couple of days coding a couple of the functions in numpy.lib as gufuncs, namely np.interp and np.bincount. I want to do something along the same lines to np.digitize, but haven't started on it just yet. I'm currently keeping the code in a separate repository ( https://github.com/jaimefrio/new_gufuncs), because I have a lot of problems compiling numpy, and this makes my life much easier while developing. If there's interest in adding these code to numpy, I´ll be more than happy to figure my compilation issues and turn this into a PR.
If you run python setup.py install, you should then be able to `import new_gufuncs as ng`. Basically, there are 4 gufuncs (ng._interp, ng._minmax, ng._bincount, ng._bincount_wt) and 2 Python wrappers to them (ng.interp, ng.bincount) that reproduce the interface of their numpy counterparts. Some obvious advantages: 1. Added support for more types, including complex, in both the interp data points and the bincount weights. 2. You can use ng.bincount to do things like clustering, or ng.interp to interpolate vector functions of a single variable. 3. It makes the code much more flexible, I'm sure people will come up with new uses. Some things I'm concerned about: 1. For very small datasets, performance is several times worse than the current functions. For larger datasets the gufunc versions perform better than the current functions, but the gufunc dispatch mechanism seems to take for ever to get going. 2. There's what feels like a very nasty hack in the bincount gufuncs: since the shape of the output depends on the contents of the input arrays, the signature cannot be determined beforehand. So the gufunc has signature '(m)->(n)', which means that the wrapper function has to determine what the shape of the output is, instantiate it, and pass it in with the 'out' keyword argument to the gufunc for things to work properly. IT works beautifully, but it just doesn't seem right. 3. The broadcasting syntax doesn't seem to be the most convenient for the obvious applications: 3.1. Say you have a function R->R^3. The obvious way of describing it as piecewise linear would be to have xp of shape (n,) and yp of shape (n, 3). To run ng.interp on a set of points x of shape (m,), and get back an output of shape (m, 3) you would have to do ng.interp(x, xp, yp.T).T. 3.2. Say you have counts of terms in documents. The typical way of storing this would be in an array `dataset` of shape (docs, terms). If you then run clustering on that data, you would typically get a `clusters` array of shape (docs,). To cluster the counts you would do something like ng.bincount(clusters, dataset.T).T Those transpositions in these two cases seem a little annoying, but then again "special cases aren't special enough to break the rules," or is it "although practicality beats purity" that applies here? Any feedback on any of the above points (or others) is more than welcome. If you think this is a worthy addition to numpy, I'll work on some tests and performance benchmarks. Jaime -- (\__/) ( O.o) ( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes de dominación mundial.
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion