Hi,

Inspired by the great rewrite of numpy.linalg in 1.8, I've spent the last
couple of days coding a couple of the functions in numpy.lib as gufuncs,
namely np.interp and np.bincount. I want to do something along the same
lines to np.digitize, but haven't started on it just yet. I'm currently
keeping the code in a separate repository (
https://github.com/jaimefrio/new_gufuncs), because I have a lot of problems
compiling numpy, and this makes my life much easier while developing. If
there's interest in adding these code to numpy, I´ll be more than happy to
figure my compilation issues and turn this into a PR.

If you run python setup.py install, you should then be able to `import
new_gufuncs as ng`.

Basically, there are 4 gufuncs (ng._interp, ng._minmax, ng._bincount,
ng._bincount_wt) and 2 Python wrappers to them (ng.interp, ng.bincount)
that reproduce the interface of their numpy counterparts.

Some obvious advantages:

1. Added support for more types, including complex, in both the interp data
points and the bincount weights.
2. You can use ng.bincount to do things like clustering, or ng.interp to
interpolate vector functions of a single variable.
3. It makes the code much more flexible, I'm sure people will come up with
new uses.

Some things I'm concerned about:

1. For very small datasets, performance is several times worse than the
current functions. For larger datasets the gufunc versions perform better
than the current functions, but the gufunc dispatch mechanism seems to take
for ever to get going.
2. There's what feels like a very nasty hack in the bincount gufuncs: since
the shape of the output depends on the contents of the input arrays, the
signature cannot be determined beforehand. So the gufunc has signature
'(m)->(n)', which means that the wrapper function has to determine what the
shape of the output is, instantiate it, and pass it in with the 'out'
keyword argument to the gufunc for things to work properly. IT works
beautifully, but it just doesn't seem right.
3. The broadcasting syntax doesn't seem to be the most convenient for the
obvious applications:
3.1. Say you have a function R->R^3. The obvious way of describing it as
piecewise linear would be to have xp of shape (n,) and yp of shape (n, 3).
To run ng.interp on a set of points x of shape (m,), and get back an output
of shape (m, 3) you would have to do ng.interp(x, xp, yp.T).T.
3.2. Say you have counts of terms in documents. The typical way of storing
this would be in an array `dataset` of shape (docs, terms). If you then run
clustering on that data, you would typically get a `clusters` array of
shape (docs,). To cluster the counts you would do something like
ng.bincount(clusters, dataset.T).T
Those transpositions in these two cases seem a little annoying, but then
again "special cases aren't special enough to break the rules," or is
it "although
practicality beats purity" that applies here?

Any feedback on any of the above points (or others) is more than welcome.
If you think this is a worthy addition to numpy, I'll work on some tests
and performance benchmarks.

Jaime

-- 
(\__/)
( O.o)
( > <) Este es Conejo. Copia a Conejo en tu firma y ayúdale en sus planes
de dominación mundial.
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion

Reply via email to