On Thu, Sep 11, 2014 at 7:47 PM, Charles R Harris <[email protected]> wrote:
>
> On Thu, Sep 11, 2014 at 10:10 AM, Charles R Harris
> <[email protected]> wrote:
>>
>> On Wed, Sep 10, 2014 at 10:08 PM, Nathaniel Smith <[email protected]> wrote:
>>>
>>> On Wed, Sep 10, 2014 at 4:53 PM, Charles R Harris
>>> <[email protected]> wrote:
>>>>
>>>> On Wed, Sep 10, 2014 at 10:52 AM, Pauli Virtanen <[email protected]> wrote:
>>>>>
>>>>> On 09.09.2014 at 22:52, Charles R Harris wrote:
>>>>>> 1. Should the operator accept array_like for one of the arguments?
>>>>>> 2. Does it need to handle __numpy_ufunc__, or will
>>>>>> __array_priority__ serve?
>>>>>
>>>>> I think the __matmul__ operator implementation should follow that
>>>>> of __mul__.
>>>>>
>>>>> [clip]
>>>>>> 3. Do we want PyArray_Matmul in the numpy API?
>>>>>> 4. Should a matmul function be supplied by the multiarray module?
>>>>>>
>>>>>> If 3 and 4 are wanted, should they use the __numpy_ufunc__
>>>>>> machinery, or will __array_priority__ serve?
>>>>>
>>>>> The dot() function deals with __numpy_ufunc__, and the matmul()
>>>>> function should behave similarly.
>>>>>
>>>>> It seems dot() uses __array_priority__ to select the subclass of
>>>>> the output, so matmul() probably needs to do the same thing.
>>>>>
>>>>>> Note that the number operators, __add__ and such, currently use
>>>>>> __numpy_ufunc__ in combination with __array_priority__, in
>>>>>> addition to the fact that they by default use ufuncs that do the
>>>>>> same. I'd rather that the __*__ operators simply rely on
>>>>>> __array_priority__.
>>>>>
>>>>> The whole business of __array_priority__ and __numpy_ufunc__ in
>>>>> the binary ops is solely about when __<op>__ should yield
>>>>> execution to __r<op>__ of the other object.
>>>>>
>>>>> The rule of operation currently is: "__rmul__ before
>>>>> __numpy_ufunc__".
>>>>>
>>>>> If you remove the __numpy_ufunc__ handling, it becomes: "__rmul__
>>>>> before __numpy_ufunc__, except if __array_priority__ happens to be
>>>>> smaller than that of the other class and your class is not an
>>>>> ndarray subclass".
>>>>>
>>>>> The following binops also do not, IIRC, respect __array_priority__
>>>>> in preferring the right-hand operand:
>>>>>
>>>>> - in-place operations
>>>>> - comparisons
>>>>>
>>>>> One question here is whether it's possible to change the behavior
>>>>> of __array_priority__ here at all, or whether changes are possible
>>>>> only in the context of adding new attributes telling Numpy what to
>>>>> do.
>>>>
>>>> I was tempted to make it a generalized ufunc, which would take care
>>>> of a lot of things, but there is a lot of overhead in those
>>>> functions. Sounds like the easiest thing is to make it similar to
>>>> dot, although having an in-place version complicates the type
>>>> selection a bit.
>>>
>>> Can we please fix the overhead instead of adding more half-complete
>>> implementations of the same concepts? I feel like this usually ends
>>> up slowing things down in the end, as optimization efforts get
>>> divided...
>>>
>>> My vote is:
>>>
>>> __matmul__/__rmatmul__ do the standard dispatch stuff that all
>>> __op__ methods do (so I guess check __array_priority__ or whatever
>>> it is we always do). I'd also be okay with ignoring
>>> __array_priority__ on the grounds that __numpy_ufunc__ is better,
>>> and there's no existing code relying on __array_priority__ support
>>> in __matmul__.
>>>
>>> Having decided that we are actually going to run, they dispatch
>>> unconditionally to np.newdot(a, b) (or newdot(a, b, out=a) for the
>>> in-place version), similarly to how e.g. __add__ dispatches to
>>> np.add.
>>>
>>> newdot acts like a standard gufunc with all the standard niceties,
>>> including __numpy_ufunc__ dispatch.
>>>
>>> ("newdot" here is intended as a placeholder name; maybe it should be
>>> np.linalg.matmul or something else to be bikeshedded later. I also
>>> vote that eventually 'dot' become an alias for this function, but
>>> whether to do that is an orthogonal discussion for later.)
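To make the proposed dispatch concrete, here is a rough Python-level
sketch of the deferral pattern under discussion. It is illustrative
only: "newdot" is Nathaniel's placeholder name (np.dot stands in for
it so the sketch runs), MyArray is a hypothetical subclass, and the
yield-to-__rmatmul__ test is a simplification of what the C binops
actually do.

import numpy as np

def newdot(a, b, out=None):
    # Stand-in for the proposed gufunc; np.dot is used only so the
    # sketch is runnable.
    return np.dot(a, b, out=out)

class MyArray(np.ndarray):
    # Hypothetical subclass showing the binop deferral pattern.
    __array_priority__ = 10.0

    def __matmul__(self, other):
        # Simplified "when to yield" rule: defer to the other operand
        # if it is not one of us and claims control via
        # __numpy_ufunc__ or a higher __array_priority__.
        if (not isinstance(other, MyArray)
                and hasattr(other, '__rmatmul__')
                and (hasattr(other, '__numpy_ufunc__')
                     or getattr(other, '__array_priority__', 0.0)
                     > self.__array_priority__)):
            return NotImplemented
        # Otherwise dispatch unconditionally to the gufunc, which does
        # its own __numpy_ufunc__ handling.
        return newdot(self, other)

    def __rmatmul__(self, other):
        return newdot(other, self)

    def __imatmul__(self, other):
        # In-place form, newdot(a, b, out=a), as proposed above.
        return newdot(self, other, out=self)

With PEP 465 in place, a = np.eye(2).view(MyArray); a @ a would then
route through newdot.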
>>
>> If we went the ufunc route, I think we would want three of them:
>> matxvec, vecxmat, and matxmat, because the best inner loops would be
>> different in the three cases. They couldn't be straight ufuncs
>> themselves, as we don't need the other options, `reduce`, etc., but
>> they can't be exactly like the linalg machinery either, because we
>> do want subclasses to be able to override. Hmm...
>>
>> The ufunc machinery has some funky aspects. For instance, there are
>> hardwired checks for `__radd__` and other such operators in
>> PyUFunc_GenericFunction that allow subclasses to override the ufunc.
>> Those options should really be part of the PyUFuncObject.
>
> Here are some timing results for the inner product of an integer
> array, comparing the inner, ufunc (inner1d), and einsum
> implementations. The np.long type was chosen because it is
> implemented for inner1d and avoids the effect of using cblas.
>
> In [43]: a = ones(100000, dtype=np.long)
>
> In [44]: timeit inner(a,a)
> 10000 loops, best of 3: 55.5 µs per loop
>
> In [45]: timeit inner1d(a,a)
> 10000 loops, best of 3: 56.2 µs per loop
>
> In [46]: timeit einsum('...i,...i',a,a)
> 10000 loops, best of 3: 43.8 µs per loop
>
> In [47]: a = ones(1, dtype=np.long)
>
> In [48]: timeit inner(a,a)
> 1000000 loops, best of 3: 484 ns per loop
>
> In [49]: timeit inner1d(a,a)
> 1000000 loops, best of 3: 741 ns per loop
>
> In [50]: timeit einsum('...i,...i',a,a)
> 1000000 loops, best of 3: 811 ns per loop
>
> For big arrays, einsum has a better inner loop. For small arrays,
> einsum and inner1d suffer from call overhead. Note that the einsum
> overhead could be improved by special-casing, and that the various
> loops could be consolidated into a best-of-breed implementation.
>
> The ufunc implementation is easy to do for all cases.

Same thing for doubles. The speedup due to cblas can be seen, and the
iterator overhead becomes more visible.

In [51]: a = ones(100000, dtype=np.double)

In [52]: timeit inner(a,a)
10000 loops, best of 3: 32.1 µs per loop

In [53]: timeit inner1d(a,a)
10000 loops, best of 3: 134 µs per loop

In [54]: timeit einsum('...i,...i',a,a)
10000 loops, best of 3: 42 µs per loop

In [55]: a = ones(1, dtype=np.double)

In [56]: timeit inner(a,a)
1000000 loops, best of 3: 336 ns per loop

In [57]: timeit inner1d(a,a)
1000000 loops, best of 3: 742 ns per loop

In [58]: timeit einsum('...i,...i',a,a)
1000000 loops, best of 3: 809 ns per loop

But stacked vectors make a difference:

In [59]: a = ones((2, 1), dtype=np.double)

In [60]: timeit for i in range(2): inner(a[i],a[i])
1000000 loops, best of 3: 1.39 µs per loop

In [61]: timeit inner1d(a,a)
1000000 loops, best of 3: 749 ns per loop

In [62]: timeit einsum('...i,...i',a,a)
1000000 loops, best of 3: 888 ns per loop

Chuck
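A note for anyone reproducing the sessions above: inner1d is not in
the top-level numpy namespace, it is the gufunc shipped in the
test-support module numpy.core.umath_tests (the import location has
varied between numpy versions). A minimal setup looks like this, with
timings naturally varying by machine, build, and BLAS:

import numpy as np
from numpy import ones, inner, einsum
from numpy.core.umath_tests import inner1d  # gufunc, signature (i),(i)->()

a = ones(100000, dtype=np.long)  # np.long as in the session above
inner(a, a)                      # multiarray inner product
inner1d(a, a)                    # gufunc version
einsum('...i,...i', a, a)        # einsum's inner-product loop

# Gufunc broadcasting is what pays off for stacked vectors: a single
# call computes one inner product per row, with no Python-level loop.
b = ones((1000, 3), dtype=np.double)
inner1d(b, b)                    # -> shape (1000,)

In IPython, each line can then be timed with %timeit as in the
transcript.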
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
