I am cross-posting this to the Cython user group to make sure they see it.

Sturla
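
For readers who have not followed the thread quoted below, here is a minimal sketch of the refcount check under discussion. It is a pure-Python toy model, not NumPy code. `ToyArray` and `add_with_elision` are hypothetical names, and the reference count is set by hand rather than read from the interpreter. The last case mimics the concern raised for Cython: an array that is not a temporary but still has refcnt 1.

```python
# Toy model of temporary elision. Not NumPy's implementation:
# ToyArray and add_with_elision are hypothetical, and refcnt is
# set by hand instead of being read from the interpreter.


class ToyArray:
    """List-backed stand-in for an ndarray with an explicit refcount."""

    def __init__(self, data, refcnt):
        self.data = list(data)
        self.refcnt = refcnt


def add_with_elision(left, right):
    """Elementwise add that reuses left's buffer if left looks temporary."""
    same_shape = len(left.data) == len(right.data)
    if left.refcnt == 1 and same_shape:
        # Only one reference exists, so assume nobody else can observe
        # the buffer and overwrite it in place.
        for i, value in enumerate(right.data):
            left.data[i] += value
        return left
    # Otherwise allocate a fresh result, as an out-of-place add would.
    return ToyArray([x + y for x, y in zip(left.data, right.data)], refcnt=1)


# Named arrays: one reference from the name, one from the eval stack.
a = ToyArray([1, 2, 3], refcnt=2)
b = ToyArray([10, 20, 30], refcnt=2)
c = ToyArray([100, 200, 300], refcnt=2)

tmp = add_with_elision(a, b)       # a has refcnt 2, so a new array is made
result = add_with_elision(tmp, c)  # tmp has refcnt 1, so its buffer is reused
assert result is tmp
assert a.data == [1, 2, 3]         # the named input is untouched

# The risky case: a non-temporary that nonetheless has refcnt 1,
# for example an array held only by a C-level variable.
x = ToyArray([1, 2, 3], refcnt=1)
y = add_with_elision(x, b)
assert y is x                      # x was silently overwritten
```

In this model the check behaves as intended for `(a + b) + c`, but `x` is modified even though the caller still holds it, which is the situation that needs checking for code generated by Cython.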

Nathaniel Smith <n...@pobox.com> wrote:
> On 18 Feb 2014 10:21, "Julian Taylor" <jtaylor.deb...@googlemail.com> wrote:
> > On Mon, Feb 17, 2014 at 9:42 PM, Nathaniel Smith <n...@pobox.com> wrote:
> > > On 17 Feb 2014 15:17, "Sturla Molden" <sturla.mol...@gmail.com> wrote:
> > > > Julian Taylor <jtaylor.deb...@googlemail.com> wrote:
> > > > > When an array is created it tries to get its memory from the
> > > > > cache, and when it's deallocated it returns it to the cache.
> > > > ...
> > >
> > > Another optimization we should consider that might help a lot in
> > > the same situations where this would help: for code called from the
> > > CPython eval loop, it's afaict possible to determine which inputs
> > > are temporaries by checking their refcnt. In the second call to
> > > __add__ in '(a + b) + c', the temporary will have refcnt 1, while
> > > the other arrays will all have refcnt > 1. In such cases (subject to
> > > various sanity checks on shape, dtype, etc.) we could elide
> > > temporaries by reusing the input array for the output. The risk is
> > > that there may be some code out there that calls these operations
> > > directly from C with non-temp arrays that nonetheless have refcnt 1,
> > > but we should at least investigate the feasibility. E.g. maybe we
> > > can do the optimization for tp_add but not PyArray_Add.
> >
> > This seems to be a really good idea. I experimented a bit and it
> > solves the temporary problem for this type of arithmetic nicely.
> > It's simple to implement: just change to inplace in the
> > array_{add,sub,mul,div} handlers for the Python slots. Doing so does
> > not fail the numpy, scipy and pandas test suites, so it seems safe.
> > Performance-wise, besides the simple page-zeroing-limited benchmarks
> > (a+b+c), it also brings the laplace out-of-place benchmark to the
> > same speed as the inplace benchmark [0]. This is very nice, as the
> > inplace variant is significantly harder to read.
>
> Sweet.
>
> > Does anyone see any issue we might be overlooking in this
> > refcount == 1 optimization for the Python API? I'll post a PR with
> > the change shortly.
>
> It occurs belatedly that Cython code like
>
>     a = np.arange(10)
>     b = np.arange(10)
>     c = a + b
>
> might end up calling tp_add with refcnt 1 arrays. Ditto for same with
> cdef np.ndarray or cdef object added. We should check...
>
> -n
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion