Hi,

As mentioned, I don't think Theano swaps the stride buffer. Most of the time, we allocate with PyArray_Empty or PyArray_Zeros (not sure of the capitalization). The only exception I remember was changed in the last release to use PyArray_NewFromDescr(). Before that, we were allocating the PyArrayObject with the right number of dimensions and then manually filling the ptr, shapes and strides. I don't recall any swapping of the shapes or strides pointers anywhere in Theano.
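For reference, the new-style creation looks roughly like this. This is a minimal sketch, not our actual code: it assumes a float64 buffer, and wrap_buffer is just an illustrative name, not a function in Theano or NumPy.

    #include <numpy/arrayobject.h>

    /* Wrap an externally allocated buffer in a new PyArrayObject in one
       call, instead of creating the array first and patching ->data.
       Assumes import_array() was called in the module init. */
    static PyObject *
    wrap_buffer(void *data, int nd, npy_intp *dims, npy_intp *strides)
    {
        PyArray_Descr *descr = PyArray_DescrFromType(NPY_FLOAT64);
        if (descr == NULL) {
            return NULL;
        }
        /* PyArray_NewFromDescr steals the descr reference.  NumPy does
           not own `data`: OWNDATA is not set, so the caller remains
           responsible for freeing the buffer. */
        return PyArray_NewFromDescr(&PyArray_Type, descr, nd, dims,
                                    strides, data, NPY_ARRAY_WRITEABLE,
                                    NULL);
    }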
So I don't see why Theano would prevent doing just one malloc for the struct and the shapes/strides. If it does, tell me and I'll fix Theano :) I don't want Theano to prevent optimization in NumPy. Theano now completely supports the new NumPy C-API interface.

Nathaniel also mentioned that resizing the PyArray could prevent that. When Theano calls PyArray_Resize (not sure of the exact syntax), we always keep the number of dimensions the same. But I don't know if other code does differently. That could be a reason to keep separate allocations.

I don't know of any software that manually frees the strides/shapes pointer in order to swap it. So I also think your suggestion to change PyDimMem_NEW to call the small allocator is good. The new interface prevents people from doing that anyway, I think. Do we need to wait until we completely remove the old interface for this?

Fred

On Wed, Jan 8, 2014 at 1:13 PM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote:
> On 18.07.2013 15:36, Nathaniel Smith wrote:
>> On Wed, Jul 17, 2013 at 5:57 PM, Frédéric Bastien <no...@nouiz.org> wrote:
>>> On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith <n...@pobox.com> wrote:
>>>>> On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <n...@pobox.com> wrote:
>>>>
>>>> It's entirely possible I misunderstood, so let's see if we can work it
>>>> out. I know that you want to assign to the ->data pointer in a
>>>> PyArrayObject, right? That's what caused some trouble with the 1.7 API
>>>> deprecations, which were trying to prevent direct access to this
>>>> field? Creating a new array given a pointer to a memory region is no
>>>> problem, and obviously will be supported regardless of any
>>>> optimizations. But if that's all you were doing then you shouldn't
>>>> have run into the deprecation problem. Or maybe I'm misremembering!
>>>
>>> What is currently done in only one place is to create a new
>>> PyArrayObject with a given ptr, so NumPy doesn't do the allocation. We
>>> later change that ptr to another one.
>>
>> Hmm, OK, so that would still work. If the array has the OWNDATA flag
>> set (or you otherwise know where the data came from), then swapping
>> the data pointer would still work.
>>
>> The change would be that in most cases when asking numpy to allocate a
>> new array from scratch, the OWNDATA flag would not be set. That's
>> because the OWNDATA flag really means "when this object is
>> deallocated, call free(self->data)", but if we allocate the array
>> struct and the data buffer together in a single memory region, then
>> deallocating the object will automatically cause the data buffer to be
>> deallocated as well, without the array destructor having to take any
>> special effort.
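(To make Nathaniel's point concrete, here is a toy sketch of the single-allocation idea. This is not NumPy's actual code, just an illustration of why such an array must not carry OWNDATA:

    #include <stdlib.h>
    #include <string.h>

    /* Toy object: the header and the data buffer share one malloc'd
       block, via a C99 flexible array member. */
    typedef struct {
        size_t nbytes;   /* stand-in for the real PyArrayObject fields */
        char   data[];   /* data lives in-line, right after the header */
    } toy_array;

    static toy_array *toy_array_new(size_t nbytes)
    {
        toy_array *a = malloc(sizeof *a + nbytes);  /* one allocation */
        if (a != NULL) {
            a->nbytes = nbytes;
            memset(a->data, 0, nbytes);
        }
        return a;
    }

    static void toy_array_free(toy_array *a)
    {
        /* No separate free(a->data): the buffer dies with the struct,
           which is exactly why OWNDATA would not be set. */
        free(a);
    }

)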
>>> It is the change to the ptr of the just-created PyArrayObject that
>>> caused problems with the interface deprecation. I fixed all the other
>>> problems related to the deprecation (mostly just renames of
>>> functions/macros), but I didn't fix this one yet. I would need to change
>>> the logic to compute the final ptr before creating the PyArrayObject and
>>> create it with the final data ptr. But in all cases, NumPy didn't
>>> allocate the data memory for this object, so this case doesn't block
>>> your optimization.
>>
>> Right.
>>
>>> One thing on our optimization "wish list" is to reuse allocated
>>> PyArrayObjects between Theano function calls for intermediate results
>>> (so completely under Theano's control). This could be useful in
>>> particular for reshape/transpose/subtensor. Those functions are pretty
>>> fast, and from memory, I already found the allocation time was
>>> significant. But in those cases, it is PyArrayObjects that are views, so
>>> the metadata and the data would be in different memory regions in all
>>> cases.
>>>
>>> The other case on our optimization "wish list" is to reuse the
>>> PyArrayObject when the shape isn't the right one (but the number of
>>> dimensions is the same). If we do that for operations like addition, we
>>> will need to use PyArray_Resize(). This will be done on PyArrayObjects
>>> whose data memory was allocated by NumPy. So if you do one memory
>>> allocation for metadata and data, just make sure that PyArray_Resize()
>>> will handle that correctly.
>>
>> I'm not sure I follow the details here, but it does turn out that a
>> really surprising amount of time in PyArray_NewFromDescr is spent just
>> calculating and writing out the shape and strides buffers, so for
>> programs that e.g. use hundreds of small 3-element arrays to represent
>> points in space, re-using even these buffers might be a big win...
>>
>>> On the usefulness of doing only one memory allocation: in our old gpu
>>> ndarray, we were doing two allocs on the GPU, one for metadata and one
>>> for data. I removed this, as it was a bottleneck. Allocation on the CPU
>>> is faster than on the GPU, but it is still slow unless you reuse memory.
>>> Does PyMem_Malloc reuse previous small allocations?
>>
>> Yes, at least in theory PyMem_Malloc is highly optimized for small
>> buffer re-use. (For requests >256 bytes it just calls malloc().) And
>> it's possible to define type-specific freelists; not sure if there's
>> any value in doing that for PyArrayObjects. See Objects/obmalloc.c in
>> the Python source tree.
>>
>> -n
>
> PyMem_Malloc is just a wrapper around malloc, so it is only as optimized
> as the C library is (glibc is not good for small allocations).
> PyObject_Malloc uses a small-object allocator for requests smaller than
> 512 bytes (256 in python2).
>
> I filed a pull request [0] replacing a few functions which I think are
> safe to convert to this API: the nditer allocation, which is completely
> encapsulated, and the construction of the scalar and array python
> objects, which are deleted via the tp_free slot (we really should not
> support third-party libraries using PyMem_Free on python objects without
> checks).
>
> This already gives up to 15% improvement for scalar operations compared
> to glibc 2.17 malloc.
>
> Do I understand the discussions here right that we could replace
> PyDimMem_NEW, which is used for strides in PyArray, with the
> small-object allocation too? It would still allow swapping the stride
> buffer, but every application must then delete it with PyDimMem_FREE,
> which should be a reasonable requirement.
>
> [0] https://github.com/numpy/numpy/pull/4177
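For what Julian asks about, the change could look roughly like this. This is only a hedged sketch of the idea being discussed, not actual NumPy code or the contents of the pull request above; the MY_ prefix is there to make clear these are not the shipped macros.

    #include <Python.h>
    #include <numpy/ndarraytypes.h>  /* npy_intp */

    /* Hypothetical replacements: PyObject_Malloc serves small requests
       from pooled arenas, and anything allocated with it must be
       released with PyObject_Free.  That is why every user of the
       stride buffer would then have to go through PyDimMem_FREE
       rather than a raw free(). */
    #define MY_PyDimMem_NEW(size) \
        ((npy_intp *)PyObject_Malloc((size) * sizeof(npy_intp)))
    #define MY_PyDimMem_FREE(ptr) PyObject_Free(ptr)

That would keep the swap-the-buffer use case working while routing the common small shape/stride allocations through the pooled allocator instead of the C library's malloc.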