2009/10/5 Christopher Barker <chris.bar...@noaa.gov>:
> Francesc Alted wrote:
>> On Saturday 03 October 2009 10:06:12, Christopher Barker wrote:
>>> This idea was inspired by a discussion at the SciPy conference, in
>>> which we spent a LOT of time during the numpy tutorial talking about
>>> how to accumulate values in an array when you don't know how big the
>>> array needs to be when you start.
>
>>> What I have in mind is very simple. It would be:
>>> - Only 1-d
>>> - Support append() and extend() methods
>>> - Support indexing and slicing
>>> - Support any valid numpy dtype
>>>   - which could even get you pseudo n-d arrays...
>>> - Maybe it would act like an array in other ways; I'm not so sure.
>>>   - ufuncs, etc.
>
>> That's interesting. I'd normally use the `resize()` method for what
>> you want, but indeed your approach is much easier to use.
>
> Of course, this is using resize() under the hood, but it gives resize()
> an easier interface, and more importantly it adds the pre-allocation
> for you, and the code to deal with that. I suppose I should benchmark
> it, but I think calling resize() with every append would be a lot
> slower (though maybe not -- might the compiler/os be pre-allocating
> some extra memory anyway?)
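Just to put a number on the resize()-with-every-append question: here is
a quick way one could time it against the obvious
accumulate-in-a-python-list approach. This is only a sketch -- the
function names are made up for illustration and aren't anything in
numpy:

import numpy as np
import timeit

def grow_with_resize(n):
    # call ndarray.resize() for every single append
    a = np.empty(0, dtype=np.float64)
    for i in range(n):
        # refcheck=False skips the reference check, so resize() works
        # even while temporary references to `a` exist
        a.resize(i + 1, refcheck=False)
        a[i] = i
    return a

def grow_with_list(n):
    # accumulate in a python list, convert to an array once at the end
    acc = []
    for i in range(n):
        acc.append(i)
    return np.asarray(acc, dtype=np.float64)

n = 100000
for fn in (grow_with_resize, grow_with_list):
    t = timeit.timeit(lambda: fn(n), number=3)
    print("%s: %.3f s for 3 runs of %d appends" % (fn.__name__, t, n))

I'd expect the list version to win as n grows, but the timing would
settle it.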
I looked into the malloc question at some point, and under Linux, malloc
doesn't allocate substantial extra memory until you get big enough that
it's allocating complete memory pages, at which point you get everything
up to the end of the last page. At that point it's possible that adding
more memory onto the end of the malloced region (and maybe even moving
the array around in memory) becomes really cheap, since it's just
requesting more memory from the OS.

Also, a friend who's a bare-metal programming wizard pointed out to me
that modern malloc implementations really hate realloc, since it tends
to put memory blocks in arenas intended for different sizes. I think
that's only really an issue for shrinking blocks, since they probably
just allocate a new block when growing (unless they're in the
pages-at-a-time regime).

In short, I think it's better to have a python-list-like growing scheme.
In fact it's maybe more important for arrays than for python lists,
since in a python list all that needs to be moved are pointers to the
actual python objects, only ever a small fraction of the data volume.

> I should profile this -- if you can call resize() with every new item,
> and it's not too slow, then it may not be worth writing this class at
> all (or I could make it simpler, maybe even an ndarray subclass
> instead).

Keep in mind the need for sensible handling of slices, since the
underlying array will probably move on every resize. I think there's a
need for this code.

>> If you are looking for performance improvements, I'd have a look at
>> the `PyArray_Resize()` function in 'core/src/multiarray/shape.c'
>> (trunk). It seems to me that the zero-initialization of added memory
>> can be skipped, allowing for more performance for the `resize()`
>> method (especially for large size increments).
>
> I suppose so, but I doubt that's causing any of my performance issues.
> Another thing to profile.

Probably worth profiling, yes -- I wouldn't worry about the time spent
writing zeros, but it does mean you have to touch all the allocated
memory, which can't be great for the cache.

Anne
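P.S. To make "python-list-like growing scheme" concrete, here is a
minimal sketch of the sort of class I have in mind. The name
GrowableArray, the 1.5x growth factor, and the method set are all
illustrative choices, not a worked-out proposal:

import numpy as np

class GrowableArray(object):
    """1-d appendable array with list-style geometric over-allocation.

    Illustrative sketch only -- not an existing numpy class.
    """

    def __init__(self, dtype=np.float64, initial_capacity=8):
        self._data = np.empty(initial_capacity, dtype=dtype)
        self._size = 0  # number of valid elements

    def append(self, value):
        if self._size == len(self._data):
            # grow geometrically (~1.5x) so n appends cost O(n)
            # amortized, instead of O(n**2) as with an exact resize on
            # every append
            new_capacity = max(len(self._data) * 3 // 2, self._size + 1)
            # refcheck=False is safe here because we never hand out
            # views of _data (see __getitem__ below)
            self._data.resize(new_capacity, refcheck=False)
        self._data[self._size] = value
        self._size += 1

    def extend(self, values):
        for value in values:
            self.append(value)

    def __len__(self):
        return self._size

    def __getitem__(self, index):
        # index only the used portion, and return a *copy* for slices:
        # the underlying buffer can move on the next resize, which
        # would leave a view pointing at freed memory
        result = self._data[:self._size][index]
        if isinstance(index, slice):
            return result.copy()
        return result

    def to_array(self):
        # final, right-sized array (a copy, for the same reason)
        return self._data[:self._size].copy()

Returning copies from slices is the conservative answer to the
buffer-moving problem I mentioned above; views would be cheaper but
would dangle as soon as the next append triggers a resize.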