Christopher Barker wrote: > OK -- this one I'm intending to send! > > Hi all, > > This idea was inspired by a discussion at the SciPy conference, in which > we spent a LOT of time during the numpy tutorial talking about how to > accumulate values in an array when you don't know how big the array > needs to be when you start. > > The "standard practice" is to accumulate in a python list, then convert > the final result into an array. This is a good idea because Python lists > are standard, well tested, efficient, etc. > > However, as was pointed out in that lengthy discussion, if what you are > doing is accumulating is a whole bunch of numbers (ints, floats, > whatever), or particularly if you need to accumulate a data type that > plain python doesn't support, there is a lot of overhead involved: a > python float type is pretty heavyweight. If performance or memory use is > important, it might create issues. You can use and array.array, but it > doesn't support all numpy types, particularly custom dtypes. > > I talked about this on the cython list (as someone asked how to do > accumulate in cython), and a few folks thought it would be useful, so I > put together a prototype. > > What I have in mind is very simple. It would be: > - Only 1-d > - Support append() and extend() methods > - support indexing and slicing > - Support any valid numpy dtype > - which could even get you pseudo n-d arrays... > - maybe it would act like an array in other ways, I'm not so sure. > - ufuncs, etc. > > It could take the place of using python lists/arrays when you really > want a numpy array, but don't know how big it will be until you've > filled it. > > The implementation I have now uses a regular numpy array as the > "buffer". The buffer is re-sized as needed with ndarray.resize(). I've > enclosed the class, a bunch of tests (This is the first time I've ever > really done test-driven development, though I wouldn't say that this is > a complete test suite). > > A few notes about this implementation: > > * the name of the class could be better, and so could some of the > method names. > > * on further thought, I think it could handle n-d arrays, as long as > you only accumulated along the first index. > > * It could use a bunch more methods > - deleting part of the array > - math > - probably anything supported by array.array would be good. > > * Robert pointed me to the array.array implementation to see how it > expands the buffer as you append. It did tricks to get it to grow fast > when the array is very small, then eventually to add about 1/16 of the > used array size to the buffer. I imagine that this would gets used > because you were likely to have a big array, so I didn't bother and > start with a buffer at 128 elements, then add 1/4 each time you need to > expand -- these are both tweakable attributes. > > * I'm keeping the buffer a hidden variable, and slicing and __array__ > return copies - this is so that it won't get multiple references, and > then not be expandable. > > * I did a little simple profiling, and discovered that it's slower > than a python list by a factor of more than 2 (for accumulating python > ints, anyway). With a bit of experimentation, I think that's because of > a couple factors: > - an extra function call -- the append() method needs to then do an > assignment to the buffer > - Object conversion -- python lists store python objects, so the > python int can just go right in there. with numpy, it needs to be > converted to a C int first -- a bit if extra overhead. Though a straight > assignment into a pre-allocated array i faster than a list. > > I think it's still an improvement for memory use. > > Maybe it would be worth writing in C or Cython to avoid some of this. In > particular, it would be nice if you could use it in Cython, and put C > types directly it... > > * This could be pretty useful for things like genfromtxt. > > What do folks think? is this useful? What would you change, etc?
I'd drop the __getslice__ as it is deprecated (in Python 3 it is removed). Slices will be passed as "slice" objects to __getitem__ if you don't provide __getslice__. One could support myaccumulator[[1,2,3]] as well in __getitem__, although I guess it gets a little hairy as you must seek through the array-like object passed and see to it that no values are too large. -- Dag Sverre _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion