On Wed, Aug 18, 2010 at 02:06:51AM +0100, Francesc Alted wrote:
> Hey Zbyszek,
>
> 2010/8/17, Zbyszek Szmek <[email protected]>:
> > Hi,
> > this is a problem which came up when trying to replace a hand-written
> > array concatenation with a call to numpy.vstack:
> > for some array sizes,
> >
> >     numpy.vstack(data)
> >
> > runs > 20% longer than a loop like
> >
> >     alldata = numpy.empty((tlen, dim))
> >     for x in data:
> >         step = x.shape[0]
> >         alldata[pos:pos+step] = x
> >         pos += step
> >
> > (example script attached)
> [clip]
>
> I was curious on what is happening here, so after some profiling with
> cachegrind, I've come to the conclusion that `numpy.concatenate` is
> using the `memcpy` system call so as to copy data from sources to
> recipient. On his hand, your `concat` function is making use of the
> `__setitem__` method of ndarray, which does not use `memcpy` (this is
> probably due to the fact that it has to deal with strides).
Hey Francesc,

thank you for your detailed answer. It seems that memcpy, which should
always be at least as fast as memmove, is sometimes slower! What happens
is that using slice assignment calls memmove(), which calls
_wordcopy_fwd_aligned() [1], which is apparently faster than memcpy() [2].

[1] http://www.eglibc.org/cgi-bin/viewcvs.cgi/trunk/libc/string/wordcopy.c?rev=77&view=auto
[2] http://www.eglibc.org/cgi-bin/viewcvs.cgi/trunk/libc/sysdeps/x86_64/memcpy.S?rev=11186&view=markup

I guess that you're not seeing the difference because I'm using an
amd64-specific memcpy written in assembly, while you're using an i586
implementation. I've tried to reproduce the problem in a C program, but
there memcpy is always much faster than memmove, as it should be.

I've verified that the difference between memcpy and memmove is the
problem by patching array_concatenate to always use memmove:

diff --git a/numpy/core/src/multiarray/multiarraymodule.c b/numpy/core/src/multiarray/multiarraymodule.c
index de63f33..e7f8643 100644
--- a/numpy/core/src/multiarray/multiarraymodule.c
+++ b/numpy/core/src/multiarray/multiarraymodule.c
@@ -437,7 +437,7 @@ PyArray_Concatenate(PyObject *op, int axis)
     data = ret->data;
     for (i = 0; i < n; i++) {
         numbytes = PyArray_NBYTES(mps[i]);
-        memcpy(data, mps[i]->data, numbytes);
+        memmove(data, mps[i]->data, numbytes);
         data += numbytes;
     }

which gives the same speedup as using slice assignment:

zbys...@ameba ~/mdp/tmp % python2.6 del_cum3.py numpy 10000 1000 10 10
problem size: (10000x1000) x 10 = 10^8
0.814s    <----- without the patch
zbys...@ameba ~/mdp/tmp % PYTHONPATH=/var/tmp/install/lib/python2.6/site-packages python2.6 del_cum3.py numpy 10000 1000 10 10
problem size: (10000x1000) x 10 = 10^8
0.637s    <----- with the stupid patch

> Now, it turns out that `memcpy` may be not optimal for every platform,
> and a direct fetch and assign approach could be sometimes faster. My
> guess is that this is what is happening in your case. On my machine,
> running latest Ubuntu Linux, I'm not seeing this difference though:
>
> fal...@ubuntu:~/carray$ python bench/concat.py numpy 1000 1000 10 3
> problem size: (1000x1000) x 10 = 10^7
> 0.247s
> fal...@ubuntu:~/carray$ python bench/concat.py concat 1000 1000 10 3
> problem size: (1000x1000) x 10 = 10^7
> 0.246s
>
> and neither when running Windows (XP):
>
> C:\tmp>python del_cum3.py numpy 10000 1000 1 10
> problem size: (10000x1000) x 1 = 10^7
> 0.227s
>
> C:\tmp>python del_cum3.py concat 10000 1000 1 10
> problem size: (10000x1000) x 1 = 10^7
> 0.223s

Probably the architecture (and thus the glibc implementation) matters
more than the operating system. But the problem depends very much on the
size of the arrays, and thus probably on alignment and other details.
N.B. with the parameters above I get 0.081s vs. 0.066s.

> Now the new method (carray) with compression level 1 (note the new
> parameter at the end of the command line):
>
> fal...@ubuntu:~/carray$ PYTHONPATH=. python bench/concat.py carray
> 1000000 10 3 1
> problem size: (1000000) x 10 = 10^7
> time for concat: 0.186s
> size of the final container: 5.076 MB

This looks very interesting! Do you think it would be possible to
automatically 'guess' if such compression makes sense and just use it
behind the scenes as 'decompress-on-write'? I'll try to do some
benchmarking tomorrow...

Thanks,
Zbyszek
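
P.S. For anyone who wants to play with this without the attachment, a
minimal version of the benchmark looks roughly like the following. This
is only a sketch: the default sizes and the concat helper below are
illustrative, not the exact contents of del_cum3.py.

    import time
    import numpy

    def concat(data, tlen, dim):
        # Preallocate the result and fill it with slice assignment,
        # which goes through ndarray.__setitem__ (and hence memmove
        # on the platform discussed above).
        alldata = numpy.empty((tlen, dim))
        pos = 0
        for x in data:
            step = x.shape[0]
            alldata[pos:pos+step] = x
            pos += step
        return alldata

    def main(rows=1000, dim=1000, nchunks=10):
        data = [numpy.random.rand(rows, dim) for _ in range(nchunks)]
        print("problem size: (%dx%d) x %d" % (rows, dim, nchunks))

        t0 = time.time()
        numpy.vstack(data)                  # PyArray_Concatenate -> memcpy
        print("vstack: %.3fs" % (time.time() - t0))

        t0 = time.time()
        concat(data, rows * nchunks, dim)   # slice assignment -> memmove
        print("concat: %.3fs" % (time.time() - t0))

    if __name__ == "__main__":
        main()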
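And to make the 'decompress-on-write' idea concrete, here is a toy
sketch of such a container, with zlib standing in for the fast
compressor that carray uses. This is just the concept, not the carray
API:

    import zlib
    import numpy

    class CompressedContainer(object):
        """Toy chunked container: appended (n, dim) arrays are stored
        as compressed chunks and decompressed only when the full array
        is requested."""

        def __init__(self, dim, dtype=numpy.float64, clevel=1):
            self.dim = dim
            self.dtype = numpy.dtype(dtype)
            self.clevel = clevel
            self.chunks = []
            self.nrows = 0

        def append(self, arr):
            a = numpy.ascontiguousarray(arr, dtype=self.dtype)
            self.chunks.append(zlib.compress(a.tobytes(), self.clevel))
            self.nrows += a.shape[0]

        def toarray(self):
            out = numpy.empty((self.nrows, self.dim), dtype=self.dtype)
            pos = 0
            for chunk in self.chunks:
                a = numpy.frombuffer(zlib.decompress(chunk),
                                     dtype=self.dtype)
                a = a.reshape(-1, self.dim)
                out[pos:pos + a.shape[0]] = a
                pos += a.shape[0]
            return out

    if __name__ == "__main__":
        data = [numpy.random.rand(1000, 100) for _ in range(10)]
        c = CompressedContainer(dim=100)
        for x in data:
            c.append(x)
        # lossless round-trip, so the result must match vstack exactly
        assert (c.toarray() == numpy.vstack(data)).all()

A container like this could compress its first chunk, check the achieved
ratio, and silently fall back to uncompressed storage when the data
doesn't compress well.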
