2010/8/18, Zbyszek Szmek <[email protected]>:
> Thank you for your detailed answer. It seems that memcpy, which should
> always be faster than memmove, is sometimes slower! What happens is that
> using the slice assignment calls memmove(), which calls
> _wordcopy_fwd_aligned() [1], which is apparently faster than memcpy() [2].
>
> [1] http://www.eglibc.org/cgi-bin/viewcvs.cgi/trunk/libc/string/wordcopy.c?rev=77&view=auto
> [2] http://www.eglibc.org/cgi-bin/viewcvs.cgi/trunk/libc/sysdeps/x86_64/memcpy.S?rev=11186&view=markup
>
> I guess that you're not seeing the difference because I'm using an
> amd64-specific memcpy written in assembly, and you're using an i586
> implementation. I've tried to reproduce the problem in a C program,
> but there memcpy is always much faster than memmove, as it should be.
>
> I've verified that the difference between memcpy and memmove is the
> problem by patching array_concatenate to always use memmove:
>
> diff --git a/numpy/core/src/multiarray/multiarraymodule.c b/numpy/core/src/multiarray/multiarraymodule.c
> index de63f33..e7f8643 100644
> --- a/numpy/core/src/multiarray/multiarraymodule.c
> +++ b/numpy/core/src/multiarray/multiarraymodule.c
> @@ -437,7 +437,7 @@ PyArray_Concatenate(PyObject *op, int axis)
>      data = ret->data;
>      for (i = 0; i < n; i++) {
>          numbytes = PyArray_NBYTES(mps[i]);
> -        memcpy(data, mps[i]->data, numbytes);
> +        memmove(data, mps[i]->data, numbytes);
>          data += numbytes;
>      }
>
> which gives the same speedup as using slice assignment:
>
> zbys...@ameba ~/mdp/tmp % python2.6 del_cum3.py numpy 10000 1000 10 10
> problem size: (10000x1000) x 10 = 10^8
> 0.814s    <----- without the patch
>
> zbys...@ameba ~/mdp/tmp % PYTHONPATH=/var/tmp/install/lib/python2.6/site-packages python2.6 del_cum3.py numpy 10000 1000 10 10
> problem size: (10000x1000) x 10 = 10^8
> 0.637s    <----- with the stupid patch
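The del_cum3.py benchmark itself isn't included in the thread, so the
following is only a sketch of the kind of comparison being made, under the
assumption that it times np.concatenate (which copies with memcpy()) against
preallocation plus slice assignment (which, per the analysis above, ends up
in memmove()); the sizes mirror the quoted 10000x1000 x 10 run:

    # Sketch only, not the original del_cum3.py.
    import time
    import numpy as np

    n, m, reps = 10000, 1000, 10            # ~10^8 doubles; reduce n if RAM is tight
    chunks = [np.empty((n, m)) for _ in range(reps)]

    t0 = time.time()
    a = np.concatenate(chunks, axis=0)      # memcpy() path
    t1 = time.time()

    t2 = time.time()
    b = np.empty((n * reps, m))
    for i, c in enumerate(chunks):
        b[i * n:(i + 1) * n] = c            # slice assignment -> memmove() path
    t3 = time.time()

    print("concatenate:                %.3fs" % (t1 - t0))
    print("preallocate + slice assign: %.3fs" % (t3 - t2))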
OK. So it is pretty clear that the culprit is the poor performance of memcpy
on your platform. If you can confirm this, it would be nice if you could
report it to the memcpy maintainer for the glibc project.

> Probably the architecture (and thus glibc implementation) is more
> important than the operating system. But the problem is very much
> dependent on the size of the arrays, so probably on alignment and other
> details.

Yes. But if memmove is faster than memcpy, then I'd say that something is
wrong with memcpy. Another possibility is that the malloc in
`numpy.concatenate` is different from the malloc in `numpy.empty`, and that
they return memory blocks with different alignments; that could explain the
difference in performance too (although this possibility is remote, IMO;
see the sketch after this message).

>> Now the new method (carray) with compression level 1 (note the new
>> parameter at the end of the command line):
>>
>> fal...@ubuntu:~/carray$ PYTHONPATH=. python bench/concat.py carray 1000000 10 3 1
>> problem size: (1000000) x 10 = 10^7
>> time for concat: 0.186s
>> size of the final container: 5.076 MB
>
> This looks very interesting! Do you think it would be possible to
> automatically 'guess' if such compression makes sense and just use
> it behind the scenes as 'decompress-on-write'? I'll try to do some
> benchmarking tomorrow...

I'd say that, on relatively new processors (i.e. processors with around
3 MB of cache and a couple of cores or more), carray would in general be
faster than a pure ndarray approach in most cases. But indeed, benchmarking
is the best way to tell.

Cheers,

--
Francesc Alted
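Regarding the malloc-alignment hypothesis above, a quick (and equally
hedged) way to check it from Python is to compare the start addresses of
the data buffers that numpy.empty and numpy.concatenate hand back, via
ndarray.ctypes.data:

    # Sketch: do numpy.empty and numpy.concatenate return buffers with
    # different alignment?  If so, that could explain part of the gap.
    import numpy as np

    chunks = [np.empty(10**6) for _ in range(10)]
    candidates = [
        ("numpy.empty",       np.empty(10**7)),
        ("numpy.concatenate", np.concatenate(chunks)),
    ]

    for name, arr in candidates:
        addr = arr.ctypes.data              # address of the data buffer
        print("%-18s addr %% 16 = %2d, addr %% 64 = %2d"
              % (name, addr % 16, addr % 64))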
