On Mon, Mar 17, 2008 at 1:59 PM, Gnata Xavier <[EMAIL PROTECTED]> wrote:
> Francesc Altet wrote: > > A Monday 17 March 2008, Christopher Barker escrigué: > > > >>> > Plus a certain amount of numpy code depends on order of > >>> > evaluation: > >>> > > >>> > a[:-1] = 2*a[1:] > >>> > >> I'm confused here. My understanding of how it now works is that the > >> above translates to: > >> > >> 1) create a new array (call it temp1) from a[1:], which shares a's > >> data block. > >> 2) create a temp2 array by multiplying 2 times each of the elements > >> in temp1, and writing them into a new array, with a new data block 3) > >> copy that temporary array into a[:-1] > >> > >> Why couldn't step (2) be parallelized? Why isn't it already with, > >> BLAS? Doesn't BLAS must have such simple routines? > >> > > > > Probably yes, but the problem is that this kind of operations, namely, > > vector-to-vector (usually found in the BLAS1 subset of BLAS), are > > normally memory-bounded, so you can take little avantage from using > > BLAS, most specially in modern processors, where the gap between the > > CPU throughput and the memory bandwith is quite high (and increasing). > > In modern machines, the use of BLAS is more interesting in vector-matrix > > (BLAS2) computations, but definitely is in matrix-matrix (BLAS3) ones > > (which is where the oportunities for cache reuse is higher) where the > > speedups can really be very good. > > > > > >> Also, maybe numexpr could benefit from this? > >> > > > > Maybe, but unfortunately it wouldn't be able to achieve high speedups. > > Right now, numexpr is focused in accelerating mainly vector-vector > > operations (or matrix-matrix, but element-wise, much like NumPy, so > > that the cache cannot be reused), with some smart optimizations for > > strided and unaligned arrays (in this scenario, it can be 2x or 3x > > faster than NumPy, even for very simple operations like 'a+b'). > > > > In a similar way, OpenMP (or whatever parallel paradigm) will only > > generally be useful when you have to deal with lots of data, and your > > algorithm can have the oportunity to structure it so that small > > portions of them can be reused many times. > > > > Cheers, > > > > > > Well, linear alagera is another topic. > > What I can see from IDL (for innstance) is that it provides the user > with a TOTAL function which take avantage of several CPU when the > number of elements is large. It also provides a very simple way to set a > max number of threads. > > I really really would like to see something like that in numpy (just to > be able to tell somone "switch to numpy it is free and you will get > exactly the same"). For now, I have a problem when they ask for // > functions like TOTAL. > > For now, we can do that using C inline threaded code but it is *complex* > and 2000x2000 images are now common. It is not a corner case any more. > Image processing may be a special in that many cases it is almost embarrassingly parallel. Perhaps some special libraries for that sort of application could be put together and just bits of c code be run on different processors. Not that I know much about parallel processing, but that would be my first take. Chuck
_______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion