On Tuesday 20 January 2009, Andrew Collette wrote:
> Hi Francesc,
>
> Looks like a cool project!  However, I'm not able to achieve the
> advertised speed-ups.  I wrote a simple script to try three
> approaches to this kind of problem:
>
> 1) Native Python code (i.e. will try to do everything at once using
>    temp arrays)
> 2) Straightforward numexpr evaluation
> 3) Simple "chunked" evaluation using array.flat views.  (This solves
>    the memory problem and allows the use of arbitrary Python
>    expressions).
>
> I've attached the script; here's the output for the expression
> "63 + (a*b) + (c**2) + sin(b)"
> along with a few combinations of shapes/dtypes.  As expected, using
> anything other than "f8" (double) results in a performance penalty.
> Surprisingly, it seems that using chunks via array.flat results in
> similar performance for f8, and even better performance for other
> dtypes.
[clip]
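The attached script is not reproduced in this message, so here is a minimal sketch of what the three approaches might look like. It is not Andrew's code: the function names, the `chunk` parameter and its default of 4096 elements are assumptions made for illustration. Note that slicing `array.flat` yields a small copy of each block rather than a true view, which is what keeps the temporaries small.

```python
import numpy as np

try:
    import numexpr as ne  # optional; only needed for with_numexpr()
except ImportError:
    ne = None


def simple(a, b, c):
    # Approach 1: plain NumPy. Every sub-expression allocates a
    # full-size temporary array.
    return 63 + (a * b) + (c ** 2) + np.sin(b)


def with_numexpr(a, b, c):
    # Approach 2: hand the whole expression to numexpr as a string.
    if ne is None:
        raise RuntimeError("numexpr is not installed")
    return ne.evaluate("63 + (a*b) + (c**2) + sin(b)",
                       local_dict={"a": a, "b": b, "c": c})


def chunked(a, b, c, chunk=4096):
    # Approach 3: evaluate block by block through .flat, so each
    # temporary holds at most `chunk` elements (hypothetical default).
    out = np.empty(a.shape, dtype=np.result_type(a, b, c))
    for start in range(0, a.size, chunk):
        s = slice(start, start + chunk)
        bs = b.flat[s]  # slicing .flat copies just this block
        out.flat[s] = 63 + (a.flat[s] * bs) + (c.flat[s] ** 2) + np.sin(bs)
    return out
```

Each function can be timed with the standard `timeit` module on arrays of the shapes and dtypes discussed here, e.g. `(100, 100, 100)` with `f4` or `f8`.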

Well, there were two issues there.  The first one is that when
transcendental functions are used (like sin() above), the bottleneck
is the CPU instead of memory bandwidth, so numexpr speed-ups are not
as high as usual.  The other issue was an actual bug in the numexpr
code that forced a copy of all multidimensional arrays (I normally
only use unidimensional arrays for doing benchmarks).  This has been
fixed in trunk (r39).

So, with the fix on, the timings are:

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.0426136016846
Numexpr:  0.11350851059
Chunked:  0.0635252952576
(100, 100, 100) f8 (average of 10 runs)
Simple:  0.119254398346
Numexpr:  0.10092959404
Chunked:  0.128384995461

The speed-up is now a mere 20% (for f8), but at least it is not
slower.  With the patches that Georg recently contributed for using
Intel's VML, the acceleration is a bit better:

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.0417867898941
Numexpr:  0.0944641113281
Chunked:  0.0636183023453
(100, 100, 100) f8 (average of 10 runs)
Simple:  0.120059680939
Numexpr:  0.0832288980484
Chunked:  0.128114104271

i.e. the speed-up is around 45% (for f8).

Moreover, if I get rid of the sin() function and use the expression:

"63 + (a*b) + (c**2) + b"

I get:

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.0119329929352
Numexpr:  0.0198570966721
Chunked:  0.0338240146637
(100, 100, 100) f8 (average of 10 runs)
Simple:  0.0255623102188
Numexpr:  0.00832500457764
Chunked:  0.0340095996857

which is a 3.1x speed-up (for f8).

> FYI, the current tar file (1.1-1) has a glitch related to the VERSION
> file; I added to the bug report at google code.

Thanks.  Will focus on that asap.  Mmm, it seems there is enough
material for another release of numexpr.  I'll try to do it soon.

Cheers,

-- 
Francesc Alted