Francesc Alted wrote:
> Numexpr is a fast numerical expression evaluator for NumPy. With it,
> expressions that operate on arrays (like "3*a+4*b") are accelerated
> and use less memory than doing the same calculation in Python.
>
> The expected speed-ups for Numexpr with respect to NumPy are between 0.95x
> and 15x, with 3x or 4x being typical values. The strided and unaligned
> case has been optimized too, so if the expression contains such arrays,
> the speed-up can increase significantly. Of course, you will need to
> operate with large arrays (typically larger than the cache size of your
> CPU) to see these improvements in performance.
>

Just recently I had a more detailed look at numexpr. Clever idea, easy to use! I can confirm a typical performance gain of 3x when working on large arrays (>100k entries).
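For reference, a minimal sketch of the kind of evaluation discussed above, using numexpr's `evaluate` function on the expression from the announcement. The array size and the variable names are illustrative assumptions, and the import is guarded so the snippet still runs (with plain NumPy) where numexpr is not installed. No timings are claimed, since they depend on the machine.

```python
import numpy as np

try:
    import numexpr as ne  # optional; the sketch falls back to NumPy without it
except ImportError:
    ne = None

n = 1_000_000  # illustrative size, well above the ">100k entries" mentioned
a = np.arange(n, dtype=np.float64)
b = np.ones(n, dtype=np.float64)

# Plain NumPy: builds temporaries for 3*a and 4*b before adding them.
expected = 3 * a + 4 * b

# numexpr: compiles the expression string and evaluates it block by block.
if ne is not None:
    result = ne.evaluate("3*a+4*b", local_dict={"a": a, "b": b})
else:
    result = expected.copy()
```

Wrapping each variant in `timeit` would be one way to check the speed-up on a given machine.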

I also tried the Vector Math Library (VML) contained in Intel's Math Kernel Library. It offers fast implementations of mathematical functions that operate on arrays. First I implemented a C extension providing new ufuncs. This gave me a big performance gain, e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x (6x) for division (no gain for add, sub, mul). The values in parentheses apply when I allow VML to use several threads and employ both cores of my Intel Core2Duo computer. For large arrays (100M entries) this performance gain is reduced because of limited memory bandwidth.

At this point I stumbled across numexpr and modified it to use the VML functions. For sufficiently long and complex numerical expressions I could get the maximum performance for large arrays as well. Together with VML, numexpr seems to be an extremely powerful tool for getting optimum performance. I would like to see numexpr extended to (optionally) make use of fast vectorized math functions.

There is one catch: VML supports (at the moment) only math on contiguous arrays. On a first try I didn't understand how to enforce this limitation in numexpr.

I also gave a quick try to the equivalent vector math library from AMD, acml_mv. I only tried sin and log, which gave me the same performance (on an Intel processor!) as Intel's VML.

I was also playing around with the block size in numexpr. What is the rationale that led to the current block size of 128? Especially with VML, a larger block size of 4096 instead of 128 made it possible to use multithreading in VML efficiently.

> Share your experience
> =====================
>
> Let us know of any bugs, suggestions, gripes, kudos, etc. you may
> have.
>

I was missing support for single-precision floats.

Great work!
Gregor

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
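As a rough illustration of the block-size question raised in this message, here is a sketch in plain NumPy. It is not numexpr's actual implementation and it does not call VML. `blocked_eval` is a hypothetical helper, and the expression `sin(a) + exp(b)` is chosen only as an example. It shows the general idea of evaluating an expression one block at a time, with the block size as a tunable parameter (128 and 4096 are the two values mentioned above).

```python
import numpy as np

def blocked_eval(a, b, block_size=128):
    """Evaluate sin(a) + exp(b) one block at a time (hypothetical helper)."""
    out = np.empty_like(a)
    for start in range(0, a.size, block_size):
        stop = min(start + block_size, a.size)
        # Each slice of a contiguous 1-D array is itself contiguous,
        # so a vectorized math routine could be called on it directly.
        out[start:stop] = np.sin(a[start:stop]) + np.exp(b[start:stop])
    return out

a = np.linspace(0.0, 1.0, 10_000)
b = np.linspace(-1.0, 1.0, 10_000)
small = blocked_eval(a, b, block_size=128)
large = blocked_eval(a, b, block_size=4096)
```

Both block sizes give the same values. What changes is how many elements each call to a math routine sees, which is presumably why a larger block gives a threaded library more work per call.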