Francesc Alted wrote:
> Numexpr is a fast numerical expression evaluator for NumPy. With it,
> expressions that operate on arrays (like "3*a+4*b") are accelerated
> and use less memory than doing the same calculation in Python.
> The expected speed-ups for Numexpr with respect to NumPy are between 0.95x
> and 15x, with 3x or 4x being typical values. The strided and unaligned
> case has been optimized too, so if the expression contains such arrays,
> the speed-up can increase significantly. Of course, you will need to
> operate with large arrays (typically larger than the cache size of your
> CPU) to see these improvements in performance.
I recently had a more detailed look at numexpr. Clever idea, easy
to use! I can confirm a typical performance gain of 3x if you work on
large arrays (>100k entries).
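For reference, a minimal sketch of the kind of expression being discussed, assuming numexpr is installed and using its `evaluate` function. The array size and variable names are illustrative only, and the fallback to plain NumPy is an addition so that the snippet still runs where numexpr is missing. It checks correctness, not speed.

```python
import numpy as np

try:
    import numexpr as ne  # optional; assumed to be installed
except ImportError:
    ne = None

n = 200_000  # above the ~100k entries mentioned above
a = np.arange(n, dtype=np.float64)
b = np.ones(n, dtype=np.float64)

# Plain NumPy: builds the temporaries 3*a and 4*b before adding them.
expected = 3 * a + 4 * b

if ne is not None:
    # Numexpr: compiles the string and evaluates it in small blocks.
    result = ne.evaluate("3*a + 4*b", local_dict={"a": a, "b": b})
else:
    result = expected.copy()

assert np.allclose(result, expected)
```

To compare timings, both variants could be wrapped in `timeit` calls on arrays larger than the CPU cache.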
I also tried the Vector Math Library (VML) contained in Intel's
Math Kernel Library. It offers fast implementations of mathematical
functions operating on arrays. First I implemented a C extension
providing new ufuncs. This gave me a big performance gain, e.g., 2.3x
(5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x (6x) for
division (no gain for add, sub, mul). The values in parentheses apply
when I allow VML to use several threads and employ both cores of
my Intel Core2Duo computer. For large arrays (100M entries) this
performance gain is reduced because of limited memory bandwidth. At this
point I stumbled across numexpr and modified it to use the VML
functions. For sufficiently long and complex numerical expressions I
could get the maximum performance for large arrays as well. Together with
VML, numexpr seems to be an extremely powerful tool for getting optimum
performance. I would like to see numexpr extended to (optionally) make
use of fast vectorized math functions. There is one catch: VML supports
(at the moment) only math on contiguous arrays. At a first try I didn't
understand how to enforce this limitation in numexpr. I also gave a
quick try to the equivalent vector math library from AMD, acml_mv. I only
tried sin and log, which gave me the same performance (on an Intel
processor!) as Intel's VML.
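The contiguity restriction can be illustrated at the NumPy level. This is only a sketch of the check a wrapper around a contiguous-only routine could perform; the helper name `as_contiguous` is hypothetical, and it is not how numexpr handles the problem internally.

```python
import numpy as np

def as_contiguous(arr):
    """Return arr unchanged if it is C-contiguous, else a contiguous copy."""
    if arr.flags["C_CONTIGUOUS"]:
        return arr
    return np.ascontiguousarray(arr)

x = np.arange(10, dtype=np.float64)
strided = x[::2]                 # a view that skips every other element
packed = as_contiguous(strided)  # copied into a contiguous buffer

print(strided.flags["C_CONTIGUOUS"], packed.flags["C_CONTIGUOUS"])
```

The copy costs extra memory traffic, so for strided inputs part of the gain from a fast vector routine would be spent on packing the data first.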
I also played around with the block size in numexpr. What is the
rationale that led to the current block size of 128? Especially with
VML, a larger block size of 4096 instead of 128 allowed VML to use
multithreading efficiently.
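To make the role of the block size concrete, here is a rough NumPy sketch of blocked evaluation. It is an illustration under assumptions, not numexpr's actual virtual machine: the function name `evaluate_blocked` and the example expression are made up, and only the values 128 and 4096 come from the discussion above.

```python
import numpy as np

def evaluate_blocked(a, b, block_size=128):
    """Compute sin(a)**2 + 3*b one block at a time (illustration only)."""
    out = np.empty_like(a)
    for start in range(0, a.size, block_size):
        stop = min(start + block_size, a.size)
        # Temporaries here are at most block_size elements long.
        tmp = np.sin(a[start:stop])
        tmp *= tmp
        out[start:stop] = tmp + 3 * b[start:stop]
    return out

a = np.linspace(0.0, 1.0, 10_000)
b = np.ones_like(a)
small = evaluate_blocked(a, b, block_size=128)
large = evaluate_blocked(a, b, block_size=4096)

assert np.allclose(small, large)
```

Small blocks keep the temporaries in cache, while each call into a vector routine has a fixed overhead (including any thread start-up), which a larger block spreads over more elements. That trade-off is a plausible reason why 4096 worked better with multithreaded VML.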
> Share your experience
> Let us know of any bugs, suggestions, gripes, kudos, etc. you may
I missed support for single-precision floats.
Numpy-discussion mailing list