Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
On Tuesday 20 January 2009, Andrew Collette wrote:

Works much, much better with the current svn version. :) Numexpr now outperforms everything except the simple technique, and then only for small data sets.

Correct. This is because of the cost of parsing the expression and initializing the virtual machine. However, as soon as the sizes of the operands exceed the cache of your processor, you start to see the improvement in performance.

Along the lines you mentioned, I noticed that simply changing from a shape of (100*100*100,) to (100, 100, 100) results in nearly a factor of 2 worse performance, a factor which seems constant when changing the size of the data set.

Sorry, but I cannot reproduce this. When using the expression:

63 + (a*b) + (c**2) + b

I get on my machine (co...@3 GHz, running openSUSE Linux 11.1):

100 f8 (average of 10 runs)
Simple: 0.0278068065643
Numexpr: 0.00839750766754
Chunked: 0.0266514062881

(100, 100, 100) f8 (average of 10 runs)
Simple: 0.0277318000793
Numexpr: 0.00851640701294
Chunked: 0.0346593856812

and these are the expected results (i.e. no change in performance due to multidimensional arrays). Even for larger arrays, I don't see anything unexpected:

1000 f8 (average of 10 runs)
Simple: 0.334054994583
Numexpr: 0.110022115707
Chunked: 0.29678030014

(100, 100, 100, 10) f8 (average of 10 runs)
Simple: 0.339299607277
Numexpr: 0.111632704735
Chunked: 0.375299096107

Can you tell us which platform you are using?

Is this related to the way numexpr handles broadcasting rules? It would seem the memory contents should be identical for these two cases.

Andrew

On Tue, Jan 20, 2009 at 6:13 AM, Francesc Alted fal...@pytables.org wrote:
[clip]

--
Francesc Alted
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
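For what it's worth, the claim that the memory contents are identical for the two shapes can be verified directly in NumPy; this small sketch (independent of numexpr) shows that reshaping a contiguous array only changes the shape/strides metadata, not the underlying buffer:

```python
import numpy as np

flat = np.arange(10 * 10 * 10, dtype='f8')   # shape (1000,)
cube = flat.reshape(10, 10, 10)              # shape (10, 10, 10)

# reshape of a contiguous array returns a view: same buffer, same
# C-contiguous element order; only the shape/strides metadata differ.
print(cube.flags['C_CONTIGUOUS'])            # True
print(np.shares_memory(flat, cube))          # True
print(np.array_equal(flat, cube.ravel()))    # True
```

So any performance difference between the two shapes has to come from how the expression evaluator walks the array, not from the data layout itself.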
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Hi,

I get identical results for both shapes now; I manually removed the numexpr-1.1.1.dev-py2.5-linux-i686.egg folder in site-packages and reinstalled. I suppose there must have been a stale set of files somewhere.

Andrew Collette

On Wed, Jan 21, 2009 at 3:41 AM, Francesc Alted fal...@pytables.org wrote:
[clip]
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
On Tuesday 20 January 2009, Andrew Collette wrote:

Hi Francesc,

Looks like a cool project! However, I'm not able to achieve the advertised speed-ups. I wrote a simple script to try three approaches to this kind of problem: 1) native Python code (i.e. it will try to do everything at once using temp arrays), 2) straightforward numexpr evaluation, and 3) simple chunked evaluation using array.flat views (this solves the memory problem and allows the use of arbitrary Python expressions). I've attached the script; here's the output for the expression 63 + (a*b) + (c**2) + sin(b) along with a few combinations of shapes/dtypes. As expected, using anything other than f8 (double) results in a performance penalty. Surprisingly, it seems that using chunks via array.flat results in similar performance for f8, and even better performance for other dtypes.

[clip]

Well, there were two issues there. The first one is that when transcendental functions are used (like sin() above), the bottleneck is the CPU instead of memory bandwidth, so numexpr speed-ups are not as high as usual. The other issue was an actual bug in the numexpr code that forced a copy of all multidimensional arrays (I normally only use one-dimensional arrays for doing benchmarks). This has been fixed in trunk (r39).

So, with the fix in, the timings are:

(100, 100, 100) f4 (average of 10 runs)
Simple: 0.0426136016846
Numexpr: 0.11350851059
Chunked: 0.0635252952576

(100, 100, 100) f8 (average of 10 runs)
Simple: 0.119254398346
Numexpr: 0.10092959404
Chunked: 0.128384995461

The speed-up is now a mere 20% (for f8), but at least it is not slower. With the patches that Gregor recently contributed for using Intel's VML, the acceleration is a bit better:

(100, 100, 100) f4 (average of 10 runs)
Simple: 0.0417867898941
Numexpr: 0.0944641113281
Chunked: 0.0636183023453

(100, 100, 100) f8 (average of 10 runs)
Simple: 0.120059680939
Numexpr: 0.0832288980484
Chunked: 0.128114104271

i.e. the speed-up is around 45% (for f8).

Moreover, if I get rid of the sin() function and use the expression:

63 + (a*b) + (c**2) + b

I get:

(100, 100, 100) f4 (average of 10 runs)
Simple: 0.0119329929352
Numexpr: 0.0198570966721
Chunked: 0.0338240146637

(100, 100, 100) f8 (average of 10 runs)
Simple: 0.0255623102188
Numexpr: 0.00832500457764
Chunked: 0.0340095996857

which is a 3.1x speed-up (for f8).

FYI, the current tar file (1.1-1) has a glitch related to the VERSION file; I added it to the bug report at Google Code.

Thanks. I will focus on that ASAP.

Mmm, seems like there is enough stuff for another release of numexpr. I'll try to do it soon.

Cheers,

--
Francesc Alted
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Works much, much better with the current svn version. :) Numexpr now outperforms everything except the simple technique, and then only for small data sets.

Along the lines you mentioned, I noticed that simply changing from a shape of (100*100*100,) to (100, 100, 100) results in nearly a factor of 2 worse performance, a factor which seems constant when changing the size of the data set. Is this related to the way numexpr handles broadcasting rules? It would seem the memory contents should be identical for these two cases.

Andrew

On Tue, Jan 20, 2009 at 6:13 AM, Francesc Alted fal...@pytables.org wrote:
[clip]
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Thanks! I think this will help the package attract a lot of users. A couple of housekeeping things:

On http://code.google.com/p/numexpr: "What it is?" -> "What is it?", or "What it is" (no question mark).

On http://code.google.com/p/numexpr/wiki/Overview: the last example got incorporated as straight text somehow. Also, in Firefox the first code example runs into the pastel boxes on the right for modest-width browsers. This is a common problem with Firefox, but I think it comes from improper HTML code that IE somehow deals with, rather than from non-standard behavior in Firefox.

One thing I'd add is a benchmark example against numpy. Make it simple, so that people can copy and modify the benchmark code to test their own performance improvements.

I added an entry for it on the Topical Software list. Please check it out and modify as you see fit.

--jh--
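A minimal copy-and-modify benchmark along those lines might look like the following. This is a sketch using plain NumPy only: the expression, array size, and the `chunked` variant are illustrative, and a numexpr timing could be added by swapping in ne.evaluate("2*a + 3*b") as a third candidate.

```python
import timeit

import numpy as np

a = np.arange(1_000_000, dtype='f8')
b = np.arange(1_000_000, dtype='f8')

def simple():
    # Plain NumPy: 2*a and 3*b each allocate a full-size temporary.
    return 2*a + 3*b

def chunked(chunksize=4096):
    # Same expression evaluated block by block to stay within cache.
    out = np.empty_like(a)
    for i in range(0, a.size, chunksize):
        s = slice(i, i + chunksize)
        out[s] = 2*a[s] + 3*b[s]
    return out

# Correctness first, then timing.
assert np.allclose(simple(), chunked())
for name, fn in [('simple', simple), ('chunked', chunked)]:
    print(name, min(timeit.repeat(fn, number=5, repeat=3)))
```

Keeping the candidates as plain zero-argument callables makes it trivial for readers to drop in their own variant and rerun the same timing loop.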
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Hi Francesc,

Looks like a cool project! However, I'm not able to achieve the advertised speed-ups. I wrote a simple script to try three approaches to this kind of problem:

1) Native Python code (i.e. it will try to do everything at once using temp arrays)
2) Straightforward numexpr evaluation
3) Simple chunked evaluation using array.flat views. (This solves the memory problem and allows the use of arbitrary Python expressions.)

I've attached the script; here's the output for the expression 63 + (a*b) + (c**2) + sin(b) along with a few combinations of shapes/dtypes. As expected, using anything other than f8 (double) results in a performance penalty. Surprisingly, it seems that using chunks via array.flat results in similar performance for f8, and even better performance for other dtypes.

(100, 100, 100) f4 (average of 10 runs)
Simple: 0.155238199234
Numexpr: 0.278440499306
Chunked: 0.166213512421

(100, 100, 100) f8 (average of 10 runs)
Simple: 0.241649699211
Numexpr: 0.192837905884
Chunked: 0.183888602257

(100, 100, 100, 10) f4 (average of 10 runs)
Simple: 1.56741549969
Numexpr: 3.40679829121
Chunked: 1.83729870319

(100, 100, 100) i4 (average of 10 runs)
Simple: 0.206279683113
Numexpr: 0.210431909561
Chunked: 0.182894086838

FYI, the current tar file (1.1-1) has a glitch related to the VERSION file; I added it to the bug report at Google Code.

Andrew Collette

On Fri, Jan 16, 2009 at 4:00 AM, Francesc Alted fal...@pytables.org wrote:

Announcing Numexpr 1.1
======================

Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python.

The expected speed-ups of Numexpr with respect to NumPy are between 0.95x and 15x, with 3x or 4x being typical values. The strided and unaligned cases have been optimized too, so if the expression contains such arrays, the speed-up can increase significantly. Of course, you will need to operate with large arrays (typically larger than the cache size of your CPU) to see these improvements in performance.

This release is mainly intended to put in sync some of the improvements that the Numexpr version integrated in PyTables already had. So, this standalone version of Numexpr will benefit from the well-tested PyTables version that has been in production for more than a year now. In case you want to know in more detail what has changed in this version, have a look at ``RELEASE_NOTES.txt`` in the tarball.

Where can I find Numexpr?
=========================

The project is hosted at Google Code at:

http://code.google.com/p/numexpr/

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.

Enjoy!

--
Francesc Alted

[attached script]

import numpy as np
import numexpr as nx
import time

test_shape = (100, 100, 100)   # All 3 arrays have this shape
test_dtype = 'i4'
nruns = 10                     # Ensemble for timing

test_size = np.product(test_shape)

def chunkify(chunksize):
    """Very stupid chunk vectorizer which keeps memory use down.

    This version requires all inputs to have the same number of
    elements, although it shouldn't be that hard to implement simple
    broadcasting.
    """
    def chunkifier(func):
        def wrap(*args):
            assert len(args) > 0
            assert all(len(a.flat) == len(args[0].flat) for a in args)
            nelements = len(args[0].flat)
            out = np.ndarray(args[0].shape)
            for start in xrange(0, nelements, chunksize):
                stop = min(start + chunksize, nelements)
                iargs = tuple(a.flat[start:stop] for a in args)
                out.flat[start:stop] = func(*iargs)
            return out
        return wrap
    return chunkifier

test_func_str = "63 + (a*b) + (c**2) + sin(b)"

def test_func(a, b, c):
    return 63 + (a*b) + (c**2) + np.sin(b)

test_func_chunked = chunkify(100*100)(test_func)

# The actual data we'll use
a = np.arange(test_size, dtype=test_dtype).reshape(test_shape)
b = np.arange(test_size, dtype=test_dtype).reshape(test_shape)
c = np.arange(test_size, dtype=test_dtype).reshape(test_shape)

start1 = time.time()
for idx in xrange(nruns):
    result1 = test_func(a, b, c)
stop1 = time.time()

start2 = time.time()
for idx in xrange(nruns):
    result2 = nx.evaluate(test_func_str)
stop2 = time.time()

start3 = time.time()
for idx in xrange(nruns):
    result3 = test_func_chunked(a, b, c)
stop3 = time.time()

print "%s %s (average of %s runs)" % (test_shape, test_dtype, nruns)
print "Simple: ", (stop1-start1)/nruns
print "Numexpr: ", (stop2-start2)/nruns
print "Chunked: ", (stop3-start3)/nruns
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Francesc Alted wrote:

Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python.

Please pardon my ignorance, as I know this project has been around for a while. This looks very exciting, but either it's cumbersome, or I'm not understanding exactly what's being fixed. If you can accelerate evaluation, why not just integrate the faster math into numpy, rather than having two packages? Or is this something that is only an advantage when the expression is given as a string (and why is that the case)? It would be helpful if you could put the answer on your web page and in your standard release blurb in some compact form. I guess what I'm really looking for when I read one of those is a quick answer to the question "should I look into this?".

Well, there is a link on the project page to the Overview section of the wiki, but perhaps it is a bit hidden. I've added some blurb as you suggested to the main page and another link to the Overview wiki page. Hope that, by reading the new blurb, you can see why it accelerates expression evaluation with regard to NumPy. If not, tell me and I will try to come up with something more comprehensible.

I did see the overview. The addition you made is great, but it's so far down that many won't get to it. Even in its section, the meat of it is below three paragraphs that most users won't care about and many won't understand. I've posted some notes on writing intros in Developer_Zone. In the following, I've reordered the page to address the questions of potential users first, edited it a bit, and fixed the example to conform to our doc standards (and changed 128 to 256; hope that was right). See what you think...

** Description:

The numexpr package evaluates multiple-operator array expressions many times faster than numpy can. It accepts the expression as a string, analyzes it, rewrites it more efficiently, and compiles it to faster Python code on the fly. It's the next best thing to writing the expression in C and compiling it with an optimizing compiler (as scipy.weave does), but requires no compiler at runtime.

Using it is simple:

>>> import numpy as np
>>> import numexpr as ne
>>> a = np.arange(10)
>>> b = np.arange(0, 20, 2)
>>> c = ne.evaluate("2*a + 3*b")
>>> c
array([ 0,  8, 16, 24, 32, 40, 48, 56, 64, 72])

** Why does it work?

There are two extremes to array expression evaluation. Each binary operation can run separately over the array elements and return a temporary array. This is what NumPy does: 2*a + 3*b uses three temporary arrays as large as a or b. This strategy wastes memory (a problem if the arrays are large). It is also not a good use of CPU cache memory, because the results of 2*a and 3*b will not be in cache for the final addition if the arrays are large.

The other extreme is to loop over each element:

for i in xrange(len(a)):
    c[i] = 2*a[i] + 3*b[i]

This conserves memory and is good for the cache, but on each iteration Python must check the type of each operand and select the correct routine for each operation. All but the first such checks are wasted, as the input arrays are not changing.

numexpr uses an in-between approach. Arrays are handled in chunks (the first pass uses 256 elements). As Python code, it looks something like this:

for i in xrange(0, len(a), 256):
    r0 = a[i:i+256]
    r1 = b[i:i+256]
    multiply(r0, 2, r2)
    multiply(r1, 3, r3)
    add(r2, r3, r2)
    c[i:i+256] = r2

The 3-argument form of add() stores the result in the third argument, instead of allocating a new array. This achieves a good balance between cache and branch prediction. The virtual machine is written entirely in C, which makes it faster than the Python above.
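The chunked loop above is runnable almost as written; this sketch fills in the register allocation that the pseudocode leaves implicit (array sizes here are chosen to divide evenly into 256-element chunks):

```python
import numpy as np

a = np.arange(1024.0)
b = np.arange(1024.0)
c = np.empty_like(a)

# Preallocated "registers", reused for every chunk; nothing is
# allocated inside the loop.
r2 = np.empty(256)
r3 = np.empty(256)

for i in range(0, len(a), 256):
    r0 = a[i:i+256]
    r1 = b[i:i+256]
    np.multiply(r0, 2, r2)   # r2 = 2*r0, written in place
    np.multiply(r1, 3, r3)   # r3 = 3*r1
    np.add(r2, r3, r2)       # r2 = r2 + r3
    c[i:i+256] = r2

assert np.array_equal(c, 2*a + 3*b)
```

The three-argument ufunc calls are standard NumPy; numexpr's gain comes from issuing the equivalent operations from its C virtual machine rather than from the Python interpreter.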
** Supported Operators (unchanged)

** Supported Functions (unchanged, but capitalize 'F')

** Usage Notes (no need to repeat the example)

Numexpr's principal routine is:

evaluate(ex, local_dict=None, global_dict=None, **kwargs)

ex is a string forming an expression, like "2*a+3*b". The values for a and b will by default be taken from the calling function's frame (through the use of sys._getframe()). Alternatively, they can be specified using the local_dict or global_dict arguments, or passed as keyword arguments.

Expressions are cached, so reuse is fast. Arrays or scalars are allowed for the variables, which must be of type 8-bit boolean (bool), 32-bit signed integer (int), 64-bit signed integer (long), double-precision floating point number (float), 2x64-bit double-precision complex number (complex), or raw string of bytes (str). The arrays must all be the same size.

** Building (unchanged, but move down since it's standard and most users will only do this once, if ever)

** Implementation Notes (rest of current How It Works section)

** Credits

--jh--
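The frame-based name lookup described for evaluate() can be illustrated with a toy evaluator. This is a sketch of the mechanism only: toy_evaluate is a hypothetical helper, and numexpr's real evaluate() compiles the expression for its virtual machine instead of calling Python's eval().

```python
import sys

import numpy as np

def toy_evaluate(ex, local_dict=None):
    """Evaluate `ex`, pulling variable values from the caller's frame
    (the mechanism evaluate() uses via sys._getframe), with local_dict
    taking precedence -- a toy, not numexpr's compiled virtual machine."""
    frame = sys._getframe(1)
    names = dict(frame.f_globals)
    names.update(frame.f_locals)
    if local_dict is not None:
        names.update(local_dict)
    # Disable builtins so only the collected names are visible.
    return eval(ex, {'__builtins__': {}}, names)

a = np.arange(5)
b = np.arange(0, 10, 2)
print(toy_evaluate('2*a + 3*b'))        # a and b found in the caller's frame
print(toy_evaluate('2*a + 3*b', local_dict={'b': np.zeros(5, int)}))
```

The second call shows why local_dict is useful: it overrides whatever binding the calling frame happens to have.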
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Francesc Alted wrote:

Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python. The expected speed-ups of Numexpr with respect to NumPy are between 0.95x and 15x, with 3x or 4x being typical values. The strided and unaligned cases have been optimized too, so if the expression contains such arrays, the speed-up can increase significantly. Of course, you will need to operate with large arrays (typically larger than the cache size of your CPU) to see these improvements in performance.

Just recently I had a more detailed look at numexpr. Clever idea, easy to use! I can confirm a typical performance gain of 3x if you work on large arrays (100k entries).

I also gave a try to the vector math library (VML), contained in Intel's Math Kernel Library. This offers a fast implementation of mathematical functions operating on arrays. First I implemented a C extension providing new ufuncs. This gave me a big performance gain, e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x (6x) for division (no gain for add, sub, mul). The values in parentheses are given if I allow VML to use several threads and to employ both cores of my Intel Core2Duo computer. For large arrays (100M entries) this performance gain is reduced because of limited memory bandwidth.

At this point I stumbled across numexpr and modified it to use the VML functions. For sufficiently long and complex numerical expressions I could get the maximum performance also for large arrays. Together with VML, numexpr seems to be an extremely powerful way to get optimum performance. I would like to see numexpr extended to (optionally) make use of fast vectorized math functions. There is one "but": VML supports (at the moment) only math on contiguous arrays. At a first try I didn't understand how to enforce this limitation in numexpr.

I also gave a quick try to the equivalent vector math library, acml_mv of AMD. I only tried sin and log; they gave me the same performance (on an Intel processor!) as Intel's VML.

I was also playing around with the block size in numexpr. What is the rationale that led to the current block size of 128? Especially with VML, a larger block size of 4096 instead of 128 allowed the multithreading in VML to be used efficiently.

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.

I was missing support for single precision floats.

Great work!

Gregor
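On the contiguity limitation: the usual NumPy-side normalization is np.ascontiguousarray, which copies only when the input is not already C-contiguous. A hypothetical wrapper might look like this (as_vml_operand is an illustrative name, not part of numexpr or VML):

```python
import numpy as np

def as_vml_operand(a):
    """Return a C-contiguous array with the same values as `a`,
    copying only when needed -- the normalization a VML-backed
    kernel would have to perform on strided inputs."""
    return np.ascontiguousarray(a)

x = np.arange(100.0).reshape(10, 10)
strided = x[:, ::2]                       # non-contiguous view
print(strided.flags['C_CONTIGUOUS'])      # False
y = as_vml_operand(strided)
print(y.flags['C_CONTIGUOUS'])            # True

# Already-contiguous inputs pass through without a copy.
print(np.shares_memory(as_vml_operand(x), x))   # True
```

The copy is exactly the kind of temporary numexpr tries to avoid, which is why restricting VML calls to already-contiguous chunks inside the virtual machine is the more attractive option.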
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
On Friday 16 January 2009, j...@physics.ucf.edu wrote:

Hi Francesc,

Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python.

Please pardon my ignorance, as I know this project has been around for a while. This looks very exciting, but either it's cumbersome, or I'm not understanding exactly what's being fixed. If you can accelerate evaluation, why not just integrate the faster math into numpy, rather than having two packages? Or is this something that is only an advantage when the expression is given as a string (and why is that the case)? It would be helpful if you could put the answer on your web page and in your standard release blurb in some compact form. I guess what I'm really looking for when I read one of those is a quick answer to the question "should I look into this?".

Well, there is a link on the project page to the Overview section of the wiki, but perhaps it is a bit hidden. I've added some blurb as you suggested to the main page and another link to the Overview wiki page. Hope that, by reading the new blurb, you can see why it accelerates expression evaluation with regard to NumPy. If not, tell me and I will try to come up with something more comprehensible.

Right now, I'm not quite sure whether the problem you are solving is merely the case of expressions-in-strings, and there is no advantage for expressions-in-code, or whether your expressions-in-strings are faster than numpy's expressions-in-code. In either case, it would appear this would be a good addition to the numpy core, and it's past 1.0, so why keep it separate? Even if there is value in having a non-numpy version, is there not also value in accelerating numpy by default?

Having the expression encapsulated in a string has the advantage that you know exactly the part of the code that you want to parse and accelerate. Making NumPy understand the parts of Python code that can be accelerated sounds more like a true JIT for Python, and that is not trivial at all (although, with the advent of PyPy, some efforts in this direction are appearing [1]).

[1] http://www.enthought.com/~ischnell/paper.html

Cheers,

--
Francesc Alted
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Hi Francesc, this is a wonderful project ! I was just wondering if you would / could support single precision float arrays ? In 3+D image analysis we generally don't have enough memory to effort double precision; and we could save our selves lots of extra C coding (or Cython) coding of we could use numexpr ;-) Thanks, Sebastian Haase On Fri, Jan 16, 2009 at 5:04 PM, Francesc Alted fal...@pytables.org wrote: A Friday 16 January 2009, j...@physics.ucf.edu escrigué: Hi Francesc, Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like 3*a+4*b) are accelerated and use less memory than doing the same calculation in Python. Please pardon my ignorance as I know this project has been around for a while. It this looks very exciting, but either it's cumbersome, or I'm not understanding exactly what's being fixed. If you can accelerate evaluation, why not just integrate the faster math into numpy, rather than having two packages? Or is this something that is only an advantage when the expression is given as a string (and why is that the case)? It would be helpful if you could put the answer on your web page and in your standard release blurb in some compact form. I guess what I'm really looking for when I read one of those is a quick answer to the question should I look into this?. Well, there is a link in the project page to the Overview section of the wiki, but perhaps is a bit hidden. I've added some blurb as you suggested in the main page an another link to the Overview wiki page. Hope that, by reading the new blurb, you can see why it accelerates expression evaluation with regard to NumPy. If not, tell me and will try to come with something more comprehensible. Right now, I'm not quite sure whether the problem you are solving is merely the case of expressions-in-strings, and there is no advantage for expressions-in-code, or whether your expressions-in-strings are faster than numpy's expressions-in-code. 
In either case, it would appear this would be a good addition to the numpy core, and it's past 1.0, so why keep it separate? Even if there is value in having a non-numpy version, is there not also value in accelerating numpy by default? Having the expression encapsulated in a string has the advantage that you know exactly the part of the code that you want to parse and accelerate. Making NumPy understand the parts of Python code that can be accelerated sounds more like a true JIT for Python, and that is not trivial at all (although, with the advent of PyPy, some efforts in this direction are appearing [1]). [1] http://www.enthought.com/~ischnell/paper.html Cheers, -- Francesc Alted ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
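To make the string-based approach concrete, here is a minimal usage sketch (assuming numexpr is installed and importable as `numexpr`; the variable names are illustrative):

```python
import numpy as np
import numexpr as ne

a = np.arange(1e6)
b = np.arange(1e6)

# The expression string is compiled once to bytecode for numexpr's
# virtual machine and then evaluated block by block, avoiding the
# large temporaries that plain NumPy would allocate for 3*a and 4*b.
result = ne.evaluate("3*a + 4*b")
```

Inside the string, the names `a` and `b` are looked up in the caller's frame, which is exactly the "expression-in-a-string" encapsulation discussed above.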
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
A Friday 16 January 2009, Gregor Thalhammer wrote: I also gave a try to the vector math library (VML), contained in Intel's Math Kernel Library. This offers a fast implementation of mathematical functions, operating on arrays. First I implemented a C extension providing new ufuncs. This gave me a big performance gain, e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x (6x) for division (no gain for add, sub, mul). Wow, pretty nice speed-ups indeed! In fact I was thinking of including support for threading in Numexpr (I don't think it would be too difficult, but let's see). BTW, do you know how VML is able to achieve a speedup of 6x for a sin() function? I suppose this is because they are using SSE instructions, but are these also available for 64-bit double precision items? The values in parentheses are given if I allow VML to use several threads and to employ both cores of my Intel Core2Duo computer. For large arrays (100M entries) this performance gain is reduced because of limited memory bandwidth. At this point I stumbled across numexpr and modified it to use the VML functions. For sufficiently long and complex numerical expressions I could get the maximum performance also for large arrays. Cool. Together with VML, numexpr seems to be extremely powerful for getting optimum performance. I would like to see numexpr extended to (optionally) make use of fast vectorized math functions. Well, if you can provide the code, I'd be glad to include it in numexpr. The only requirement is that VML must be optional during the build of the package. There is one but: VML supports (at the moment) only math on contiguous arrays. At a first try I didn't understand how to enforce this limitation in numexpr. No problem.
At the end of numexpr/necompiler.py you will see some code like:

    # All the opcodes can deal with strided arrays directly as
    # long as they are unidimensional (strides in other
    # dimensions are dealt with within the extension), so we don't
    # need a copy for the strided case.
    if not b.flags.aligned:
        ...

which you can replace with something like:

    if VML_available and not b.flags.contiguous:
        b = b.copy()
    elif not b.flags.aligned:
        ...

That would be enough to ensure that all the arrays are contiguous when they hit numexpr's virtual machine. This being said, it is a shame that VML does not support strided/unaligned arrays. They are quite common beasts, especially when you work with heterogeneous arrays (aka record arrays). I also gave a quick try to the equivalent vector math library, acml_mv of AMD. I only tried sin and log; they gave me the same performance (on an Intel processor!) as Intel's VML. I was also playing around with the block size in numexpr. What is the rationale that led to the current block size of 128? Especially with VML, a larger block size of 4096 instead of 128 made it possible to use multithreading in VML efficiently. Experimentation. Back in 2006 David found that 128 was optimal for the processors available at that time. With Numexpr 1.1 my experiments show that 256 is a better value for current Core2 processors and most expressions in our benchmark bed (see the benchmarks/ directory); hence, 256 is the new value for the chunksize in 1.1. However, bear in mind that 256 has to be multiplied by the itemsize of each array, so the chunksize is currently 2048 bytes for 64-bit items (int64 or float64) and 4096 bytes for double precision complex arrays, which are probably the sizes that have to be compared with VML. Share your experience: let us know of any bugs, suggestions, gripes, kudos, etc. you may have. I was missing the support for single precision floats. Yeah.
This is because nobody has implemented it before, but it is completely doable. Great work! You are welcome! And thanks for the excellent feedback too! Hope we can have a VML-aware numexpr anytime soon ;-) Cheers, -- Francesc Alted
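The contiguity workaround Francesc sketches can be exercised with plain NumPy. In this sketch, `VML_available` is a hypothetical flag standing in for a real build-time option:

```python
import numpy as np

VML_available = True  # hypothetical flag; a real build would set this at compile time

base = np.arange(20.0)
b = base[::2]  # a strided view: every second element, not contiguous

# Mirror of the suggested necompiler.py logic: make a contiguous copy
# of strided operands so VML only ever sees contiguous buffers.
if VML_available and not b.flags.contiguous:
    b = b.copy()
elif not b.flags.aligned:
    b = b.copy()

assert b.flags.contiguous  # now safe to hand to a contiguous-only library
```

The copy happens at the Python level, before the operands reach the virtual machine, which is why the change lives in necompiler.py rather than in the C extension.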
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
A Friday 16 January 2009, Sebastian Haase wrote: Hi Francesc, this is a wonderful project! I was just wondering if you would / could support single precision float arrays? As I said before, it is doable, but I don't know if I will have enough time to implement this myself. In 3+D image analysis we generally don't have enough memory to afford double precision; and we could save ourselves lots of extra C (or Cython) coding if we could use numexpr ;-) Well, one of the ideas that I have been toying with for a long time is to give Numexpr the capability to work with PyTables disk-based objects. That way, you would be able to evaluate potentially complex expressions by using data that is completely on-disk. But this might be a completely different thing from what you are talking about. Cheers, -- Francesc Alted
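Sebastian's memory argument is easy to quantify: for a typical 3-D image volume, single precision halves the footprint. A quick check with NumPy:

```python
import numpy as np

shape = (256, 256, 256)  # a typical 3-D image volume

f8 = np.zeros(shape, dtype=np.float64)
f4 = np.zeros(shape, dtype=np.float32)

# float64 volume: 128 MiB; float32 volume: 64 MiB
print(f8.nbytes // 2**20, "MiB vs", f4.nbytes // 2**20, "MiB")
```

With several such volumes (plus temporaries) in flight at once, the factor of two quickly becomes the difference between fitting in RAM or not.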
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Francesc Alted wrote: A Friday 16 January 2009, j...@physics.ucf.edu wrote: Right now, I'm not quite sure whether the problem you are solving is merely the case of expressions-in-strings, and there is no advantage for expressions-in-code, or whether your expressions-in-strings are faster than numpy's expressions-in-code. In either case, it would appear this would be a good addition to the numpy core, and it's past 1.0, so why keep it separate? Even if there is value in having a non-numpy version, is there not also value in accelerating numpy by default? Having the expression encapsulated in a string has the advantage that you know exactly the part of the code that you want to parse and accelerate. Making NumPy understand the parts of Python code that can be accelerated sounds more like a true JIT for Python, and that is not trivial at all (although, with the advent of PyPy, some efforts in this direction are appearing [1]). A full compiler/JIT isn't needed; there's another route: one could use the Numexpr methodology together with a symbolic expression framework (like SymPy or the one in Sage), i.e. operator overloads and lazy expressions. Combining Numexpr with a symbolic manipulation engine would be very cool IMO. Unfortunately I don't have time myself (and I understand that you don't; I'm just mentioning it). Example using pseudo-Sage-like syntax:

    a = np.arange(bignum)
    b = np.arange(bignum)
    x, y = sage.var('x, y')
    expr = sage.integrate(x + y, x)
    z = expr(x=a, y=b)  # z = a**2/2 + a*b, but Numexpr-enabled

-- Dag Sverre
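A rough sketch of the same idea using SymPy as a stand-in for Sage: build a symbolic expression, integrate it, then compile it into a fast vectorized function. Here `lambdify` uses a NumPy backend; a Numexpr backend could in principle be plugged in at that step instead.

```python
import numpy as np
import sympy as sp

x, y = sp.symbols("x y")
expr = sp.integrate(x + y, x)  # symbolic result: x**2/2 + x*y

# Compile the symbolic expression into a vectorized numeric function.
f = sp.lambdify((x, y), expr, modules="numpy")

a = np.arange(5.0)
b = np.arange(5.0)
z = f(a, b)  # elementwise x**2/2 + x*y evaluated at x=a, y=b
```

The user never writes an expression string; the lazy symbolic object carries enough structure for a compiler like Numexpr to work on.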
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Francesc Alted wrote: A Friday 16 January 2009, Gregor Thalhammer wrote: I also gave a try to the vector math library (VML), contained in Intel's Math Kernel Library. This offers a fast implementation of mathematical functions, operating on arrays. First I implemented a C extension providing new ufuncs. This gave me a big performance gain, e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x (6x) for division (no gain for add, sub, mul). Wow, pretty nice speed-ups indeed! In fact I was thinking of including support for threading in Numexpr (I don't think it would be too difficult, but let's see). BTW, do you know how VML is able to achieve a speedup of 6x for a sin() function? I suppose this is because they are using SSE instructions, but are these also available for 64-bit double precision items? I am not an expert on SSE instructions, but to my knowledge there exists (in the Core 2 architecture) no SSE instruction to calculate the sin. But it seems to be possible to (approximately) calculate a sin with a couple of multiplication/addition instructions (and those exist in SSE for 64-bit float). Intel (and AMD) seem to use a cleverer, more efficiently implemented algorithm than the standard implementation. Well, if you can provide the code, I'd be glad to include it in numexpr. The only requirement is that VML must be optional during the build of the package. Yes, I will try to provide you with a polished version of my changes, making them optional. There is one but: VML supports (at the moment) only math on contiguous arrays. At a first try I didn't understand how to enforce this limitation in numexpr. No problem. At the end of numexpr/necompiler.py you will see some code like:

    # All the opcodes can deal with strided arrays directly as
    # long as they are unidimensional (strides in other
    # dimensions are dealt with within the extension), so we don't
    # need a copy for the strided case.
    if not b.flags.aligned:
        ...
which you can replace with something like:

    if VML_available and not b.flags.contiguous:
        b = b.copy()
    elif not b.flags.aligned:
        ...

That would be enough to ensure that all the arrays are contiguous when they hit numexpr's virtual machine. Ah, I see, that's not difficult. I thought copying was done in the virtual machine (didn't read all the code ...). This being said, it is a shame that VML does not support strided/unaligned arrays. They are quite common beasts, especially when you work with heterogeneous arrays (aka record arrays). I have the impression that you can already feel happy if these mathematical libraries support a C interface, not only Fortran. At least the Intel VML provides functions to pack/unpack strided arrays, which seem to work on a broader parameter range than specified (also zero or negative step sizes). I also gave a quick try to the equivalent vector math library, acml_mv of AMD. I only tried sin and log; they gave me the same performance (on an Intel processor!) as Intel's VML. I was also playing around with the block size in numexpr. What is the rationale that led to the current block size of 128? Especially with VML, a larger block size of 4096 instead of 128 made it possible to use multithreading in VML efficiently. Experimentation. Back in 2006 David found that 128 was optimal for the processors available at that time. With Numexpr 1.1 my experiments show that 256 is a better value for current Core2 processors and most expressions in our benchmark bed (see the benchmarks/ directory); hence, 256 is the new value for the chunksize in 1.1. However, bear in mind that 256 has to be multiplied by the itemsize of each array, so the chunksize is currently 2048 bytes for 64-bit items (int64 or float64) and 4096 bytes for double precision complex arrays, which are probably the sizes that have to be compared with VML. So the optimum block size might depend on the type of expression and on whether VML functions are used.
One question: the block size is set by a #define; would performance be significantly poorer if you used a variable instead? That would be more flexible, especially for testing and tuning. I was missing the support for single precision floats. Yeah. This is because nobody has implemented it before, but it is completely doable.
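The chunk-size arithmetic quoted above is easy to verify: 256 elements times the itemsize of each dtype gives the block size in bytes.

```python
import numpy as np

CHUNKSIZE = 256  # elements per block in Numexpr 1.1

# 256 elements * itemsize gives the per-operand block size in bytes:
# int64 and float64 -> 2048 bytes, complex128 -> 4096 bytes.
for name in ("int64", "float64", "complex128"):
    nbytes = CHUNKSIZE * np.dtype(name).itemsize
    print(name, nbytes)
```

This also makes Gregor's point concrete: a block size expressed in elements maps to different byte counts per dtype, so the sweet spot for VML's multithreading may differ from numexpr's default.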
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
Note that Apple has a similar library called vForce: http://developer.apple.com/ReleaseNotes/Performance/RN-vecLib/index.html http://developer.apple.com/documentation/Performance/Conceptual/vecLib/Reference/reference.html I think these libraries use several techniques and are not necessarily dependent on SSE. The Apple versions appear to support only float and double (no complex), and I don't see anything about strided arrays. At one point I thought there was talk of adding support for vForce into the respective ufuncs. I don't know if anybody followed up on that. On 2009-01-16, at 10:48, Francesc Alted wrote: Wow, pretty nice speed-ups indeed! In fact I was thinking of including support for threading in Numexpr (I don't think it would be too difficult, but let's see). BTW, do you know how VML is able to achieve a speedup of 6x for a sin() function? I suppose this is because they are using SSE instructions, but are these also available for 64-bit double precision items?
Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator
2009/1/16 Gregor Thalhammer gregor.thalham...@gmail.com: Francesc Alted wrote: Wow, pretty nice speed-ups indeed! In fact I was thinking of including support for threading in Numexpr (I don't think it would be too difficult, but let's see). BTW, do you know how VML is able to achieve a speedup of 6x for a sin() function? I suppose this is because they are using SSE instructions, but are these also available for 64-bit double precision items? I am not an expert on SSE instructions, but to my knowledge there exists (in the Core 2 architecture) no SSE instruction to calculate the sin. But it seems to be possible to (approximately) calculate a sin with a couple of multiplication/addition instructions (and those exist in SSE for 64-bit float). Intel (and AMD) seem to use a more clever algorithm. Here is the lib I use for the SSE implementations of the transcendental functions: http://gruntthepeon.free.fr/ssemath/ (only single precision floats though). -- Olivier