Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-21 Thread Francesc Alted
On Tuesday 20 January 2009, Andrew Collette wrote:
 Works much, much better with the current svn version. :) Numexpr now
 outperforms everything except the simple technique, and then only
 for small data sets.

Correct.  This is because of the cost of parsing the expression and 
initializing the virtual machine.  However, as soon as the sizes of the 
operands exceed the cache of your processor, you start to see the 
improvement in performance.

 Along the lines you mentioned I noticed that simply changing from a
 shape of (100*100*100,) to (100, 100, 100) results in nearly a factor
 of 2 worse performance, a factor which seems constant when changing
 the size of the data set.

Sorry, but I cannot reproduce this.  When using the expression:

63 + (a*b) + (c**2) + b

I get on my machine (co...@3 GHz, running openSUSE Linux 11.1):

(100*100*100,) f8 (average of 10 runs)
Simple:  0.0278068065643
Numexpr:  0.00839750766754
Chunked:  0.0266514062881

(100, 100, 100) f8 (average of 10 runs)
Simple:  0.0277318000793
Numexpr:  0.00851640701294
Chunked:  0.0346593856812

and these are the expected results (i.e. no change in performance due to 
multidimensional arrays).  Even for larger arrays, I don't see anything 
unexpected:

(100*100*100*10,) f8 (average of 10 runs)
Simple:  0.334054994583
Numexpr:  0.110022115707
Chunked:  0.29678030014

(100, 100, 100, 10) f8 (average of 10 runs)
Simple:  0.339299607277
Numexpr:  0.111632704735
Chunked:  0.375299096107

Can you tell us which platform you are using?

 Is this related to the way numexpr handles 
 broadcasting rules?  It would seem the memory contents should be
 identical for these two cases.

 Andrew

 On Tue, Jan 20, 2009 at 6:13 AM, Francesc Alted fal...@pytables.org 
wrote:
  [clip]



-- 
Francesc Alted

Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-21 Thread Andrew Collette
Hi,

I get identical results for both shapes now; I manually removed the
numexpr-1.1.1.dev-py2.5-linux-i686.egg folder in site-packages and
reinstalled.  I suppose there must have been a stale set of files
somewhere.

Andrew Collette

On Wed, Jan 21, 2009 at 3:41 AM, Francesc Alted fal...@pytables.org wrote:
 [clip]

Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-20 Thread Francesc Alted
On Tuesday 20 January 2009, Andrew Collette wrote:
 Hi Francesc,

 Looks like a cool project!  However, I'm not able to achieve the
 advertised speed-ups.  I wrote a simple script to try three
 approaches to this kind of problem:

 1) Native Python code (i.e. will try to do everything at once using
 temp arrays)
 2) Straightforward numexpr evaluation
 3) Simple chunked evaluation using array.flat views.  (This solves
 the memory problem and allows the use of arbitrary Python
 expressions).

 I've attached the script; here's the output for the expression
 63 + (a*b) + (c**2) + sin(b)
 along with a few combinations of shapes/dtypes.  As expected, using
 anything other than f8 (double) results in a performance penalty.
 Surprisingly, it seems that using chunks via array.flat results in
 similar performance for f8, and even better performance for other
 dtypes.
[clip]

Well, there were two issues there.  The first one is that when 
transcendental functions are used (like sin() above), the bottleneck is 
on the CPU instead of memory bandwidth, so numexpr speedups are not as 
high as usual.  The other issue was an actual bug in the numexpr code 
that forced a copy of all multidimensional arrays (I normally only use 
one-dimensional arrays for doing benchmarks).  This has been fixed in 
trunk (r39).
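
A rough sketch of the first effect (operand names and sizes are made 
up): sin() adds CPU work per element, so the relative gain over NumPy 
shrinks even though numexpr still avoids the temporaries:

    # Compare a memory-bound expression with a CPU-bound one.
    import time
    import numpy as np
    import numexpr as ne

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)

    def avg_time(func, nruns=10):
        t0 = time.time()
        for i in xrange(nruns):
            func()
        return (time.time() - t0) / nruns

    print "numpy   a*b + b     :", avg_time(lambda: a*b + b)
    print "numexpr a*b + b     :", avg_time(lambda: ne.evaluate("a*b + b"))
    print "numpy   a*b + sin(b):", avg_time(lambda: a*b + np.sin(b))
    print "numexpr a*b + sin(b):", avg_time(lambda: ne.evaluate("a*b + sin(b)"))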

So, with the fix on, the timings are:

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.0426136016846
Numexpr:  0.11350851059
Chunked:  0.0635252952576
(100, 100, 100) f8 (average of 10 runs)
Simple:  0.119254398346
Numexpr:  0.10092959404
Chunked:  0.128384995461

The speed-up is now a mere 20% (for f8), but at least it is not slower.  
With the patches for using Intel's VML that Gregor recently contributed, 
the acceleration is a bit better:

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.0417867898941
Numexpr:  0.0944641113281
Chunked:  0.0636183023453
(100, 100, 100) f8 (average of 10 runs)
Simple:  0.120059680939
Numexpr:  0.0832288980484
Chunked:  0.128114104271

i.e. the speed-up is around 45% (for f8).

Moreover, if I get rid of the sin() function and use the expression:

63 + (a*b) + (c**2) + b

I get:

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.0119329929352
Numexpr:  0.0198570966721
Chunked:  0.0338240146637
(100, 100, 100) f8 (average of 10 runs)
Simple:  0.0255623102188
Numexpr:  0.00832500457764
Chunked:  0.0340095996857

which has a 3.1x speedup (for f8).

 FYI, the current tar file (1.1-1) has a glitch related to the VERSION
 file; I added it to the bug report at Google Code.

Thanks.  I will focus on that ASAP.  Mmm, it seems like there is enough 
stuff for another release of numexpr.  I'll try to do it soon.

Cheers,

-- 
Francesc Alted


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-20 Thread Andrew Collette
Works much, much better with the current svn version. :) Numexpr now
outperforms everything except the simple technique, and then only
for small data sets.

Along the lines you mentioned I noticed that simply changing from a
shape of (100*100*100,) to (100, 100, 100) results in nearly a factor
of 2 worse performance, a factor which seems constant when changing
the size of the data set.  Is this related to the way numexpr handles
broadcasting rules?  It would seem the memory contents should be
identical for these two cases.
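
For what it's worth, a quick sanity check (the index is arbitrary) 
suggests the two layouts really do describe the same memory:

    import numpy as np

    flat = np.arange(100*100*100, dtype='f8')
    cube = flat.reshape(100, 100, 100)   # a view, not a copy

    print flat.flags['C_CONTIGUOUS'], cube.flags['C_CONTIGUOUS']  # True True
    print np.may_share_memory(flat, cube)                         # True
    print flat[12345] == cube[1, 23, 45]                          # True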

Andrew

On Tue, Jan 20, 2009 at 6:13 AM, Francesc Alted fal...@pytables.org wrote:
 [clip]



Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-19 Thread jh
Thanks!  I think this will help the package attract a lot of users.

A couple of housekeeping things:

on http://code.google.com/p/numexpr:

  "What it is?" -> "What is it?", or "What it is" (no question mark)

on http://code.google.com/p/numexpr/wiki/Overview:

  The last example got incorporated as straight text somehow.

In firefox, the first code example runs into the pastel boxes on the
right for modest-width browsers.  This is a common problem with
firefox, but I think it comes from improper HTML code that IE somehow
deals with, rather than non-standard behavior in firefox.

One thing I'd add is a benchmark example against numpy.  Make it
simple, so that people can copy and modify the benchmark code to test
their own performance improvements.
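
Something as small as this would do (a sketch; the expression and sizes 
are placeholders meant to be edited):

    # Minimal numpy-vs-numexpr benchmark; each number is the total time
    # of 10 evaluations.
    import timeit

    setup = ("import numpy as np; import numexpr as ne; "
             "a = np.random.rand(1000000); b = np.random.rand(1000000)")

    print "numpy  :", min(timeit.repeat("2*a + 3*b", setup, number=10))
    print "numexpr:", min(timeit.repeat("ne.evaluate('2*a + 3*b')", setup,
                                        number=10))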

I added an entry for it on the Topical Software list.  Please check it
out and modify it as you see fit.

--jh--


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-19 Thread Andrew Collette
Hi Francesc,

Looks like a cool project!  However, I'm not able to achieve the
advertised speed-ups.  I wrote a simple script to try three approaches
to this kind of problem:

1) Native Python code (i.e. will try to do everything at once using temp arrays)
2) Straightforward numexpr evaluation
3) Simple chunked evaluation using array.flat views.  (This solves
the memory problem and allows the use of arbitrary Python
expressions).

I've attached the script; here's the output for the expression
63 + (a*b) + (c**2) + sin(b)
along with a few combinations of shapes/dtypes.  As expected, using
anything other than f8 (double) results in a performance penalty.
Surprisingly, it seems that using chunks via array.flat results in
similar performance for f8, and even better performance for other
dtypes.

(100, 100, 100) f4 (average of 10 runs)
Simple:  0.155238199234
Numexpr:  0.278440499306
Chunked:  0.166213512421

(100, 100, 100) f8 (average of 10 runs)
Simple:  0.241649699211
Numexpr:  0.192837905884
Chunked:  0.183888602257

(100, 100, 100, 10) f4 (average of 10 runs)
Simple:  1.56741549969
Numexpr:  3.40679829121
Chunked:  1.83729870319

(100, 100, 100) i4 (average of 10 runs)
Simple:  0.206279683113
Numexpr:  0.210431909561
Chunked:  0.182894086838

FYI, the current tar file (1.1-1) has a glitch related to the VERSION
file; I added it to the bug report at Google Code.

Andrew Collette

On Fri, Jan 16, 2009 at 4:00 AM, Francesc Alted fal...@pytables.org wrote:

  ======================
  Announcing Numexpr 1.1
  ======================

 Numexpr is a fast numerical expression evaluator for NumPy.  With it,
 expressions that operate on arrays (like 3*a+4*b) are accelerated
 and use less memory than doing the same calculation in Python.

 The expected speed-ups of Numexpr with respect to NumPy are between
 0.95x and 15x, with 3x or 4x being typical values.  The strided and
 unaligned case has been optimized too, so if the expression contains
 such arrays, the speed-up can increase significantly.  Of course, you
 will need to operate with large arrays (typically larger than the
 cache size of your CPU) to see these improvements in performance.

 This release is mainly intended to sync up with some of the
 improvements made in the Numexpr version integrated in PyTables.
 So, this standalone version of Numexpr will benefit from the
 well-tested PyTables version that has been in production for more
 than a year now.

 In case you want to know more in detail what has changed in this
 version, have a look at ``RELEASE_NOTES.txt`` in the tarball.


 Where can I find Numexpr?
 =========================

 The project is hosted at Google Code at:

 http://code.google.com/p/numexpr/


 Share your experience
 =====================

 Let us know of any bugs, suggestions, gripes, kudos, etc. you may
 have.


 Enjoy!

 --
 Francesc Alted
 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://projects.scipy.org/mailman/listinfo/numpy-discussion



import numpy as np
import numexpr as nx
import time

test_shape = (100,100,100)   # All 3 arrays have this shape
test_dtype = 'i4'
nruns = 10   # Ensemble for timing

test_size = np.product(test_shape)

def chunkify(chunksize):
    """ Very stupid chunk vectorizer which keeps memory use down.
        This version requires all inputs to have the same number of elements,
        although it shouldn't be that hard to implement simple broadcasting.
    """

    def chunkifier(func):

        def wrap(*args):

            assert len(args) > 0
            assert all(len(a.flat) == len(args[0].flat) for a in args)

            nelements = len(args[0].flat)
            nchunks, remain = divmod(nelements, chunksize)

            out = np.ndarray(args[0].shape)

            for start in xrange(0, nelements, chunksize):
                #print start
                stop = start + chunksize
                if start + chunksize > nelements:
                    stop = nelements
                iargs = tuple(a.flat[start:stop] for a in args)
                out.flat[start:stop] = func(*iargs)
            return out

        return wrap

    return chunkifier

test_func_str = "63 + (a*b) + (c**2) + sin(b)"

def test_func(a, b, c):
    return 63 + (a*b) + (c**2) + np.sin(b)

test_func_chunked = chunkify(100*100)(test_func)

# The actual data we'll use
a = np.arange(test_size, dtype=test_dtype).reshape(test_shape)
b = np.arange(test_size, dtype=test_dtype).reshape(test_shape)
c = np.arange(test_size, dtype=test_dtype).reshape(test_shape)


start1 = time.time()
for idx in xrange(nruns):
    result1 = test_func(a, b, c)
stop1 = time.time()

start2 = time.time()
for idx in xrange(nruns):
    result2 = nx.evaluate(test_func_str)
stop2 = time.time()

start3 = time.time()
for idx in xrange(nruns):
    result3 = test_func_chunked(a, b, c)
stop3 = time.time()

print "%s %s (average of %s runs)" % (test_shape, test_dtype, nruns)
# (the final print lines were truncated in the archive; reconstructed
#  here to match the output format shown above)
print "Simple: ", (stop1 - start1)/nruns
print "Numexpr: ", (stop2 - start2)/nruns
print "Chunked: ", (stop3 - start3)/nruns

Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-18 Thread jh
Francesc Alted wrote:

   Numexpr is a fast numerical expression evaluator for NumPy.  With
   it, expressions that operate on arrays (like 3*a+4*b) are
   accelerated and use less memory than doing the same calculation in
   Python.

  Please pardon my ignorance as I know this project has been around for
  a while.  This looks very exciting, but either it's cumbersome, or
  I'm not understanding exactly what's being fixed.  If you can
  accelerate evaluation, why not just integrate the faster math into
  numpy, rather than having two packages?  Or is this something that is
  only an advantage when the expression is given as a string (and why
  is that the case)?  It would be helpful if you could put the answer
  on your web page and in your standard release blurb in some compact
  form. I guess what I'm really looking for when I read one of those is
  a quick answer to the question "should I look into this?".

 Well, there is a link in the project page to the Overview section of 
 the wiki, but perhaps it is a bit hidden.  I've added some blurb as you 
 suggested on the main page and another link to the Overview wiki page.
 I hope that, by reading the new blurb, you can see why it accelerates 
 expression evaluation with regard to NumPy.  If not, tell me and I will 
 try to come up with something more comprehensible.

I did see the overview.  The addition you made is great but it's so
far down that many won't get to it.  Even in its section, the meat of
it is below three paragraphs that most users won't care about and many
won't understand.  I've posted some notes on writing intros in
Developer_Zone.

In the following, I've reordered the page to address the questions of
potential users first, edited it a bit, and fixed the example to
conform to our doc standards (and changed 128 -> 256; hope that was
right).  See what you think...

** Description:

The numexpr package evaluates multiple-operator array expressions many
times faster than numpy can.  It accepts the expression as a string,
analyzes it, rewrites it more efficiently, and compiles it to faster
Python code on the fly.  It's the next best thing to writing the
expression in C and compiling it with an optimizing compiler (as
scipy.weave does), but requires no compiler at runtime.

Using it is simple:

 >>> import numpy as np
 >>> import numexpr as ne
 >>> a = np.arange(10)
 >>> b = np.arange(0, 20, 2)
 >>> c = ne.evaluate("2*a+3*b")
 >>> c
 array([ 0,  8, 16, 24, 32, 40, 48, 56, 64, 72])

** Why does it work?

There are two extremes to array expression evaluation.  Each binary
operation can run separately over the array elements and return a
temporary array.  This is what NumPy does: 2*a + 3*b uses three
temporary arrays as large as a or b.  This strategy wastes memory (a
problem if the arrays are large).  It is also not a good use of CPU
cache memory because the results of 2*a and 3*b will not be in cache
for the final addition if the arrays are large.

The other extreme is to loop over each element:

for i in xrange(len(a)):
c[i] = 2*a[i] + 3*b[i]

This conserves memory and is good for the cache, but on each iteration
Python must check the type of each operand and select the correct
routine for each operation.  All but the first such checks are wasted,
as the input arrays are not changing.

numexpr uses an in-between approach.  Arrays are handled in chunks
(the first pass uses 256 elements).  As Python code, it looks
something like this:

for i in xrange(0, len(a), 256):
r0 = a[i:i+256]
r1 = b[i:i+256]
multiply(r0, 2, r2)
multiply(r1, 3, r3)
add(r2, r3, r2)
c[i:i+256] = r2

The 3-argument form of add() stores the result in the third argument,
instead of allocating a new array.  This achieves a good balance
between cache and branch prediction.  The virtual machine is written
entirely in C, which makes it faster than the Python above.

** Supported Operators (unchanged)

** Supported Functions (unchanged, but capitalize 'F')

** Usage Notes (no need to repeat the example)

Numexpr's principal routine is:

evaluate(ex, local_dict=None, global_dict=None, **kwargs)

ex is a string forming an expression, like "2*a+3*b".  The values for
a and b will by default be taken from the calling function's frame
(through the use of sys._getframe()).  Alternatively, they can be
specified using the local_dict or global_dict arguments, or passed as
keyword arguments.

Expressions are cached, so reuse is fast.  Arrays or scalars are
allowed for the variables, which must be of type 8-bit boolean (bool),
32-bit signed integer (int), 64-bit signed integer (long),
double-precision floating point number (float), 2x64-bit
double-precision complex number (complex) or raw string of bytes
(str).  The arrays must all be the same size.
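
For instance (a small sketch of the two lookup modes; local_dict is the 
explicit form, the frame-based lookup is the default):

    import numpy as np
    import numexpr as ne

    a = np.arange(10)
    b = np.arange(0, 20, 2)

    c1 = ne.evaluate("2*a+3*b")                    # names found in this frame
    c2 = ne.evaluate("2*x+3*y",
                     local_dict={'x': a, 'y': b})  # explicit name mapping
    print (c1 == c2).all()                         # True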

** Building (unchanged, but move down since it's standard and most
   users will only do this once, if ever)

** Implementation Notes (rest of current How It Works section)

** Credits

--jh--

Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Gregor Thalhammer
Francesc Alted wrote:
 Numexpr is a fast numerical expression evaluator for NumPy.  With it,
 expressions that operate on arrays (like 3*a+4*b) are accelerated
 and use less memory than doing the same calculation in Python.

 The expected speed-ups of Numexpr with respect to NumPy are between
 0.95x and 15x, with 3x or 4x being typical values.  The strided and
 unaligned case has been optimized too, so if the expression contains
 such arrays, the speed-up can increase significantly.  Of course, you
 will need to operate with large arrays (typically larger than the
 cache size of your CPU) to see these improvements in performance.

   
Just recently I had a more detailed look at numexpr. Clever idea, easy 
to use! I can confirm a typical performance gain of 3x if you work on 
large arrays (100k entries).

I also gave a try to the vector math library (VML), contained in Intel's 
Math Kernel Library. This offers a fast implementation of mathematical 
functions, operating on arrays. First I implemented a C extension, 
providing new ufuncs. This gave me a big performance gain, e.g., 2.3x 
(5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x (6x) for 
division (no gain for add, sub, mul). The values in parentheses are 
given if I allow VML to use several threads and to employ both cores of 
my Intel Core2Duo computer. For large arrays (100M entries) this 
performance gain is reduced because of limited memory bandwidth. At this 
point I stumbled across numexpr and modified it to use the VML 
functions. For sufficiently long and complex numerical expressions I 
could get the maximum performance also for large arrays.  Together with 
VML, numexpr seems to be extremely powerful for getting optimum 
performance. I would like to see numexpr extended to (optionally) make 
use of fast vectorized math functions. There is one but: VML supports 
(at the moment) only math on contiguous arrays. At a first try I didn't 
understand how to enforce this limitation in numexpr. I also gave a 
quick try to the equivalent vector math library, acml_mv of AMD. I only 
tried sin and log; they gave me the same performance (on an Intel 
processor!) as Intel's VML.

I was also playing around with the block size in numexpr. What is the 
rationale that led to the current block size of 128? Especially with 
VML, a larger block size of 4096 instead of 128 allowed efficient use 
of multithreading in VML.
 Share your experience
 =====================

 Let us know of any bugs, suggestions, gripes, kudos, etc. you may
 have.

I was missing the support for single precision floats.

Great work!

Gregor


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Francesc Alted
On Friday 16 January 2009, j...@physics.ucf.edu wrote:
 Hi Francesc,

  Numexpr is a fast numerical expression evaluator for NumPy.  With
  it, expressions that operate on arrays (like 3*a+4*b) are
  accelerated and use less memory than doing the same calculation in
  Python.

 Please pardon my ignorance as I know this project has been around for
 a while.  This looks very exciting, but either it's cumbersome, or
 I'm not understanding exactly what's being fixed.  If you can
 accelerate evaluation, why not just integrate the faster math into
 numpy, rather than having two packages?  Or is this something that is
 only an advantage when the expression is given as a string (and why
 is that the case)?  It would be helpful if you could put the answer
 on your web page and in your standard release blurb in some compact
 form. I guess what I'm really looking for when I read one of those is
 a quick answer to the question "should I look into this?".

Well, there is a link in the project page to the Overview section of 
the wiki, but perhaps it is a bit hidden.  I've added some blurb as you 
suggested on the main page and another link to the Overview wiki page.
I hope that, by reading the new blurb, you can see why it accelerates 
expression evaluation with regard to NumPy.  If not, tell me and I will 
try to come up with something more comprehensible.

 Right 
 now, I'm not quite sure whether the problem you are solving is merely
 the case of expressions-in-strings, and there is no advantage for
 expressions-in-code, or whether your expressions-in-strings are
 faster than numpy's expressions-in-code. In either case, it would 
 appear this would be a good addition to the numpy core, and it's past
 1.0, so why keep it separate?  Even if there is value in having a
 non-numpy version, is there not also value in accelerating numpy by
 default?

Having the expression encapsulated in a string has the advantage that 
you know exactly the part of the code that you want to parse and 
accelerate.  Making NumPy understand parts of the Python code that 
can be accelerated sounds more like a true JIT for Python, and this is 
something that is not trivial at all (although, with the advent of PyPy, 
some efforts are appearing in this direction [1]).

[1] http://www.enthought.com/~ischnell/paper.html
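
To illustrate the point: because the expression arrives as a string, the 
library sees the whole computation up front and can rewrite it before 
running it.  A toy demonstration with the standard library (this is not 
numexpr's actual parser, which lives in necompiler.py):

    import ast

    tree = ast.parse("2*a + 3*b", mode='eval')
    print ast.dump(tree.body)   # the whole expression tree, available at once

By contrast, by the time NumPy's __mul__ or __add__ runs, each operator 
only sees its own two operands; the surrounding expression is gone.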

Cheers,

-- 
Francesc Alted


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Sebastian Haase
Hi Francesc,
this is a wonderful project!  I was just wondering if you would /
could support single-precision float arrays?
In 3+D image analysis we generally don't have enough memory to afford
double precision; and we could save ourselves lots of extra C (or
Cython) coding if we could use numexpr ;-)

Thanks,
Sebastian Haase



On Fri, Jan 16, 2009 at 5:04 PM, Francesc Alted fal...@pytables.org wrote:
 [clip]


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Francesc Alted
On Friday 16 January 2009, Gregor Thalhammer wrote:
 I also gave a try to the vector math library (VML), contained in
 Intel's Math Kernel Library. This offers a fast implementation of
 mathematical functions, operating on arrays. First I implemented a C
 extension, providing new ufuncs. This gave me a big performance gain,
 e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x
 (6x) for division (no gain for add, sub, mul).

Wow, pretty nice speed-ups indeed!  In fact I was thinking of including 
support for threading in Numexpr (I don't think it would be too 
difficult, but let's see).  BTW, do you know how VML is able to achieve 
a speedup of 6x for a sin() function?  I suppose this is because they 
are using SSE instructions, but are these also available for 64-bit 
double-precision items?

 The values in
 parentheses are given if I allow VML to use several threads and to
 employ both cores of my Intel Core2Duo computer. For large arrays
 (100M entries) this performance gain is reduced because of limited
 memory bandwidth. At this point I stumbled across numexpr and
 modified it to use the VML functions. For sufficiently long and
 complex numerical expressions I could get the maximum performance
 also for large arrays.

Cool.

 Together with VML, numexpr seems to be
 extremely powerful for getting optimum performance. I would like to
 see numexpr extended to (optionally) make use of fast vectorized math
 functions.

Well, if you can provide the code, I'd be glad to include it in numexpr.  
The only requirement is that VML must be optional during the build 
of the package.

 There is one but: VML supports (at the moment) only math 
 on contiguous arrays. At a first try I didn't understand how to
 enforce this limitation in numexpr.

No problem.  At the end of numexpr/necompiler.py you will see some 
code like:

    # All the opcodes can deal with strided arrays directly as
    # long as they are unidimensional (strides in other
    # dimensions are dealt within the extension), so we don't
    # need a copy for the strided case.
    if not b.flags.aligned:
        ...

which you can replace with something like:

    # need a copy for the strided case.
    if VML_available and not b.flags.contiguous:
        b = b.copy()
    elif not b.flags.aligned:
        ...

That would be enough for ensuring that all the arrays are contiguous 
when they hit numexpr's virtual machine.

That being said, it is a shame that VML does not have support for 
strided/unaligned arrays.  They are quite common beasts, especially 
when you work with heterogeneous arrays (aka record arrays).

 I also gave a quick try to the
 equivalent vector math library, acml_mv of AMD. I only tried sin and
 log; they gave me the same performance (on an Intel processor!) as
 Intel's VML.

 I was also playing around with the block size in numexpr. What is
 the rationale that led to the current block size of 128? Especially
 with VML, a larger block size of 4096 instead of 128 allowed
 efficient use of multithreading in VML.

Experimentation.  Back in 2006 David found that 128 was optimal for the 
processors available at that time.  With Numexpr 1.1, my experiments 
show that 256 is a better value for current Core2 processors and most 
expressions in our benchmark bed (see the benchmarks/ directory); hence, 
256 is the new value for the chunksize in 1.1.  However, bear in mind 
that 256 has to be multiplied by the itemsize of each array, so the 
chunksize is currently 2048 bytes for 64-bit items (int64 or float64) 
and 4096 for double-precision complex arrays, which are probably the 
sizes that have to be compared with VML.
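
In bytes, that is (a two-line check of the arithmetic):

    for dtype, itemsize in (('int64', 8), ('float64', 8), ('complex128', 16)):
        print dtype, 256 * itemsize, "bytes per chunk"   # 2048, 2048, 4096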


  Share your experience
  =====================
 
  Let us know of any bugs, suggestions, gripes, kudos, etc. you may
  have.

 I was missing the support for single precision floats.

Yeah.  This is because nobody has implemented it before, but it is 
completely doable.

 Great work!

You are welcome!  And thanks for the excellent feedback too!  Hope we can 
have a VML-aware numexpr anytime soon ;-)

Cheers,

-- 
Francesc Alted


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Francesc Alted
On Friday 16 January 2009, Sebastian Haase wrote:
 Hi Francesc,
 this is a wonderful project ! I was just wondering if you would /
 could support single precision float arrays ?

As I said before, it is doable, but I don't know if I will have enough 
time to implement this myself.

 In 3+D image analysis we generally don't have enough memory to afford
 double precision; and we could save ourselves lots of extra C (or
 Cython) coding if we could use numexpr ;-)

Well, one of the ideas that I have been toying with for a long time is 
to give Numexpr the capability to work with PyTables disk-based objects.  
That way, you would be able to evaluate potentially complex expressions 
by using data that is completely on-disk.  But this might be a 
completely different thing from what you are talking about.
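
A rough sketch of the idea, using plain numpy memory-mapped files 
instead of PyTables objects (the file names are made up, and 'a.dat' 
and 'b.dat' are assumed to already hold N float64 values each):

    # Evaluate an expression chunk by chunk so that only a small window
    # of the on-disk operands is resident in memory at any time.
    import numpy as np
    import numexpr as ne

    N, chunk = 10**7, 256*1024
    a = np.memmap('a.dat', dtype='f8', mode='r', shape=(N,))
    b = np.memmap('b.dat', dtype='f8', mode='r', shape=(N,))
    out = np.memmap('out.dat', dtype='f8', mode='w+', shape=(N,))

    for start in xrange(0, N, chunk):
        stop = min(start + chunk, N)
        x, y = a[start:stop], b[start:stop]
        out[start:stop] = ne.evaluate("2*x + 3*y")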

Cheers,

-- 
Francesc Alted


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Dag Sverre Seljebotn
Francesc Alted wrote:
 On Friday 16 January 2009, j...@physics.ucf.edu wrote:
 Right 
 now, I'm not quite sure whether the problem you are solving is merely
 the case of expressions-in-strings, and there is no advantage for
 expressions-in-code, or whether your expressions-in-strings are
 faster than numpy's expressions-in-code. In either case, it would 
 appear this would be a good addition to the numpy core, and it's past
 1.0, so why keep it separate?  Even if there is value in having a
 non-numpy version, is there not also value in accelerating numpy by
 default?
 
 Having the expression encapsulated in a string has the advantage that 
 you know exactly the part of the code that you want to parse and 
 accelerate.  Making NumPy understand parts of the Python code that 
 can be accelerated sounds more like a true JIT for Python, and this is 
 something that is not trivial at all (although, with the advent of PyPy, 
 some efforts are appearing in this direction [1]).

A full compiler/JIT isn't needed, there's another route:

One could use the Numexpr methodology together with a symbolic 
expression framework (like SymPy or the one in Sage). I.e. operator 
overloads and lazy expressions.

Combining NumExpr with a symbolic manipulation engine would be very cool 
IMO. Unfortunately I don't have time myself (and I understand that you 
don't, I'm just mentioning it).

Example using pseudo-Sage-like syntax:

a = np.arange(bignum)
b = np.arange(bignum)
x, y = sage.var("x, y")
expr = sage.integrate(x + y, x)
z = expr(x=a, y=b) # z = a**2/2 + b*a, but Numexpr-enabled
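
And even without a full symbolic engine, the operator-overload half of 
the idea is small enough to sketch directly (a toy with only + and * 
implemented):

    import numpy as np
    import numexpr as ne

    class Lazy(object):
        """Accumulates an expression string instead of computing."""
        def __init__(self, expr):
            self.expr = expr
        def __add__(self, other):
            return Lazy("(%s + %s)" % (self.expr, _expr(other)))
        def __mul__(self, other):
            return Lazy("(%s * %s)" % (self.expr, _expr(other)))
        def __call__(self, **arrays):
            return ne.evaluate(self.expr, local_dict=arrays)

    def _expr(obj):
        return obj.expr if isinstance(obj, Lazy) else repr(obj)

    x, y = Lazy("x"), Lazy("y")
    z = (x + y) * 2                            # no work done yet
    print z.expr                               # ((x + y) * 2)
    print z(x=np.arange(3), y=np.arange(3))    # [0 4 8]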

-- 
Dag Sverre


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Gregor Thalhammer
Francesc Alted wrote:
 On Friday 16 January 2009, Gregor Thalhammer wrote:
 I also gave a try to the vector math library (VML), contained in
 Intel's Math Kernel Library. This offers a fast implementation of
 mathematical functions, operating on arrays. First I implemented a C
 extension, providing new ufuncs. This gave me a big performance gain,
 e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x
 (6x) for division (no gain for add, sub, mul).

 Wow, pretty nice speed-ups indeed!  In fact I was thinking of including
 support for threading in Numexpr (I don't think it would be too
 difficult, but let's see).  BTW, do you know how VML is able to achieve
 a speedup of 6x for a sin() function?  I suppose this is because they
 are using SSE instructions, but are these also available for 64-bit
 double-precision items?

I am not an expert on SSE instructions, but to my knowledge there exists 
(in the Core 2 architecture) no SSE instruction to calculate the sin. 
But it seems to be possible to (approximately) calculate a sin with a 
couple of multiplication/addition instructions (and those exist in SSE 
for 64-bit floats). Intel (and AMD) seem to use a more clever algorithm, 
more efficiently implemented than the standard one.
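
A toy version of that multiply/add idea in numpy (plain Taylor 
coefficients for sin() on [-pi/2, pi/2], evaluated in Horner form; the 
vendor libraries use better, minimax-fitted coefficients and wider 
argument ranges):

    import numpy as np

    def sin_poly(x):
        # x - x**3/6 + x**5/120 - x**7/5040, as repeated multiply/adds
        x2 = x * x
        return x * (1.0 + x2 * (-1.0/6 + x2 * (1.0/120 + x2 * (-1.0/5040))))

    x = np.linspace(-np.pi/2, np.pi/2, 1001)
    print np.max(np.abs(sin_poly(x) - np.sin(x)))   # worst case ~2e-4
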
 Well, if you can provide the code, I'd be glad to include it in numexpr.
 The only requirement is that VML must be optional during the build
 of the package.

Yes, I will try to provide you with a polished version of my changes, 
making them optional.
 There is one but: VML supports (at the moment) only math 
 on contiguous arrays. At a first try I didn't understand how to
 enforce this limitation in numexpr.
 

 No problem.  At the end of numexpr/necompiler.py you will see some
 code like:

     # All the opcodes can deal with strided arrays directly as
     # long as they are unidimensional (strides in other
     # dimensions are dealt within the extension), so we don't
     # need a copy for the strided case.
     if not b.flags.aligned:
         ...

 which you can replace with something like:

     # need a copy for the strided case.
     if VML_available and not b.flags.contiguous:
         b = b.copy()
     elif not b.flags.aligned:
         ...

 That would be enough for ensuring that all the arrays are contiguous
 when they hit numexpr's virtual machine.

Ah I see, that's not difficult. I thought copying was done in the virtual 
machine. (didn't read all the code ...)
 That being said, it is a shame that VML does not have support for
 strided/unaligned arrays.  They are quite common beasts, especially
 when you work with heterogeneous arrays (aka record arrays).

I have the impression that you can already feel happy if these 
mathematical libraries support a C interface, not only Fortran. At least 
the Intel VML provides functions to pack/unpack strided arrays, which 
seem to work on a broader parameter range than specified (also zero or 
negative step sizes).
 I also gave a quick try to the
 equivalent vector math library, acml_mv of AMD. I only tried sin and
 log; they gave me the same performance (on an Intel processor!) as
 Intel's VML.

 I was also playing around with the block size in numexpr. What is
 the rationale that led to the current block size of 128? Especially
 with VML, a larger block size of 4096 instead of 128 allowed
 efficient use of multithreading in VML.

 Experimentation.  Back in 2006 David found that 128 was optimal for the
 processors available at that time.  With Numexpr 1.1, my experiments
 show that 256 is a better value for current Core2 processors and most
 expressions in our benchmark bed (see the benchmarks/ directory); hence,
 256 is the new value for the chunksize in 1.1.  However, bear in mind
 that 256 has to be multiplied by the itemsize of each array, so the
 chunksize is currently 2048 bytes for 64-bit items (int64 or float64)
 and 4096 for double-precision complex arrays, which are probably the
 sizes that have to be compared with VML.

So the optimum block size might depend on the type of expression and 
whether VML functions are used. One question: the block size is set by 
a #define; is there significantly poorer performance if you use a 
variable instead? It would be more flexible, especially for testing and 
tuning.
 I was missing the support for single precision floats.

 Yeah.  This is because nobody has implemented it before, but it is
 completely doable.



Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Ted Horst
Note that Apple has a similar library called vForce:

http://developer.apple.com/ReleaseNotes/Performance/RN-vecLib/index.html
http://developer.apple.com/documentation/Performance/Conceptual/vecLib/Reference/reference.html


I think these libraries use several techniques and are not necessarily
dependent on SSE.  The Apple versions appear to only support float and
double (no complex), and I don't see anything about strided arrays.
At one point I thought there was talk of adding support for vForce
into the respective ufuncs.  I don't know if anybody followed up on
that.

On 2009-01-16, at 10:48, Francesc Alted wrote:

 Wow, pretty nice speed-ups indeed!  In fact I was thinking of including
 support for threading in Numexpr (I don't think it would be too
 difficult, but let's see).  BTW, do you know how VML is able to achieve
 a speedup of 6x for a sin() function?  I suppose this is because they
 are using SSE instructions, but are these also available for 64-bit
 double precision items?



Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Olivier Grisel
2009/1/16 Gregor Thalhammer gregor.thalham...@gmail.com:
 Francesc Alted wrote:

 Wow, pretty nice speed-ups indeed!  In fact I was thinking of including
 support for threading in Numexpr (I don't think it would be too
 difficult, but let's see).  BTW, do you know how VML is able to achieve
 a speedup of 6x for a sin() function?  I suppose this is because they
 are using SSE instructions, but are these also available for 64-bit
 double precision items?

 I am not an expert on SSE instructions, but to my knowledge there exists
 (in the Core 2 architecture) no SSE instruction to calculate the sin.
 But it seems to be possible to (approximately) calculate a sin with a
 couple of multiplication/addition instructions (and those exist in SSE
 for 64-bit floats). Intel (and AMD) seem to use a more clever algorithm,
Here is the lib I use for SSE implementations of the transcendental
functions:

(only single-precision floats though).

-- 
Olivier