Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Francesc Alted
On 11/7/12 8:41 PM, Neal Becker wrote:
 Would you expect numexpr without MKL to give a significant boost?

Yes.  Have a look at how numexpr's own multi-threaded virtual machine 
compares with numexpr using VML:

http://code.google.com/p/numexpr/wiki/NumexprVML

As can be seen, the best results are obtained by using the 
multi-threaded VM in numexpr in combination with a single-threaded VML 
engine.  Caution: I did these benchmarks some time ago (a couple of 
years?), so multi-threaded VML may well have improved since then.  If 
performance is critical, you should run some experiments first to find 
the optimal configuration.
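
For instance, a quick way to experiment (a minimal sketch; note that 
set_vml_num_threads() is only available when numexpr is compiled 
against VML):

import numexpr as ne
import numpy as np

x = np.random.rand(int(1e7))

ne.set_num_threads(8)      # numexpr's own threads (adjust to your cores)
ne.set_vml_num_threads(1)  # keep the VML engine single-threaded
y = ne.evaluate("exp(x)")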

At any rate, VML will let you optimally leverage the SIMD 
instructions in the cores, computing, for example, exp() in 1 
or 2 clock cycles per element (depending on the vector length, the 
number of cores in your system and the data precision):

http://software.intel.com/sites/products/documentation/hpc/mkl/vml/functions/exp.html

Pretty amazing.

-- 
Francesc Alted



Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Francesc Alted
On 11/8/12 12:35 AM, Chris Barker wrote:
 On Wed, Nov 7, 2012 at 11:41 AM, Neal Becker ndbeck...@gmail.com wrote:
 Would you expect numexpr without MKL to give a significant boost?
 It can, depending on the use case:
   -- It can remove a lot of unnecessary temporary creation.
   -- IIUC, it works on blocks of data at a time, and thus can keep
 things in cache more when working with large data sets.

Well, the temporaries are still created, but the thing is that, by 
working with small blocks at a time, these temporaries fit in the CPU 
cache, preventing copies into main memory.  I like to call this the 
'blocking technique', as explained in slide 26 (and following) of:

https://python.g-node.org/wiki/_media/starving_cpu/starving-cpu.pdf

A better technique is to reduce the block size to the minimum (1 
element), so temporaries are stored in CPU registers instead of small 
blocks in cache, avoiding copies even in *cache*.  Numba 
(https://github.com/numba/numba) follows this approach, which is pretty 
optimal, as can be seen in slide 37 of the lecture above.
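
To make the contrast concrete, a minimal sketch (the numba variant is 
my assumption of how you would write it with @vectorize):

import numpy as np
import numexpr as ne
from numba import vectorize

a = np.random.rand(int(1e7))
b = np.random.rand(int(1e7))

# NumPy: every subexpression (b**2, 3*b**2, 2*a, ...) materializes a
# full-size temporary array in main memory
r1 = 2*a + 3*b**2

# numexpr: the same expression is evaluated block by block, so the
# temporaries stay in cache
r2 = ne.evaluate("2*a + 3*b**2")

# numba: the whole expression is compiled, so per-element temporaries
# live in registers
@vectorize(['float64(float64, float64)'])
def f(a, b):
    return 2*a + 3*b**2

r3 = f(a, b)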

-- It can (optionally) use multiple threads for easy parallelization.

No, the *total* number of cores detected in the system is the default in 
numexpr; if you want fewer, you will need to use the 
set_num_threads(nthreads) function.  But agreed, sometimes using too 
many threads can effectively be counterproductive.

-- 
Francesc Alted



Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Dag Sverre Seljebotn
On 11/07/2012 08:41 PM, Neal Becker wrote:
 Would you expect numexpr without MKL to give a significant boost?

If you need higher performance than what numexpr can give without using 
MKL, you could look at code such as this:

https://github.com/herumi/fmath/blob/master/fmath.hpp#L480

But that means going to C (e.g., by wrapping that function in Cython). 
Pay attention to the range over which you evaluate the function, though 
(my eyes may deceive me, but it seems that the test program only tests 
arguments drawn from the standard Gaussian, which is a bit limited...).
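
If you want to try it from Python without writing a full Cython 
wrapper, a ctypes sketch along these lines could do for a first 
experiment (all names here are assumptions: it presumes you compiled a 
tiny C shim like extern "C" void expd_v(double *px, size_t n) around 
fmath::expd_v into libfmathwrap.so, and that it transforms the array 
in place):

import ctypes
import numpy as np

lib = ctypes.CDLL("./libfmathwrap.so")  # hypothetical shim library
lib.expd_v.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_size_t]
lib.expd_v.restype = None

x = np.linspace(-700, 700, int(1e6))  # test a wide range, not just N(0, 1)
y = x.copy()
lib.expd_v(y.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), y.size)
# y should now hold exp(x); compare against np.exp(x) for accuracy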

Dag Sverre


Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Chris Barker
On Thu, Nov 8, 2012 at 2:22 AM, Francesc Alted franc...@continuum.io wrote:

   -- It can remove a lot of unnecessary temporary creation.

 Well, the temporaries are still created, but the thing is that, by
 working with small blocks at a time, these temporaries fit in CPU cache,
 preventing copies into main memory.

hmm -- I thought it was smart enough to remove some unnecessary
temporaries altogether. Shows what I know. But apparently it does,
indeed, avoid creating the full-size temporary arrays.

pretty cool stuff, in any case.

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR        (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Francesc Alted
On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
 On 11/07/2012 08:41 PM, Neal Becker wrote:
 Would you expect numexpr without MKL to give a significant boost?
 If you need higher performance than what numexpr can give without using
 MKL, you could look at code such as this:

 https://github.com/herumi/fmath/blob/master/fmath.hpp#L480

Hey, that's cool.  I was a bit disappointed not finding this sort of 
work in the open-source world.  It seems that this lacks threading 
support, but that should be easy to add using OpenMP directives.

-- 
Francesc Alted



Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Dag Sverre Seljebotn
On 11/08/2012 06:06 PM, Francesc Alted wrote:
 On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
 On 11/07/2012 08:41 PM, Neal Becker wrote:
 Would you expect numexpr without MKL to give a significant boost?
 If you need higher performance than what numexpr can give without using
 MKL, you could look at code such as this:

 https://github.com/herumi/fmath/blob/master/fmath.hpp#L480

 Hey, that's cool.  I was a bit disappointed not finding this sort of
 work in the open-source world.  It seems that this lacks threading support, but
 that should be easy to add using OpenMP directives.

IMO this is the wrong place to introduce threading; each thread should 
call expd_v on its chunks. (Which I think is how you said numexpr 
currently uses VML anyway.)

Dag Sverre





Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Francesc Alted
On 11/8/12 6:38 PM, Dag Sverre Seljebotn wrote:
 On 11/08/2012 06:06 PM, Francesc Alted wrote:
 On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
 On 11/07/2012 08:41 PM, Neal Becker wrote:
 Would you expect numexpr without MKL to give a significant boost?
 If you need higher performance than what numexpr can give without using
 MKL, you could look at code such as this:

 https://github.com/herumi/fmath/blob/master/fmath.hpp#L480
 Hey, that's cool.  I was a bit disappointed not finding this sort of
 work in the open-source world.  It seems that this lacks threading support, but
 that should be easy to add using OpenMP directives.
 IMO this is the wrong place to introduce threading; each thread should
 call expd_v on its chunks. (Which I think is how you said numexpr
 currently uses VML anyway.)

Oh sure, but then you need a blocked engine for performing the 
computations too.  And yes, by default numexpr uses its own threading 
code rather than the existing one in VML (but that can be changed by 
playing with set_num_threads/set_vml_num_threads).  It always struck 
me as a little strange that the internal threading in numexpr was more 
efficient than VML's, but I suppose this is because the latter is more 
optimized to deal with large blocks instead of the medium-sized ones 
(4K) used in numexpr.

-- 
Francesc Alted



Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Dag Sverre Seljebotn
On 11/08/2012 06:59 PM, Francesc Alted wrote:
 On 11/8/12 6:38 PM, Dag Sverre Seljebotn wrote:
 On 11/08/2012 06:06 PM, Francesc Alted wrote:
 On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
 On 11/07/2012 08:41 PM, Neal Becker wrote:
 Would you expect numexpr without MKL to give a significant boost?
 If you need higher performance than what numexpr can give without using
 MKL, you could look at code such as this:

 https://github.com/herumi/fmath/blob/master/fmath.hpp#L480
 Hey, that's cool.  I was a bit disappointed not finding this sort of
 work in the open-source world.  It seems that this lacks threading support, but
 that should be easy to add using OpenMP directives.
 IMO this is the wrong place to introduce threading; each thread should
 call expd_v on its chunks. (Which I think is how you said numexpr
 currently uses VML anyway.)

 Oh sure, but then you need a blocked engine for performing the
 computations too.  And yes, by default numexpr uses its own threading

I just meant that you can use a chunked OpenMP for-loop wherever in your 
code you call expd_v. A five-line blocked engine, if you like :-)

IMO that's the right location since entering/exiting OpenMP blocks takes 
some time.
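
In Python terms, the structure I mean is just this (a serial sketch; 
in C, the outer loop is the one you would put behind a single #pragma 
omp parallel for):

import numpy as np

def blocked_exp(x, out, block=4096):
    # each iteration handles one chunk; with OpenMP, one thread per chunk
    for i in range(0, x.size, block):
        np.exp(x[i:i+block], out=out[i:i+block])
    return out

x = np.random.rand(int(1e7))
out = np.empty_like(x)
blocked_exp(x, out)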

 code rather than the existing one in VML (but that can be changed by
 playing with set_num_threads/set_vml_num_threads).  It always struck
 me as a little strange that the internal threading in numexpr was more
 efficient than VML's, but I suppose this is because the latter is more
 optimized to deal with large blocks instead of the medium-sized ones (4K)
 used in numexpr.

I don't know enough about numexpr to understand this :-)

I guess I just don't see the motivation to use VML threading or why it 
should be faster? If you pass a single 4K block to a threaded VML call 
then I could easily see lots of performance problems: a) 
starting/stopping threads or signalling the threads of a pool is a 
constant overhead per parallel section, b) unless you're very careful 
to only have VML touch the data, and VML always schedules elements in 
the exact same way, you're going to have the cache lines of that 4K 
block shuffled between L1 caches of different cores for different 
operations...

As I said, I'm mostly ignorant about how numexpr works, that's probably 
showing :-)

Dag Sverre


Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Dag Sverre Seljebotn
On 11/08/2012 07:55 PM, Dag Sverre Seljebotn wrote:
 On 11/08/2012 06:59 PM, Francesc Alted wrote:
 On 11/8/12 6:38 PM, Dag Sverre Seljebotn wrote:
 On 11/08/2012 06:06 PM, Francesc Alted wrote:
 On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
 On 11/07/2012 08:41 PM, Neal Becker wrote:
 Would you expect numexpr without MKL to give a significant boost?
 If you need higher performance than what numexpr can give without
 using
 MKL, you could look at code such as this:

 https://github.com/herumi/fmath/blob/master/fmath.hpp#L480
 Hey, that's cool.  I was a bit disappointed not finding this sort of
 work in the open-source world.  It seems that this lacks threading support, but
 that should be easy to add using OpenMP directives.
 IMO this is the wrong place to introduce threading; each thread should
 call expd_v on its chunks. (Which I think is how you said numexpr
 currently uses VML anyway.)

 Oh sure, but then you need a blocked engine for performing the
 computations too.  And yes, by default numexpr uses its own threading

 I just meant that you can use a chunked OpenMP for-loop wherever in your
 code you call expd_v. A five-line blocked engine, if you like :-)

 IMO that's the right location since entering/exiting OpenMP blocks takes
 some time.

 code rather than the existing one in VML (but that can be changed by
 playing with set_num_threads/set_vml_num_threads).  It always struck
 me as a little strange that the internal threading in numexpr was more
 efficient than VML's, but I suppose this is because the latter is more
 optimized to deal with large blocks instead of the medium-sized ones (4K)
 used in numexpr.

 I don't know enough about numexpr to understand this :-)

 I guess I just don't see the motivation to use VML threading or why it
 should be faster? If you pass a single 4K block to a threaded VML call
 then I could easily see lots of performance problems: a)
 starting/stopping threads or signalling the threads of a pool is a
 constant overhead per parallel section, b) unless you're very careful
 to only have VML touch the data, and VML always schedules elements in
 the exact same way, you're going to have the cache lines of that 4K
 block shuffled between L1 caches of different cores for different
 operations...

c) Your effective block size is then 4KB/ncores.

(Unless you scale the block size by ncores).

DS


Re: [Numpy-discussion] testing with amd libm/acml

2012-11-08 Thread Francesc Alted
On 11/8/12 7:55 PM, Dag Sverre Seljebotn wrote:
 On 11/08/2012 06:59 PM, Francesc Alted wrote:
 On 11/8/12 6:38 PM, Dag Sverre Seljebotn wrote:
 On 11/08/2012 06:06 PM, Francesc Alted wrote:
 On 11/8/12 1:41 PM, Dag Sverre Seljebotn wrote:
 On 11/07/2012 08:41 PM, Neal Becker wrote:
 Would you expect numexpr without MKL to give a significant boost?
 If you need higher performance than what numexpr can give without using
 MKL, you could look at code such as this:

 https://github.com/herumi/fmath/blob/master/fmath.hpp#L480
 Hey, that's cool.  I was a bit disappointed not finding this sort of
 work in the open-source world.  It seems that this lacks threading support, but
 that should be easy to add using OpenMP directives.
 IMO this is the wrong place to introduce threading; each thread should
 call expd_v on its chunks. (Which I think is how you said numexpr
 currently uses VML anyway.)
 Oh sure, but then you need a blocked engine for performing the
 computations too.  And yes, by default numexpr uses its own threading
 I just meant that you can use a chunked OpenMP for-loop wherever in your
 code you call expd_v. A five-line blocked engine, if you like :-)

 IMO that's the right location since entering/exiting OpenMP blocks takes
 some time.

Yes, this is precisely what I meant in the first place.

 code rather than the existing one in VML (but that can be changed by
 playing with set_num_threads/set_vml_num_threads).  It always struck
 me as a little strange that the internal threading in numexpr was more
 efficient than VML's, but I suppose this is because the latter is more
 optimized to deal with large blocks instead of the medium-sized ones (4K)
 used in numexpr.
 I don't know enough about numexpr to understand this :-)

 I guess I just don't see the motivation to use VML threading or why it
 should be faster? If you pass a single 4K block to a threaded VML call
 then I could easily see lots of performance problems: a)
 starting/stopping threads or signalling the threads of a pool is a
 constant overhead per parallel section, b) unless you're very careful
 to only have VML touch the data, and VML always schedules elements in
 the exact same way, you're going to have the cache lines of that 4K
 block shuffled between L1 caches of different cores for different
 operations...

 As I said, I'm mostly ignorant about how numexpr works, that's probably
 showing :-)

No, on the contrary, you rather hit the core of the issue (or part of 
it).  On the one hand, VML needs large blocks in order to maximize the 
performance of the pipeline, and on the other hand numexpr tries to 
minimize block size in order to make temporaries as small as possible 
(so avoiding the use of the higher level caches).  From this tension 
(and some benchmarking work) the size of 4K (btw, this is the number of 
*elements*, so the size is actually either 16 KB or 32 KB for single 
and double precision, respectively) was derived.  Incidentally, for 
numexpr with no VML support, the size is reduced to 1K elements (and 
perhaps it could be reduced a bit more, but anyway).
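
Spelled out:

4096 elements * 4 bytes (single precision) = 16 KB
4096 elements * 8 bytes (double precision) = 32 KB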

Anyway, this is way too low level to be discussed here, although we can 
continue on the numexpr list if you are interested in more details.

-- 
Francesc Alted



Re: [Numpy-discussion] testing with amd libm/acml

2012-11-07 Thread David Cournapeau
On Wed, Nov 7, 2012 at 12:35 PM, Neal Becker ndbeck...@gmail.com wrote:
 I'm trying to do a bit of benchmarking to see if amd libm/acml will help me.

 I got an idea that instead of building all of numpy/scipy and all of my custom
 modules against these libraries, I could simply use:

 LD_PRELOAD=/opt/amdlibm-3.0.2/lib/dynamic/libamdlibm.so:/opt/acml5.2.0/gfortran64/lib/libacml.so
 my program here

 I'm hoping that both numpy and my own dll's then will take advantage of these
 libraries.

 Do you think this will work?

Quite unlikely, depending on your configuration, because those 
libraries are rarely if ever ABI compatible (that's why they are such a 
pain to support).
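
If you want to see whether the preload changes anything at all, timing 
the same script with and without it is the simplest check (a minimal 
sketch; whether the symbols actually get interposed depends on how your 
numpy extensions were linked):

# bench_exp.py -- run twice and compare:
#   python bench_exp.py
#   LD_PRELOAD=/opt/amdlibm-3.0.2/lib/dynamic/libamdlibm.so python bench_exp.py
import time
import numpy as np

x = np.random.rand(int(1e7))
t0 = time.time()
for _ in range(10):
    np.exp(x)
print("np.exp: %.3f s" % (time.time() - t0))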

David


Re: [Numpy-discussion] testing with amd libm/acml

2012-11-07 Thread Neal Becker
David Cournapeau wrote:

 On Wed, Nov 7, 2012 at 12:35 PM, Neal Becker ndbeck...@gmail.com wrote:
 I'm trying to do a bit of benchmarking to see if amd libm/acml will help me.

 I got an idea that instead of building all of numpy/scipy and all of my
 custom modules against these libraries, I could simply use:

 
LD_PRELOAD=/opt/amdlibm-3.0.2/lib/dynamic/libamdlibm.so:/opt/acml5.2.0/gfortran64/lib/libacml.so
 my program here

 I'm hoping that both numpy and my own dll's then will take advantage of these
 libraries.

 Do you think this will work?
 
 Quite unlikely depending on your configuration, because those
 libraries are rarely if ever ABI compatible (that's why it is such a
 pain to support).
 
 David

When you say quite unlikely (to work), do you mean

a) unlikely that libm/acml will be used to resolve symbols in numpy/dlls at 
runtime (e.g., exp)?

or 

b) the program may produce wrong results and/or crash?



Re: [Numpy-discussion] testing with amd libm/acml

2012-11-07 Thread David Cournapeau
On Wed, Nov 7, 2012 at 1:56 PM, Neal Becker ndbeck...@gmail.com wrote:
 David Cournapeau wrote:

 On Wed, Nov 7, 2012 at 12:35 PM, Neal Becker ndbeck...@gmail.com wrote:
 I'm trying to do a bit of benchmarking to see if amd libm/acml will help me.

 I got an idea that instead of building all of numpy/scipy and all of my
 custom modules against these libraries, I could simply use:


 LD_PRELOAD=/opt/amdlibm-3.0.2/lib/dynamic/libamdlibm.so:/opt/acml5.2.0/gfortran64/lib/libacml.so
 my program here

 I'm hoping that both numpy and my own dll's then will take advantage of 
 these
 libraries.

 Do you think this will work?

 Quite unlikely depending on your configuration, because those
 libraries are rarely if ever ABI compatible (that's why it is such a
 pain to support).

 David

 When you say quite unlikely (to work), you mean

 a) unlikely that libm/acml will be used to resolve symbols in numpy/dlls at
 runtime (e.g., exp)?

 or

 b) the program may produce wrong results and/or crash?

Both, actually. That's not something I would use myself. Did you try 
OpenBLAS? It is open source, simple to build, and pretty fast.

David


Re: [Numpy-discussion] testing with amd libm/acml

2012-11-07 Thread Neal Becker
David Cournapeau wrote:

 On Wed, Nov 7, 2012 at 1:56 PM, Neal Becker ndbeck...@gmail.com wrote:
 David Cournapeau wrote:

 On Wed, Nov 7, 2012 at 12:35 PM, Neal Becker ndbeck...@gmail.com wrote:
 I'm trying to do a bit of benchmarking to see if amd libm/acml will help
 me.

 I got an idea that instead of building all of numpy/scipy and all of my
 custom modules against these libraries, I could simply use:


 
LD_PRELOAD=/opt/amdlibm-3.0.2/lib/dynamic/libamdlibm.so:/opt/acml5.2.0/gfortran64/lib/libacml.so
 my program here

 I'm hoping that both numpy and my own dll's then will take advantage of
 these libraries.

 Do you think this will work?

 Quite unlikely depending on your configuration, because those
 libraries are rarely if ever ABI compatible (that's why it is such a
 pain to support).

 David

 When you say quite unlikely (to work), you mean

 a) unlikely that libm/acml will be used to resolve symbols in numpy/dlls at
 runtime (e.g., exp)?

 or

 b) the program may produce wrong results and/or crash?
 
 Both, actually. That's not something I would use myself. Did you try
 OpenBLAS? It is open source, simple to build, and pretty fast.
 
 David

Actually, for my current work, I'm more concerned with speeding up operations 
such as exp, log and basic vector arithmetic.  Any thoughts on that?



Re: [Numpy-discussion] testing with amd libm/acml

2012-11-07 Thread Neal Becker
David Cournapeau wrote:

 On Wed, Nov 7, 2012 at 1:56 PM, Neal Becker ndbeck...@gmail.com wrote:
 David Cournapeau wrote:

 On Wed, Nov 7, 2012 at 12:35 PM, Neal Becker ndbeck...@gmail.com wrote:
 I'm trying to do a bit of benchmarking to see if amd libm/acml will help
 me.

 I got an idea that instead of building all of numpy/scipy and all of my
 custom modules against these libraries, I could simply use:


 
LD_PRELOAD=/opt/amdlibm-3.0.2/lib/dynamic/libamdlibm.so:/opt/acml5.2.0/gfortran64/lib/libacml.so
 my program here

 I'm hoping that both numpy and my own dll's then will take advantage of
 these libraries.

 Do you think this will work?

 Quite unlikely depending on your configuration, because those
 libraries are rarely if ever ABI compatible (that's why it is such a
 pain to support).

 David

 When you say quite unlikely (to work), you mean

 a) unlikely that libm/acml will be used to resolve symbols in numpy/dlls at
 runtime (e.g., exp)?

 or

 b) the program may produce wrong results and/or crash?
 
 Both, actually. That's not something I would use myself. Did you try
 OpenBLAS? It is open source, simple to build, and pretty fast.
 
 David

In my current work, probably the largest bottlenecks are 'max*', which are

log(\sum_i e^{x_i})
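
i.e., in NumPy terms something like this (with the usual trick of 
factoring out the max so the exps can't overflow; IIRC scipy ships a 
logsumexp as well):

import numpy as np

def logsumexp(x):
    # log(sum(exp(x))), computed stably by factoring out max(x)
    m = x.max()
    return m + np.log(np.exp(x - m).sum())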




Re: [Numpy-discussion] testing with amd libm/acml

2012-11-07 Thread Dag Sverre Seljebotn
On 11/07/2012 03:30 PM, Neal Becker wrote:
 David Cournapeau wrote:

 On Wed, Nov 7, 2012 at 1:56 PM, Neal Becker ndbeck...@gmail.com wrote:
 David Cournapeau wrote:

 On Wed, Nov 7, 2012 at 12:35 PM, Neal Becker ndbeck...@gmail.com wrote:
 I'm trying to do a bit of benchmarking to see if amd libm/acml will help
 me.

 I got an idea that instead of building all of numpy/scipy and all of my
 custom modules against these libraries, I could simply use:



 LD_PRELOAD=/opt/amdlibm-3.0.2/lib/dynamic/libamdlibm.so:/opt/acml5.2.0/gfortran64/lib/libacml.so
 my program here

 I'm hoping that both numpy and my own dll's then will take advantage of
 these libraries.

 Do you think this will work?

 Quite unlikely depending on your configuration, because those
 libraries are rarely if ever ABI compatible (that's why it is such a
 pain to support).

 David

 When you say quite unlikely (to work), you mean

 a) unlikely that libm/acml will be used to resolve symbols in numpy/dlls at
 runtime (e.g., exp)?

 or

 b) the program may produce wrong results and/or crash?

 Both, actually. That's not something I would use myself. Did you try
 OpenBLAS? It is open source, simple to build, and pretty fast.

 David

 In my current work, probably the largest bottlenecks are 'max*', which are

 log(\sum_i e^{x_i})

numexpr with Intel VML is the solution I know of that doesn't require 
you to dig into compiling C code yourself.  Did you look into that, or 
is using Intel VML/MKL not an option?

Fast exps depend on the CPU evaluating many exps at the same time (both 
explicitly through vector registers, and implicitly through pipelining); 
even if you get what you're trying to do to work (which I think is 
unlikely), the approach is inherently slow, since passing a single 
number at a time through the exp function can't be efficient.
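
You can see the effect easily (a rough sketch; most of the gap in the 
scalar loop is interpreter overhead, but the hardware argument 
compounds it):

import math, time
import numpy as np

x = np.random.rand(int(1e6))

t0 = time.time()
y1 = np.array([math.exp(v) for v in x])  # one value at a time
t1 = time.time()
y2 = np.exp(x)                           # many values at a time
t2 = time.time()
print("scalar loop: %.3f s   np.exp: %.3f s" % (t1 - t0, t2 - t1))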

Dag Sverre


Re: [Numpy-discussion] testing with amd libm/acml

2012-11-07 Thread Neal Becker
Would you expect numexpr without MKL to give a significant boost?



Re: [Numpy-discussion] testing with amd libm/acml

2012-11-07 Thread Chris Barker
On Wed, Nov 7, 2012 at 11:41 AM, Neal Becker ndbeck...@gmail.com wrote:
 Would you expect numexpr without MKL to give a significant boost?

It can, depending on the use case:
 -- It can remove a lot of unnecessary temporary creation.
 -- IIUC, it works on blocks of data at a time, and thus can keep
things in cache more when working with large data sets.
 -- It can (optionally) use multiple threads for easy parallelization.

All you can do is try it on your use-case and see what you get. It's a
pretty light lift to try.
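
E.g., something like this (a minimal sketch):

import timeit

setup = ("import numpy as np; import numexpr as ne; "
         "x = np.random.rand(int(1e7))")
print(timeit.timeit("np.exp(x)", setup=setup, number=10))
print(timeit.timeit("ne.evaluate('exp(x)')", setup=setup, number=10))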

-Chris






-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR        (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov