Re: [Numpy-discussion] Changing FFT cache to a bounded LRU cache

2016-06-01 Thread Gregor Thalhammer

> On 31.05.2016 at 23:36, Sturla Molden wrote:
> 
> Joseph Martinot-Lagarde  wrote:
> 
>> The problem with FFTW is that its license is more restrictive (GPL), and
>> because of this may not be suitable everywhere numpy.fft is.
> 
> A lot of us use NumPy linked with MKL or Accelerate, both of which have
> some really nifty FFTs. And the license issue is hardly any worse than
> linking with them for BLAS and LAPACK, which we do anyway. We could extend
> numpy.fft to use MKL or Accelerate when they are available.

It seems the Anaconda numpy binaries already use MKL for FFT:

In [2]: np.fft.using_mklfft
Out[2]: True

Is this based on a proprietary patch of numpy?
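For reference, np.fft.using_mklfft does not appear in upstream numpy, so presumably it comes from the Anaconda build. A minimal way to inspect what a given numpy build links against (nothing beyond plain numpy assumed):

import numpy as np

np.__config__.show()   # lists the BLAS/LAPACK (and possibly MKL) libraries numpy was built with
print(getattr(np.fft, 'using_mklfft', 'attribute not present in this build'))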

Gregor 

> 
> Sturla
> 



Re: [Numpy-discussion] Numexpr-3.0 proposal

2016-02-15 Thread Gregor Thalhammer

> On 14.02.2016 at 23:19, Robert McLeod wrote:
> 
> Hello everyone,
> 
> I've done some work on making a new version of Numexpr that would fix some of 
> the limitations of the original virtual machine with regards to data types 
> and operation/function count. Basically I re-wrote the Python and C sides to 
> use 4-byte words, instead of null-terminated strings, for operations and 
> passing types.  This means the number of operations and types isn't 
> significantly limited anymore.
> 
> Francesc Alted suggested I should come here and get some advice from the 
> community. I wrote a short proposal on the Wiki here:
> 
> https://github.com/pydata/numexpr/wiki/Numexpr-3.0-Branch-Overview 
> 
> 
> One can see my branch here:
> 
> https://github.com/robbmcleod/numexpr/tree/numexpr-3.0 
> 
> 
> If anyone has any comments they'd be welcome. Questions from my side for the 
> group:
> 
> 1.) Numpy casting: I downloaded the Numpy source and after browsing it seems 
> the best approach is probably to just use 
> numpy.core.numerictypes.find_common_type?
> 
> 2.) Can anyone foresee any issues with casting built-in Python types (i.e. 
> float and integer) to their OS dependent numpy equivalents? Numpy already 
> seems to do this. 
> 
> 3.) Is anyone enabling the Intel VML library? There are a number of comments 
> in the code that suggest it's not accelerating the code. It also seems to 
> cause problems with bundling numexpr with cx_freeze.
> 
Dear Robert,

thanks for your effort on improving numexpr. Indeed, vectorized math libraries 
(VML) can give a large boost in performance (~5x), except for a couple of basic 
operations (add, mul, div), which current compilers are able to vectorize 
automatically. With recent gcc even more functions are vectorized, see 
https://sourceware.org/glibc/wiki/libmvec. But you need special flags depending 
on the platform (is SSE or AVX present?); runtime detection of processor 
capabilities would be nice for distributing binaries. Some time ago, since I 
lost access to Intel's MKL, I patched numexpr to use Accelerate/vecLib on OS X, 
which is preinstalled on every Mac, see the veclib_support branch of 
https://github.com/geggo/numexpr.git.

As you increased the opcode size, I could imagine providing a bit to switch 
(at runtime) between the internal functions and the vectorized ones; that would 
be handy for tests and benchmarks.
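As a quick way to benchmark this kind of thing, a small sketch (numexpr exposes 
a use_vml attribute reporting whether it was built with VML; the rest is just 
illustrative timing code):

import timeit
import numpy as np
import numexpr as ne

print("numexpr built with VML:", ne.use_vml)

x = np.random.rand(1000000)
t_np = timeit.timeit(lambda: np.exp(x), number=20)
t_ne = timeit.timeit(lambda: ne.evaluate("exp(x)"), number=20)
print("numpy exp: %.3f s, numexpr exp(x): %.3f s" % (t_np, t_ne))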

Gregor

> 4.) I took a stab at converting from distutils to setuptools but this seems 
> challenging with numpy as a dependency. I wonder if anyone has tried 
> monkey-patching so that setup.py build_ext uses distutils and then pass the 
> interpreter.pyd/so as a data file, or some other such chicanery?   
> 
> (I was going to ask about attaching a debugger, but I just noticed: 
> https://wiki.python.org/moin/DebuggingWithGdb 
>    )
> 
> Ciao,
> 
> Robert
> 
> -- 
> Robert McLeod, Ph.D.
> Center for Cellular Imaging and Nano Analytics (C-CINA)
> Biozentrum der Universität Basel
> Mattenstrasse 26, 4058 Basel
> Work: +41.061.387.3225 
> robert.mcl...@unibas.ch 
> robert.mcl...@bsse.ethz.ch 
> robbmcl...@gmail.com 



Re: [Numpy-discussion] Linking other libm-Implementation

2016-02-09 Thread Gregor Thalhammer

> On 09.02.2016 at 11:21, Nils Becker <nilsc.bec...@gmail.com> wrote:
> 
> 2016-02-08 18:54 GMT+01:00 Julian Taylor <jtaylor.deb...@googlemail.com>:
> > which version of glibm was used here? There are significant difference
> > in performance between versions.
> > Also the input ranges are very important for these functions, depending
> > on input the speed of these functions can vary by factors of 1000.
> >
> > glibm now includes vectorized versions of most math functions, does
> > openlibm have vectorized math?
> > That's where most speed can be gained, a lot more than 25%.
> 
> glibc 2.22 was used running on archlinux. As far as I know openlibm does not 
> include special vectorized functions. (for reference vectorized operations in 
> glibc: https://sourceware.org/glibc/wiki/libmvec).
> 
> 2016-02-08 23:32 GMT+01:00 Gregor Thalhammer <gregor.thalham...@gmail.com>:
> Years ago I made the vectorized math functions from Intel's Vector Math 
> Library (VML), part of MKL, available for numpy, see 
> https://github.com/geggo/uvml
> It is not particularly difficult; you do not even have to change numpy. For 
> some cases (e.g., exp) I have seen speedups of up to 5x-10x. Unfortunately MKL 
> is not free, and free vector math libraries like yeppp implement far fewer 
> functions or do not support the required strided memory layout. But to improve 
> performance, numexpr, numba or theano are much better.
> 
> Gregor
> 
> 
> Thank you very much for the link! I did not know about numpy.set_numeric_ops.
> You are right, vectorized operations can push down calculation time per 
> element by factors. The benchmarks done for the yeppp project also indicate 
> that (however far you would trust them: http://www.yeppp.info/benchmarks.html). 
> But I would agree that this domain should be left to specialized tools like 
> numexpr, as fully exploiting the speedup depends on the expression that is to 
> be calculated. It is not suitable as a standard for numpy.

Why should numpy not provide fast transcendental math functions? For linear 
algebra it supports fast implementations, even non-free ones (MKL). Wouldn't it 
be nice if numpy outperformed plain C?

> 
> Still, I think it would be good to give the possibility to choose the libm 
> numpy links against. And be it simply to allow to choose or guarantee a 
> specific accuracy/performance on different platforms and systems.
> Maybe maintaining a de-facto libm in npy_math could be replaced with a 
> dependency on e.g. openlibm. But such a decision would require a thorough 
> benchmark/testing of the available solutions. Especially with respect to the 
> accuracy-performance-tradeoff that was mentioned.
> 

Intel publishes accuracy/performance charts for VML/MKL:
https://software.intel.com/sites/products/documentation/doclib/mkl/vm/functions/_accuracyall.html

For GNU libc it is more difficult to find similarly precise data; I could only 
find:
http://www.gnu.org/software/libc/manual/html_node/Errors-in-Math-Functions.html

Gregor


> Cheers
> Nils
> 



Re: [Numpy-discussion] Linking other libm-Implementation

2016-02-08 Thread Gregor Thalhammer

> On 08.02.2016 at 18:36, Nathaniel Smith wrote:
> 
> On Feb 8, 2016 3:04 AM, "Nils Becker" wrote:
> >
> [...]
> > Very superficial benchmarks (see below) seem devastating for gnu libm. It 
> > seems that openlibm (compiled with gcc -mtune=native -O3) performs really 
> > well and intels libm implementation is the best (on my intel CPU). I did 
> > not check the accuracy of the functions, though.
> >
> > My own code uses a lot of trigonometric and complex functions (optics 
> > calculations). I'd guess it could go 25% faster by just using a better libm 
> > implementation. Therefore, I have an interest in getting sane linking to a 
> > defined libm implementation to work. 
> 
> On further thought: I guess that to do this we actually will need to change 
> the names of the functions in openlibm and then use those names when calling 
> from numpy. So long as we're using the regular libm symbol names, it doesn't 
> matter what library the python extensions themselves are linked to; the way 
> ELF symbol lookup works, the libm that the python interpreter is linked to 
> will be checked *before* checking the libm that numpy is linked to, so the 
> symbols will all get shadowed.
> 
> I guess statically linking openlibm would also work, but not sure that's a 
> great idea since we'd need it multiple places.
> 
> > Apparently openlibm seems quite a good choice for numpy, at least 
> > performance wise. However, I did not find any documentation or tests of the 
> > accuracy of its functions. A benchmarking and testing (for accuracy) code 
> > for libms would probably be a good starting point for a discussion. I could 
> > maybe help with that - but apparently not with any linking/building stuff 
> > (I just don't get it).
> >
> > Benchmark:
> >
> > gnu libm.so
> > 3000 x sin(double[10]):  6.68215647800389 s
> > 3000 x log(double[10]):  8.86350397899514 s
> > 3000 x exp(double[10]):  6.560557693999726 s
> >
> > openlibm.so
> > 3000 x sin(double[10]):  4.5058218560006935 s
> > 3000 x log(double[10]):  4.106520485998772 s
> > 3000 x exp(double[10]):  4.597905882001214 s
> >
> > Intel libimf.so
> > 3000 x sin(double[10]):  4.282402812998043 s
> > 3000 x log(double[10]):  4.008453270995233 s
> > 3000 x exp(double[10]):  3.30127963848 s
> 
> I would be highly suspicious that this speed comes at the expense of 
> accuracy... My impression is that there's a lot of room to make 
> speed/accuracy tradeoffs in these functions, and modern glibc's libm has seen 
> a fair amount of scrutiny by people who have access to the same code that 
> openlibm is based off of. But then again, maybe not :-).
> 
> If these are the operations that you care about optimizing, an even better 
> approach might be to figure out how to integrate a vector math library here 
> like yeppp (BSD licensed) or MKL. Libm tries to optimize log(scalar); these 
> are libraries that specifically try to optimize log(vector). Adding this 
> would require changing numpy's code to use these new APIs though. (Very new 
> gcc can also try to do this in some cases but I don't know how good at it it 
> is... Julian might.)
> 
Years ago I made the vectorized math functions from Intel's Vector Math Library 
(VML), part of MKL, available for numpy, see https://github.com/geggo/uvml
It is not particularly difficult; you do not even have to change numpy. For some 
cases (e.g., exp) I have seen speedups of up to 5x-10x. Unfortunately MKL is not 
free, and free vector math libraries like yeppp implement far fewer functions or 
do not support the required strided memory layout. But to improve performance, 
numexpr, numba or theano are much better.

Gregor

> -n
> 


Re: [Numpy-discussion] Introductory mail and GSoc Project Vector math library integration

2015-03-24 Thread Gregor Thalhammer

 On 12.03.2015 at 13:48, Julian Taylor jtaylor.deb...@googlemail.com wrote:
 
 On 03/12/2015 10:15 AM, Gregor Thalhammer wrote:
 
 Another note, numpy makes it easy to provide new ufuncs, see 
 http://docs.scipy.org/doc/numpy-dev/user/c-info.ufunc-tutorial.html
 from a C function that operates on 1D arrays, but this function needs to
 support arbitrary spacing (stride) between the items. Unfortunately, to
 achieve good performance, vector math libraries often expect that the
 items are laid out contiguously in memory. MKL/VML is a notable
 exception. So for non contiguous in- or output arrays you might need to
 copy the data to a buffer, which likely kills large amounts of the
 performance gain.
 
 The elementary functions are very slow even compared to memory access,
 they take in the orders of hundreds to tens of thousand cycles to
 complete (depending on range and required accuracy).
 Even in the case of strided access that gives the hardware prefetchers
 plenty of time to load the data before the previous computation is done.
 

That might apply to the mathematical functions from the standard libraries, but 
it is not true for the optimized libraries. Typical numbers are 4-10 CPU 
cycles per operation, see e.g. 
https://software.intel.com/sites/products/documentation/doclib/mkl_sa/112/vml/functions/_performanceall.html

The benchmarks at https://github.com/geggo/uvml show that access to main memory 
limits the performance of the exp calculation for large array sizes. This test 
was done quite some time ago; memory bandwidth is typically higher now, but so 
is computational power.


 This also removes the requirement from the library to provide a strided
 api, we can copy the strided data into a contiguous buffer and pass it
 to the library without losing much performance. It may not be optimal
 (e.g. a library can fine tune the prefetching better for the case where
 the hardware is not ideal) but most likely sufficient.

Copying the data to a buffer small enough to fit into the cache might add only a 
few cycles per element, but since the optimized operations themselves take just 
a few cycles, this already impacts performance significantly. Curious to see how 
much.

Gregor

 
 Figuring out how to best do it to get the best performance and still
 being flexible in what implementation is used is part of the challenge
 the student will face for this project.



Re: [Numpy-discussion] Introductory mail and GSoc Project Vector math library integration

2015-03-12 Thread Gregor Thalhammer

 On 11.03.2015 at 23:18, Dp Docs sdpa...@gmail.com wrote:
 
 
 
  On Wed, Mar 11, 2015 at 10:34 PM, Gregor Thalhammer 
  gregor.thalham...@gmail.com wrote:
 
 
  On the scipy mailing list I also answered to Amine, who is also interested 
  in this proposal.
 Can you provide the link to that discussion? I am having trouble finding it.
 
 Long time ago I wrote a package that provides fast math functions (ufuncs) for 
 numpy, using Intel's MKL/VML library, see https://github.com/geggo/uvml and my 
 comments there. This code could be easily ported to use other vector math 
 libraries.
 
 When MKL is not available for a system, will this integration work with the 
 default numpy math functions?
 
 Would be interesting to evaluate other possibilities. Due to the fact that MKL 
 is non-free, there are concerns to use it with numpy, although e.g. numpy and 
 scipy built with the MKL LAPACK routines are used frequently (Anaconda or 
 Christoph Gohlke's binaries).
 
  You can easily inject the fast math ufuncs into numpy, e.g. with 
  set_numeric_ops() or np.sin = vml.sin.
 
 Can you explain in a bit more detail or provide a link where I can see it?

My approach for https://github.com/geggo/uvml was to provide a separate python 
extension that provides faster numpy ufuncs for math operations like exp, sin, 
cos, … To replace the standard numpy ufuncs by the optimized ones you don't 
need to apply changes to the source code of numpy; instead, at runtime you 
monkey patch it and get faster math everywhere. Numpy even offers an interface 
(set_numeric_ops) to modify it at runtime.
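A minimal sketch of this monkey patching (the fastmath module standing in for a 
VML-backed extension is hypothetical; set_numeric_ops existed in numpy at the 
time of writing):

import numpy as np
import fastmath  # hypothetical extension module exposing VML-backed ufuncs

# replace individual functions ...
np.sin = fastmath.sin
np.exp = fastmath.exp

# ... or swap the operators numpy uses internally (add, multiply, ...)
np.set_numeric_ops(add=fastmath.add, multiply=fastmath.multiply)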

Another note, numpy makes it easy to provide new ufuncs, see 
http://docs.scipy.org/doc/numpy-dev/user/c-info.ufunc-tutorial.html
from a C function that operates on 1D arrays, but this function needs to 
support arbitrary spacing (stride) between the items. Unfortunately, to achieve 
good performance, vector math libraries often expect that the items are laid 
out contiguously in memory. MKL/VML is a notable exception. So for non 
contiguous in- or output arrays you might need to copy the data to a buffer, 
which likely kills large amounts of the performance gain. This does not 
completely rule out some of the libraries, since performance critical data is 
likely to be stored in contiguous arrays.

Using a library that supports only vector math for contiguous arrays is more 
difficult, but perhaps the numpy nditer provides everything needed. 

Gregor


Re: [Numpy-discussion] Introductory mail and GSoc Project Vector math library integration

2015-03-11 Thread Gregor Thalhammer

 On 08.03.2015 at 21:47, Dp Docs sdpa...@gmail.com wrote:
 
 Hi all,
 I am a CS 3rd Undergrad. Student from an Indian Institute (III T). I
 believe I am good in Programming languages like C/C++, Python as I
 have already done Some Projects using these language as a part of my
 academics. I really like Coding (Competitive as well as development).
 I really want to get involved in Numpy Development Project and want to
 take  Vector math library integration as a part of my project. I
 want to here any idea from your side for this project.
 Thanks For your time for reading this email and responding back.
 

On the scipy mailing list I also answered to Amine, who is also interested in 
this proposal. Long time ago I wrote a package that provides fast math 
functions (ufuncs) for numpy, using Intel's MKL/VML library, see 
https://github.com/geggo/uvml and my comments there. This code could be easily 
ported to use other vector math libraries. Would be interesting to evaluate 
other possibilities. Due to the fact that MKL is non-free, there are concerns 
to use it with numpy, although e.g. numpy and scipy built with the MKL LAPACK 
routines are used frequently (Anaconda or Christoph Gohlke's binaries).

You can easily inject the fast math ufuncs into numpy, e.g. with 
set_numeric_ops() or np.sin = vml.sin. 

Gregor


Re: [Numpy-discussion] Inverted indices

2014-08-07 Thread Gregor Thalhammer

On 07.08.2014 at 13:16, Nicolas P. Rougier nicolas.roug...@inria.fr wrote:

 
 Hi,
 
 I've a small problem for which I cannot find a solution and I'm quite sure 
 there is an obvious one:
 
 I've an array Z (any dtype) with some data.
 I've a (sorted) array I (of integer, same size as Z) that tells me the  index 
 of Z[i]  (if necessary, the index can be stored in Z).
 
 Now, I have an arbitrary sequence S of indices (in the sense of I), how do I 
 build the corresponding data ?
 
 Here is a small example:
 
 Z = [(0,0), (1,1), (2,2), (3,3), (4,4))
 I  = [0, 20, 23, 24, 37]
 
 S = [ 20,20,0,24]
 - Result should be [(1,1), (1,1), (0,0),(3,3)]
 
 S = [15,15]
 - Wrong (15 not in I) but ideally, I would like this to be converted to 
 [(0,0), (0,0)]
 
 
 Any idea ?
 

If I is sorted, I would propose to use a bisection algorithm, faster than 
linear search:

Z = array([(0,0), (1,1), (2,2), (3,3), (4,4)])
I = array([0, 20, 23, 24, 37])
S = array([ 20,20,0,24,15,27])

a = zeros(S.shape,dtype=int)
b = a + S.shape[0]-1
for i in range(int(log2(S.shape[0]))+2):
    c = (a+b) >> 1
    sel = I[c] <= S
    a[sel] = c[sel]
    b[~sel] = c[~sel]
Z[c]

If I[c] != S, then there is no corresponding index entry in I to match S.

Gregor



Re: [Numpy-discussion] Inverted indices

2014-08-07 Thread Gregor Thalhammer

On 07.08.2014 at 13:59, Gregor Thalhammer gregor.thalham...@gmail.com wrote:

 
 On 07.08.2014 at 13:16, Nicolas P. Rougier nicolas.roug...@inria.fr wrote:
 
 
 Hi,
 
 I've a small problem for which I cannot find a solution and I'm quite sure 
 there is an obvious one:
 
 I've an array Z (any dtype) with some data.
 I've a (sorted) array I (of integer, same size as Z) that tells me the  
 index of Z[i]  (if necessary, the index can be stored in Z).
 
 Now, I have an arbitrary sequence S of indices (in the sense of I), how do I 
 build the corresponding data ?
 
 Here is a small example:
 
 Z = [(0,0), (1,1), (2,2), (3,3), (4,4))
 I  = [0, 20, 23, 24, 37]
 
 S = [ 20,20,0,24]
 - Result should be [(1,1), (1,1), (0,0),(3,3)]
 
 S = [15,15]
 - Wrong (15 not in I) but ideally, I would like this to be converted to 
 [(0,0), (0,0)]
 
 
 Any idea ?
 
 
 If I is sorted, I would propose to use a bisection algorithm, faster than 
 linear search:
 
 Z = array([(0,0), (1,1), (2,2), (3,3), (4,4)])
 I = array([0, 20, 23, 24, 37])
 S = array([ 20,20,0,24,15,27])
 
 a = zeros(S.shape,dtype=int)
 b = a + S.shape[0]-1
 for i in range(int(log2(S.shape[0]))+2):
     c = (a+b) >> 1
     sel = I[c] <= S
     a[sel] = c[sel]
     b[~sel] = c[~sel]

or even simpler:
c = searchsorted(I, S)
Z[c]
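A self-contained sketch of this, including the fallback to index 0 for entries 
of S that do not occur in I (as asked for in the original question):

import numpy as np

Z = np.array([(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)])
I = np.array([0, 20, 23, 24, 37])
S = np.array([20, 20, 0, 24, 15, 15])

c = np.searchsorted(I, S)
c = np.clip(c, 0, len(I) - 1)   # keep indices valid if S exceeds I[-1]
c[I[c] != S] = 0                # entries not found in I fall back to index 0
print(Z[c])                     # [[1 1] [1 1] [0 0] [3 3] [0 0] [0 0]]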

Gregor

 Z[c]
 
 If I[c] != S, then there is no corresponding index entry in I to match S.
 
 Gregor



Re: [Numpy-discussion] GSoC project: draft of proposal

2014-03-14 Thread Gregor Thalhammer

On 13.03.2014 at 18:35, Leo Mao lmao20...@gmail.com wrote:

 Hi,
 
 Thanks a lot for your advice, Chuck.
 Following your advice, I have modified my draft of proposal. (attachment)
 I think it still needs more comments so that I can make it better.
 
 And I found that maybe I can also make some functions related to linalg (like 
 dot, svd or something else) faster by integrating a proper library into numpy.
 
 Regards,
 Leo Mao
 
Dear Leo,

large parts of your proposal are covered by the uvml package 
https://github.com/geggo/uvml
In my opinion you should also consider Intel's VML (part of MKL) as a candidate. 
(Yes I know, it is not free). To my best knowledge it provides many more 
vectorized functions than the open source alternatives. 
Concerning your time table, once you implemented support for one function, 
adding more functions is very easy. 

Gregor



Re: [Numpy-discussion] GSoC project: draft of proposal

2014-03-14 Thread Gregor Thalhammer

On 14.03.2014 at 11:00, Eric Moore e...@redtetrahedron.org wrote:

 
 
 On Friday, March 14, 2014, Gregor Thalhammer gregor.thalham...@gmail.com 
 wrote:
 
 On 13.03.2014 at 18:35, Leo Mao lmao20...@gmail.com wrote:
 
  Hi,
 
  Thanks a lot for your advice, Chuck.
  Following your advice, I have modified my draft of proposal. (attachment)
  I think it still needs more comments so that I can make it better.
 
  And I found that maybe I can also make some functions related to linalg 
  (like dot, svd or something else) faster by integrating a proper library 
  into numpy.
 
  Regards,
  Leo Mao
 
 Dear Leo,
 
 large parts of your proposal are covered by the uvml package
 https://github.com/geggo/uvml
 In my opinion you should also consider Intels VML (part of MKL) as a 
 candidate. (Yes I know, it is not free). To my best knowledge it provides 
 many more vectorized functions than the open source alternatives.
 Concerning your time table, once you implemented support for one function, 
 adding more functions is very easy.
 
 Gregor
 
 
  I'm not sure that your week-old project is enough to discourage this GSoC 
  project. In particular, it would be nice to be able to ship this directly as 
  part of numpy and that won't really be possible with MKL.
 
 Eric 
  

Hi,

it's not at all my intention to discourage this project. I hope Leo Mao can use 
the uvml package as a starting point for further improvements. Since most 
vectorized math libraries share a very similar interface, I think the actual 
choice of the library could be made a configurable option. Adapting uvml to use 
e.g. yeppp instead of MKL should be straightforward. Similar to numpy or scipy 
built with MKL's LAPACK and distributed by Enthought or Christoph Gohlke, using 
MKL should not be ruled out completely.

Gregor






Re: [Numpy-discussion] numpy gsoc topic idea: configurable algorithm precision and vector math library integration

2014-03-06 Thread Gregor Thalhammer

On 03.03.2014 at 20:20, Julian Taylor jtaylor.deb...@googlemail.com wrote:

 hi,
 
 as the numpy gsoc topic page is a little short on options I was thinking
 about adding two topics for interested students. But as I have no
 experience with gsoc or mentoring and the ideas are not very fleshed out
 yet I'd like to ask if it might make sense at all:
 
 


 2. vector math library integration
 
 some operations like powers, sin, cos etc are relatively slow in numpy
 depending on the c library used. There are now a few free libraries
 available that make use of modern hardware to speed these operations up,
 e.g. sleef and yeppp (also mkl but I have no interest in supporting
 non-free software)
 It might be interesting to investigate if these libraries can be
 integrated with numpy.
 This also somewhat ties in with the configurable precision mode as the
 vector math libraries often have different options depending on
 precision and speed requirements.

I have been exhuming an old package I once wrote that wraps the vectorized math 
functions from Intel's MKL for use with numpy. I made it available at 
https://github.com/geggo/uvml . I don't have access to MKL anymore, so no idea 
whether this package still works with recent numpy. If still useful, adapting 
to work with other libraries should not be difficult since they all provide a 
similar API. 
For serious work other packages like numexpr, numba or theano are much better. 
Nevertheless some might want to pick up this approach. 

Gregor


Re: [Numpy-discussion] Custom floating point representation to IEEE 754 double

2014-02-26 Thread Gregor Thalhammer

On 25.02.2014 at 12:08, Daniele Nicolodi dani...@grinta.net wrote:

 Hello,
 
 I'm dealing with an instrument that transfers numerical values through
 an RS232 port in a custom (?) floating point representation (56 bits, 4
 bits exponent and 52 bits significand).
 
 Of course I need to convert this format to a standard IEEE 754 double to
 be able to do anything useful with it.  I came up with this simple code:
 
  def tofloat(data):
      # 56 bits floating point representation
      # 4 bits exponent
      # 52 bits significand
      d = frombytes(data)
      l = 56
      p = l - 4
      e = int(d >> p) + 17
      v = 0
      for i in xrange(p):
          b = (d >> i) & 0x01
          v += b * pow(2, i - p + e)
      return v
 
 where frombytes() is a simple function that assembles 7 bytes read from
 the serial port into an integer for easing the manipulation:
 
  def frombytes(bytes):
      # convert from bytes string
      value = 0
      for i, b in enumerate(reversed(bytes)):
          value += b * (1 << (i * 8))
      return value
 
 I believe that tofloat() can be simplified a bit, but except
 optimizations (and cythonization) of this code, there is any simpler way
 of achieving this?
 

I have no ready-made code at hand, but an alternative approach would be to use 
the struct module and assemble a standard double from your data by fiddling at 
the byte level. Most of the data you can just copy; your 4-bit exponent needs 
to be expanded (and re-biased) to the 11-bit exponent field of an IEEE double. 
But since your data comes from a serial port, efficiency might not be important 
at all.
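A sketch of that idea without the bit-by-bit loop, reusing frombytes() and the 
layout/offset from the tofloat() code above (52-bit significand, 4-bit exponent, 
+17 offset); math.ldexp does the final scaling in one step:

import math

def tofloat_fast(data):
    d = frombytes(data)                  # 56-bit integer, as in the original code
    significand = d & ((1 << 52) - 1)
    exponent = (d >> 52) + 17
    # equivalent to significand * 2**(exponent - 52)
    return math.ldexp(significand, exponent - 52)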

Gregor



Re: [Numpy-discussion] numpythonically getting elements with the minimum sum

2013-01-29 Thread Gregor Thalhammer

On 28.1.2013 at 23:15, Lluís wrote:

 Hi,
 
 I have a somewhat convoluted N-dimensional array that contains information of 
 a
 set of experiments.
 
 The last dimension has as many entries as iterations in the experiment (an
 iterative application), and the penultimate dimension has as many entries as
 times I have run that experiment; the rest of dimensions describe the features
 of the experiment:
 
data.shape == (... indefinite amount of dimensions ..., NUM_RUNS, 
 NUM_ITERATIONS)
 
 So, what I want is to get the data for the best run of each experiment:
 
best.shape == (... indefinite amount of dimensions ..., NUM_ITERATIONS)
 
 by selecting, for each experiment, the run with the lowest total time (sum of
 the time of all iterations for that experiment).
 
 
 So far I've got the trivial part, but not the final indexing into data:
 
dsum = data.sum(axis = -1)
dmin = dsum.min(axis = -1)
best = data[???]
 
 
 I'm sure there must be some numpythonic and generic way to get what I want, 
 but
 fancy indexing is beating me here :)

Did you have a look at the argmin function? It delivers the indices of the 
minimum values along an axis. Untested guess:

dmin_idx = argmin(dsum, axis = -1)
best = data[..., dmin_idx, :]
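For reference, a variant of this that should handle the arbitrary leading 
dimensions (np.take_along_axis is from much newer numpy than this thread; the 
shapes below are made up):

import numpy as np

NUM_RUNS, NUM_ITERATIONS = 5, 7
data = np.random.rand(3, 4, NUM_RUNS, NUM_ITERATIONS)

dsum = data.sum(axis=-1)                 # total time of each run
dmin_idx = dsum.argmin(axis=-1)          # index of the best run per experiment
best = np.take_along_axis(
    data, dmin_idx[..., np.newaxis, np.newaxis], axis=-2).squeeze(axis=-2)
print(best.shape)                        # (3, 4, 7) == (..., NUM_ITERATIONS)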

Gregor



Re: [Numpy-discussion] mkl usage

2012-02-24 Thread Gregor Thalhammer

On 24.2.2012 at 13:54, Neal Becker wrote:

 Francesc Alted wrote:
 
 On Feb 23, 2012, at 2:19 PM, Neal Becker wrote:
 
 Pauli Virtanen wrote:
 
  On 23.02.2012 20:44, Francesc Alted wrote:
 On Feb 23, 2012, at 1:33 PM, Neal Becker wrote:
 
 Is mkl only used for linear algebra?  Will it speed up e.g., elementwise
 transendental functions?
 
 Yes, MKL comes with VML that has this type of optimizations:
 
 And also no, in the sense that Numpy and Scipy don't use VML.
 
 
 My question is:
 
 Should I purchase MKL?
 
 To what extent will it speed up my existing python code, without my having 
 to
 exert (much) effort?
 
 So that would be numpy/scipy.
 
 Pauli already answered you.  If you are restricted to use numpy/scipy and 
 your
 aim is to accelerate the evaluation of transcendental functions, then there 
 is
 no point in purchasing MKL.  If you can open your spectrum and use numexpr,
 then I think you should ponder about it.
 
 -- Francesc Alted
 
 Thanks.  One more thing, on theano I'm guessing MKL is required to be 
 installed 
 onto each host that would use it at runtime?  So I'd need a licensed copy for 
 each host that will execute the code?  I'm guessing that because theano needs 
 to 
 compile code at runtime.


No, this is not a must. Theano uses the MKL (BLAS) only for linear algebra (dot 
products). For linear algebra, Theano can also use numpy instead of directly 
linking to the MKL. So if you have numpy with an optimized BLAS library (MKL) 
available, you can indirectly use MKL without having it installed on each 
computer you use Theano. 

If you want to use fast transcendental functions, you should go for numexpr. 
Christoph Gohlke provides Windows binaries linked against MKL (for numpy and 
numexpr), so you don't need to purchase MKL to check how much performance 
increase you can gain.

Long ago I wrote a small package that injects the transcendental functions of 
MKL/VML into numpy, so you could get better performance without making any 
changes to your code. But my experience is that you gain only little because 
typically the performance is limited by memory bandwidth. Perhaps now on 
multi-core hosts you gain more due since the VML library uses multi-threading.

Gregor 


Re: [Numpy-discussion] Owndata flag

2011-12-16 Thread Gregor Thalhammer

On 16.12.2011 at 11:53, Fabrice Silva wrote:

 On Thursday 15 December 2011 at 18:09 +0100, Gregor Thalhammer wrote:
 
 There is an excellent blog entry from Travis Oliphant, that describes
 how to create a ndarray from existing data without copy:
 http://blog.enthought.com/?p=62
 The created array does not actually own the data, but its base
 attribute points to an object, which frees the memory if the numpy
 array gets deallocated. I guess this is the behavior you want to
 achieve. 
 Here is a cython implementation (for a uint8 array)
 
 Even better: the addendum!
 http://blog.enthought.com/python/numpy/simplified-creation-of-numpy-arrays-from-pre-allocated-memory/
 
 Within cython:
 cimport numpy
 numpy.set_array_base(my_ndarray,  PyCObject_FromVoidPtr(pointer_to_Cobj, 
 some_destructor))
 
 Seems OK.
 Any objections about that ?

This is ok, but CObject is deprecated as of Python 3.1, so it's not portable to 
Python 3.2.

Gregor


Re: [Numpy-discussion] Owndata flag

2011-12-15 Thread Gregor Thalhammer

On 15.12.2011 at 17:17, Fabrice Silva wrote:

 How can one arbitrarily assumes that an ndarray owns its data ?
 
 More explicitly, I have some temporary home-made C structure that holds
 a pointer to an array. I prepare (using Cython) an numpy.ndarray using
 the PyArray_NewFromDescr function. I can delete my temporary C structure
 without freeing the memory holding array, but I wish the numpy.ndarray
 becomes the owner of the data. 
 
 How can do I do such thing ?

There is an excellent blog entry from Travis Oliphant, that describes how to 
create a ndarray from existing data without copy: 
http://blog.enthought.com/?p=62
The created array does not actually own the data, but its base attribute points 
to an object, which frees the memory if the numpy array gets deallocated. I 
guess this is the behavior you want to achieve. 
Here is a cython implementation (for a uint8 array)

Gregor




# see 'NumPy arrays with pre-allocated memory', http://blog.enthought.com/?p=62

import numpy as np
from numpy cimport import_array, ndarray, npy_intp, set_array_base, \
    PyArray_SimpleNewFromData, NPY_DOUBLE, NPY_INT, NPY_UINT8

cdef extern from "stdlib.h":
    void* malloc(int size)
    void free(void *ptr)

cdef class MemoryReleaser:
    cdef void* memory

    def __cinit__(self):
        self.memory = NULL

    def __dealloc__(self):
        if self.memory:
            # release memory
            free(self.memory)
            print "memory released", hex(<long>self.memory)

cdef MemoryReleaser MemoryReleaserFactory(void* ptr):
    cdef MemoryReleaser mr = MemoryReleaser.__new__(MemoryReleaser)
    mr.memory = ptr
    return mr

cdef ndarray frompointer(void* ptr, int nbytes):
    import_array()
    #cdef int dims[1]
    #dims[0] = nbytes
    cdef npy_intp dims = <npy_intp>nbytes
    cdef ndarray arr = PyArray_SimpleNewFromData(1, &dims, NPY_UINT8, ptr)
    #TODO: check for error
    set_array_base(arr, MemoryReleaserFactory(ptr))
    return arr

def test_new_array_from_pointer():
    nbytes = 16
    cdef void* mem = malloc(nbytes)
    print "memory allocated", hex(<long>mem)
    return frompointer(mem, nbytes)



Re: [Numpy-discussion] polynomial with negative exponents

2011-12-12 Thread Gregor Thalhammer

On 12.12.2011 at 15:04, LASAGNA DAVIDE wrote:

 Hi,
 
 I have written a class for polynomials with negative 
 exponents like:
 
 p(x) = a0 + a1*x**-1 + ... + an*x**-n
 
 The code is this one:
 
 class NegativeExpPolynomial( object ):
     def __init__( self, coeffs ):
         self.coeffs = np.array( coeffs )
 
     def __call__( self, x ):
         return sum( c*x**(-i) for i, c in enumerate( self.coeffs ) )
 
 where coeffs = [a0, a1, ..., an].
 
 I find that the way i evaluate the polynomial is kind of 
 *slow*, especially for polynomial with order larger than 
 ~200 and for arrays x large enough.

I fear that such high orders create a lot of trouble, since evaluating them is 
very sensitive to numerical errors.

 Do you have suggestions on how to speed up this code?
 

Your polynomial with negative exponents is equivalent to a polynomial with 
positive exponents, but evaluated at 1/x (with the coefficient order reversed, 
since np.polyval expects the highest power first). Therefore you can make use 
of the efficient polynomial functions of numpy. Try

np.polyval(self.coeffs[::-1], 1.0/x)

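A quick numerical check of the 1/x substitution (made-up coefficients; polyval 
evaluates the reversed coefficients with Horner's scheme):

import numpy as np

coeffs = np.array([2.0, -1.0, 0.5])             # a0 + a1/x + a2/x**2
x = np.linspace(1.0, 5.0, 9)

direct = sum(c * x**(-i) for i, c in enumerate(coeffs))
horner = np.polyval(coeffs[::-1], 1.0 / x)      # same polynomial, evaluated at u = 1/x
print(np.allclose(direct, horner))              # True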

Gregor


Re: [Numpy-discussion] Help in speeding up accumulation in a matrix

2011-01-30 Thread Gregor Thalhammer

On 29.1.2011 at 22:01, Nicolas SCHEFFER wrote:

 Hi all,
 
 First email to the list for me, I just want to say how grateful I am
 to have python+numpy+ipython etc... for my day to day needs. Great
 combination of software.
 
 Anyway, I've been having this bottleneck in one my algorithms that has
 been bugging me for quite a while.
 The objective is to speed this part up. I've been doing tons of
 optimization and parallel processing around that piece of code to get
 a decent run time.
 
 The problem is easy. You want to accumulate in a matrix, a weighted
 sum of other matrices. Let's call this function scale_and_add:
  def scale_and_add_re(R,w,Ms):
      (nb_add,mdim,j)=np.shape(Ms)
      for i in range(nb_add):
          R+=w[i]*Ms[i]
      return R
 This 'for' loop bugs me since I know this will slow things down.
 
 But the dimension of these things are so large that any attempt to
 vectorize this is slower and takes too much memory.
 I typically work with 1000 weights and matrices, matrices of dimension
 of several hundred.
 
 My current config is:
 In [2]: %timeit scale_and_add_re(R,w,Ms)
 1 loops, best of 3: 392 ms per loop
 
 In [3]: w.shape
 Out[3]: (2000,)
 
 In [4]: Ms.shape
 Out[4]: (2000, 200, 200)
 I'd like to be able to double these dimensions.
 
 For instance I could use broadcasting by using a dot product
 %timeit dot(Ms.T,w)
 1 loops, best of 3: 1.77 s per loop
 But this is i) slower ii) takes too much memory
 (btw, I'd really need an inplace dot-product in numpy to avoid the
 copy in memory, like the blas call in scipy.linalg. But that's for an
 other thread...)

If you use a different memory layout for your data, you can improve your 
performance with dot:

MsT = Ms.T.copy()
%timeit np.dot(MsT,w)

10 loops, best of 3: 107 ms per loop

for comparison:
In [32]: %timeit scale_and_add_re(R,w,Ms)
1 loops, best of 3: 245 ms per loop

I don't think you can do much better than this. The above value corresponds to 
a memory bandwidth for accessing Ms of 6 GB/s; I think the hardware limit for 
my system is about 10 GB/s.
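For completeness, the same weighted sum can also be written with tensordot (or 
einsum in newer numpy versions), which avoids building the transposed copy by 
hand; whether it beats the dot() variant above depends on the numpy version and 
memory layout:

import numpy as np

w = np.random.rand(2000)
Ms = np.random.rand(2000, 200, 200)
R = np.zeros((200, 200))

R += np.tensordot(w, Ms, axes=(0, 0))       # sum_i w[i] * Ms[i]
# equivalently: R += np.einsum('i,ijk->jk', w, Ms)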

Gregor


Re: [Numpy-discussion] glumpy, fast opengl visualization

2010-01-27 Thread Gregor Thalhammer
2010/1/25 Nicolas Rougier nicolas.roug...@loria.fr:


 Hello,

 This is an update about glumpy, a fast-OpenGL based numpy visualization.
 I modified the code such that the only dependencies are PyOpenGL and
 IPython (for interactive sessions). You will also need matplotlib and
 scipy for some demos.

 Sources: hg clone http://glumpy.googlecode.com/hg/ glumpy
 No installation required, you can run all demos inplace.

 Homepage: http://code.google.com/p/glumpy/

Hi Nicolas,

thank you for providing glumpy. I started using it for my own project.
The previous, pyglet based version of glumpy worked flawlessly on my
system (WinXP). I want to report about problems with the latest
version.
1.) On Windows glumpy fails on importing the Python termios module,
since it is for Unix platforms only.
2.) Resolving this, the demos start, but are mostly not usable, since
mouse scroll events and passive mouse motions events are not created.
This might be a problem of the glut implementation on Windows (I am
using PyOpenGL 3.0.1b2). Who knows?
3.) On OpenSuse11.1 the demos fail at 'glCreateProgram'. The demos
used to work with the previous version. I use the Intel Q45 graphics
chipset.

In the future I might spend some time on this problems and I would
like to contribute to glumpy, even if it's only testing on platforms
available to me.

Gregor


Re: [Numpy-discussion] Objected-oriented SIMD API for Numpy

2009-10-22 Thread Gregor Thalhammer
2009/10/21 Neal Becker ndbeck...@gmail.com

 ...
  I once wrote a module that replaces the built in transcendental
  functions of numpy by optimized versions from Intels vector math
  library. If someone is interested, I can publish it. In my experience it
  was of little use since real world problems are limited by memory
  bandwidth. Therefore extending numexpr with optimized transcendental
  functions was the better solution. Afterwards I discovered that I could
  have saved the effort of the first approach since gcc is able to use
  optimized functions from Intels vector math library or AMD's math core
  library, see the doc's of -mveclibabi. You just need to recompile numpy
  with proper compiler arguments.
 

 I'm interested.  I'd like to try AMD rather than intel, because AMD is
 easier to obtain.  I'm running on intel machine, I hope that doesn't matter
 too much.

 What exactly do I need to do?

I once tried to recompile numpy with AMD's ACML. Unfortunately I lost the
settings after an upgrade. What I remember: install ACML (and read the docs
;-) ), mess with the compiler args (-mveclibabi and related), and link with
ACML. Then you get faster pow/sin/cos/exp. The transcendental functions of
ACML also work on Intel processors with the same performance. I did not
try the Intel SVML, which belongs to the Intel compilers.
This is different from the first approach, which is a small wrapper for Intel's
VML, put into a python module, and which can inject its ufuncs (via
numpy.set_numeric_ops) into numpy. If you want I can send the package by
private email.


 I see that numpy/site.cfg has an MKL section.  I'm assuming I should not
 touch that, but just mess with gcc flags?

This is for using the LAPACK provided by Intel's MKL. These settings are not
related to the above-mentioned compiler options.





Re: [Numpy-discussion] numpy ufuncs and COREPY - any info?

2009-05-22 Thread Gregor Thalhammer
dmitrey wrote:
 hi all,
 has anyone already tried to compare using an ordinary numpy ufunc vs
 that one from corepy, first of all I mean the project
 http://socghop.appspot.com/student_project/show/google/gsoc2009/python/t124024628235

 It would be interesting to know what is speedup for (eg) vec ** 0.5 or
 (if it's possible - it isn't pure ufunc) numpy.dot(Matrix, vec). Or
 any another example.
   
I have no experience with the mentioned CorePy, but recently I was 
playing around with accelerated ufuncs using Intel's Math Kernel Library 
(MKL). These improvements are now part of the numexpr package 
http://code.google.com/p/numexpr/
Some remarks on possible speed improvements on recent Intel x86 processors.
1) basic arithmetic ufuncs (add, sub, mul, ...) in standard numpy are 
fast (SSE is used) and speed is limited by memory bandwidth.
2) the speed of many transcendental functions (exp, sin, cos, pow, ...) 
can be improved by _roughly_ a factor of five (single core) by using the 
MKL. Most of the improvements stem from using faster algorithms with a 
vectorized implementation. Note: the speed improvement depends on a 
_lot_ of other circumstances.
3) Improving performance by using multi cores is much more difficult. 
Only for sufficiently large (>1e5) arrays is a significant speedup 
possible. Where a speed gain is possible, the MKL uses several cores. 
Some experimentation showed that adding a few OpenMP constructs you 
could get a similar speedup with numpy.
4) numpy.dot uses optimized implementations.

Gregor


Re: [Numpy-discussion] where are the benefits of ldexp and/or array times 2?

2009-05-22 Thread Gregor Thalhammer
dmitrey wrote:
 Hi all,
 I expected to have some speedup via using ldexp or multiplying an
 array by a power of 2 (doesn't it have to perform a simple shift of
 mantissa?), but I don't see the one.

 # Let me also note -
 # 1) using b = 2 * ones(N) or b = zeros(N) doesn't yield any speedup
 vs b = rand()
 # 2) using A * 2.0 (or mere 2) instead of 2.1 doesn't yield any
 speedup, despite it is exact integer power of 2.
   
On recent processors multiplication is very fast and takes 1.5 clock 
cycles (float, double precision), independent of the values. There is 
very little gain by using bit shift operators.

Gregor


Re: [Numpy-discussion] memoization with ndarray arguments

2009-03-22 Thread Gregor Thalhammer
Paul Northug wrote:
 I would like to 'memoize' the objective, derivative and hessian
 functions, each taking a 1d double ndarray argument X, that are passed
 as arguments to
 scipy.optimize.fmin_ncg.

 Each of these 3 functions has calculations in common that are
 expensive to compute and are a function of X. It seems fmin_ncg
 computes these quantities at the same X over the course of the
 optimization.

 How should I go about doing this?
   
Exactly for this purpose I was using something like:
cache[tuple(X)] = (subexpression1, subexpression2)
This worked fine for me. In your use case it might be enough to store 
only the latest result to avoid excessive memory usage, since typically 
the same X is used for consecutive calls of objective, derivative and 
hessian functions.
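A small sketch of the 'store only the latest result' variant (the *_from_common 
helpers and expensive_common are placeholders for however you split the shared 
expensive part from the cheap per-function parts):

_last_key = None
_last_common = None

def _common(X):
    # recompute the shared expensive quantities only when X changes
    global _last_key, _last_common
    key = tuple(X)
    if key != _last_key:
        _last_key, _last_common = key, expensive_common(X)   # placeholder
    return _last_common

def objective(X):
    return f_from_common(X, _common(X))      # placeholder

def fprime(X):
    return g_from_common(X, _common(X))      # placeholder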

Gregor
 numpy arrays are not hashable, maybe for a good reason. I tried anyway
 by  keeping a dict of hash(tuple(X)), but started having collisions.
 So I switched to md5.new(X).digest() as the hash function and it seems
 to work ok. In a quick search, I saw cPickle.dumps and repr are also
 used as key values.

 I am assuming this is a common problem with functions with numpy array
 arguments and was wondering what the best approach is (including not
 using memoization).
   



Re: [Numpy-discussion] Build with MKL on linux

2009-02-20 Thread Gregor Thalhammer
Neal Becker wrote:
 Trying to build numpy-1.2.1 with intel mkl 10.1.1.019 on linux F10 x86_64.

 echo $LD_LIBRARY_PATH
 /opt/intel/mkl/10.1.1.019/lib/em64t


strace -e trace=file python -c 'import numpy; numpy.test()' 2>stuff
 Running unit tests for numpy
 NumPy version 1.2.1
 NumPy is installed in /usr/lib64/python2.5/site-packages/numpy
 Python version 2.5.2 (r252:60911, Sep 30 2008, 15:42:03) [GCC 4.3.2 20080917
 (Red Hat 4.3.2-4)]
 nose version 0.10.3

 MKL FATAL ERROR: Cannot load neither libmkl_mc.so nor libmkl_def.so

 strace shows:
 ...
 open(/usr/lib64/python2.5/site-packages/numpy/core/tests/test_blasdot.py,
 O_RDONLY) = 3
 open(/usr/lib64/python2.5/site-packages/numpy/core/tests/test_blasdot.pyc,
 O_RDONLY) = 5
 open(/opt/intel/mkl/10.1.1.019/lib/em64t/libmkl_mc.so, O_RDONLY) = 3
 open(/opt/intel/mkl/10.1.1.019/lib/em64t/libmkl_mc.so, O_RDONLY) = 3
 open(/opt/intel/mkl/10.1.1.019/lib/em64t/libmkl_def.so, O_RDONLY) = 3
 open(/opt/intel/mkl/10.1.1.019/lib/em64t/libmkl_def.so, O_RDONLY) = 3

 The files are there and found, what could be wrong?

   
Try MKL version 10.0.2.018. Some more information about this problem you 
find on 
http://software.intel.com/en-us/forums/intel-math-kernel-library/topic/60460/

Gregor


Re: [Numpy-discussion] Fast threading solution thoughts

2009-02-12 Thread Gregor Thalhammer
Brian Granger wrote:
 I am curious: would you know what would be different in numpy's case
 compared to matlab array model concerning locks ? Matlab, up to
 recently, only spreads BLAS/LAPACK on multi-cores, but since matlab 7.3
 (or 7.4), it also uses multicore for mathematical functions (cos,
 etc...). So at least in matlab's model, it looks like it can be useful.
 

 Good point.  Is it possible to tell what array size it switches over
 to using multiple threads?  Also, do you happen to iknow how Matlab is
 doing this?

   
Recent Matlab versions use Intel's Math Kernel Library, which performs 
automatic multi-threading - also for mathematical functions like sin 
etc, but not for addition, multiplication etc. It seems to me Matlab 
itself does not take care of multi-threading. On
http://www.intel.com/software/products/mkl/data/vml/functions/_listfunc.html
you can have a look at the performance data of the MKL vectorized math 
functions. At a vector length of around 100-1000, depending on the 
function, precision and CPU architecture, they switch to multi-threading.

Gregor


[Numpy-discussion] numexpr and numpy windows binaries built with MKL

2009-02-12 Thread Gregor Thalhammer
Hi all,

as Francesc announced, the latest release of Numexpr 1.2 can be built 
with Intel's Math Kernel Library, which gives a big increase in 
performance. Now the questions: could somebody provide Windows binaries 
of Numexpr, linked with Intel's MKL? I know, there is the license 
problem. The same question has been discussed for numpy. Is it still 
true that Enthought is considering providing binaries based on MKL, see 
https://svn.enthought.com/epd/ticket/216 ?

Gregor



Re: [Numpy-discussion] Integer cast problems

2009-02-12 Thread Gregor Thalhammer
Ralph Kube wrote:
 Hi there,
 I have a little problem here with array indexing, hope you see the problem.
 I use the following loop to calculate some integrals

 ...
 0.145 * 0.005 = 28.996
 N.int32(0.145 * 0.005) = 28
conversion to int truncates, it doesn't round. Try
N.int32(0.145 * 0.005  +  0.5)

Gregor


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Gregor Thalhammer
Francesc Alted wrote:
 Numexpr is a fast numerical expression evaluator for NumPy.  With it,
 expressions that operate on arrays (like 3*a+4*b) are accelerated
 and use less memory than doing the same calculation in Python.

 The expected speed-ups for Numexpr respect to NumPy are between 0.95x
 and 15x, being 3x or 4x typical values.  The strided and unaligned
 case has been optimized too, so if the expresion contains such arrays, 
 the speed-up can increase significantly.  Of course, you will need to 
 operate with large arrays (typically larger than the cache size of your 
 CPU) to see these improvements in performance.

   
Just recently I had a more detailed look at numexpr. Clever idea, easy 
to use! I can confirm a typical performance gain of 3x if you work on 
large arrays (>100k entries).

I also gave a try to the vector math library (VML), contained in Intel's 
Math Kernel Library. This offers a fast implementation of mathematical 
functions, operating on array. First I implemented a C extension, 
providing new ufuncs. This gave me a big performance gain, e.g., 2.3x 
(5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x (6x) for 
division (no gain for add, sub, mul). The values in parantheses are 
given if I allow VML to use several threads and to employ both cores of 
my Intel Core2Duo computer. For large arrays (100M entries) this 
performance gain is reduced because of limited memory bandwidth. At this 
point I was stumbling across numexpr and modified it to use the VML 
functions. For sufficiently long and complex  numerical expressions I 
could get the maximum performance also for large arrays. Together with 
VML, numexpr seems to be an extremely powerful way to get optimum 
performance. I would like to see numexpr extended to (optionally) make 
use of fast vectorized math functions. There is one but: VML supports 
(at the moment) only math on contiguous arrays. At a first try I didn't 
understand how to enforce this limitation in numexpr. I also gave a 
quick try to the equivalent vector math library, acml_mv of AMD. I only 
tried sin and log; they gave me the same performance (on an Intel 
processor!) as Intel's VML.

I was also playing around with the block size in numexpr. What is the 
rationale that led to the current block size of 128? Especially with 
VML, a larger block size of 4096 instead of 128 allowed  to efficiently 
use multithreading in VML.
 Share your experience
 =

 Let us know of any bugs, suggestions, gripes, kudos, etc. you may
 have.

   
I was missing the support for single precision floats.

Great work!

Gregor


Re: [Numpy-discussion] ANN: Numexpr 1.1, an efficient array evaluator

2009-01-16 Thread Gregor Thalhammer
Francesc Alted wrote:
 On Friday 16 January 2009, Gregor Thalhammer wrote:
   
 I also gave a try to the vector math library (VML), contained in
 Intel's Math Kernel Library. This offers a fast implementation of
 mathematical functions, operating on array. First I implemented a C
 extension, providing new ufuncs. This gave me a big performance gain,
 e.g., 2.3x (5x) for sin, 6x (10x) for exp, 7x (15x) for pow, and 3x
 (6x) for division (no gain for add, sub, mul).
 

 Wow, pretty nice speed-ups indeed!  In fact I was thinking in including 
 support for threading in Numexpr (I don't think it would be too 
 difficult, but let's see).  BTW, do you know how VML is able to achieve 
 a speedup of 6x for a sin() function?  I suppose this is because they 
 are using SSE instructions, but, are these also available for 64-bit 
 double precision items?
   
I am not an expert on SSE instructions, but to my knowledge there exist 
(in the Core 2 architecture) no SSE instruction to calculate the sin. 
But it seems to be possible to (approximately) calculate a sin with a 
couple of multiplication/addition instructions (and they exist in SSE 
for 64-bit floats). Intel (and AMD) seem to use a more clever algorithm, 
implemented more efficiently than the standard implementation.
 Well, if you can provide the code, I'd be glad to include it in numexpr.  
 The only requirement is that the VML must be optional during the build 
 of the package.
   
Yes, I will try to provide you with a polished version of my changes, 
making them optional.
   
 There is one but: VML supports (at the moment) only math 
 on contiguous arrays. At a first try I didn't understand how to
 enforce this limitation in numexpr.
 

 No problem.  At the end of the numexpr/necompiler.py you will see some 
 code like:

 # All the opcodes can deal with strided arrays directly as
 # long as they are undimensional (strides in other
 # dimensions are dealt within the extension), so we don't
 # need a copy for the strided case.
 if not b.flags.aligned:
...

 which you can replace with something like:

 # need a copy for the strided case.
 if VML_available and not b.flags.contiguous: 
   b = b.copy()
 elif not b.flags.aligned:
   ...

 That would be enough for ensuring that all the arrays are contiguous 
 when they hit numexpr's virtual machine.
   
Ah I see, that's not difficult. I thought copying is done in the virtual 
machine. (didn't read all the code ...)
 Being said this, it is a shame that VML does not have support for 
 strided/unaligned arrays.  They are quite common beasts, specially when 
 you work with heterogeneous arrays (aka record arrays).
   
I have the impression that you can already feel happy if these 
mathematical libraries support a C interface, not only Fortran. At least 
the Intel VML provides functions to pack/unpack strided arrays which 
seem to work on a broader parameter range than specified (also zero or 
negative step sizes).
 I also gave a quick try to the 
 equivalent vector math library, acml_mv of AMD. I only tried sin and
 log, gave me the same performance (on a Intel processor!) like Intels
 VML .

 I was also playing around with the block size in numexpr. What are
 the rationale that led to the current block size of 128? Especially
 with VML, a larger block size of 4096 instead of 128 allowed  to
 efficiently use multithreading in VML.
 

 Experimentation.  Back in 2006 David found that 128 was optimal for the 
 processors available by that time.  With the Numexpr 1.1 my experiments 
 show that 256 is a better value for current Core2 processors and most 
 expressions in our benchmark bed (see benchmarks/ directory); hence, 
 256 is the new value for the chunksize in 1.1.  However, be in mind 
 that 256 has to be multiplied by the itemsize of each array, so the 
 chunksize is currently 2048 bytes for 64-bit items (int64 or float64) 
 and 4096 for double precision complex arrays, which are probably the 
 sizes that have to be compared with VML.
   
So the optimum block size might depend on the type of expression and on 
whether VML functions are used. One question: the block size is set by a 
#define; is there significantly poorer performance if you use a variable 
instead? That would be more flexible, especially for testing and tuning.
 I was missing the support for single precision floats.
 

 Yeah.  This is because nobody has implemented it before, but it is 
 completely doable.
   



Re: [Numpy-discussion] contiguous regions

2008-11-20 Thread Gregor Thalhammer
John Hunter wrote:
 I frequently want to break a 1D array into regions above and below
 some threshold, identifying all such subslices where the contiguous
 elements are above the threshold.  I have two related implementations
 below to illustrate what I am after.  The first crossings is rather
 naive in that it doesn't handle the case where an element is equal to
 the threshold (assuming zero for the threshold in the examples below).
  The second is correct (I think) but is pure python.  Has anyone got a
 nifty numpy solution for this?

 import numpy as np
 import matplotlib.pyplot as plt
 t = np.arange(0.0123, 2, 0.05)
 s = np.sin(2*np.pi*t)

   
here my proposal, needs some polishing:

mask = (s>0).astype(int8)
d = diff(mask)
idx, = d.nonzero()
# now handle the cases that s is above threshold at beginning or end of sequence
if d[idx[0]] == -1:
    idx = r_[0, idx]
if d[idx[-1]] == 1:
    idx = r_[idx, len(s)]
idx.shape = (-1,2)

Gregor



Re: [Numpy-discussion] Fast and efficient way to convert an array into binary

2008-11-18 Thread Gregor Thalhammer
frank wang wrote:
 Hi,
  
 I have a large array and I want to convert it into a binary array. For 
 exampe, y=array([1,2,3]), after the convertion I want the result 
 array([0,0,0,1,0,0,1,0,0,0,1,1]). Each digit is converted into 4 bits 
 in this example. In my real problem I want to convert each digit to 8 
 bits. My data is numpy.ndarray and the shape is, say, (1000,).
  
 Are there fast and efficient solution for this?
  
  
>>> a = array([1,2,3], dtype = uint8)
>>> unpackbits(a)
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
   1], dtype=uint8)

Gregor


[Numpy-discussion] failure with numpy.inner

2008-07-05 Thread Gregor Thalhammer
After upgrading to NumPy 1.1.0 (I installed 
numpy-1.1.0-win32-superpack-python2.5) I observed a fatal failure with 
the following code which uses numpy.inner

import numpy
F = numpy.zeros(shape = (1,79), dtype = numpy.float64)
#this succeeds
FtF = numpy.inner(F,F.copy())
#this fails
FtF = numpy.inner(F,F)

The failure (Exception code 0xc005) happens in _dotblas.pyd. I use 
Windows XP on an Intel Core2Duo system. Further, I could observe that it 
depends on the size of the array whether this failure appears or not.

I submitted this report also as ticket #844.

I hope somebody can track down the reason for this behaviour (and fix it).

Gregor Thalhammer