On Feb 20, 2012, at 7:08 PM, Dag Sverre Seljebotn wrote:

> On 02/20/2012 09:34 AM, Christopher Jordan-Squire wrote:
>> On Mon, Feb 20, 2012 at 9:18 AM, Dag Sverre Seljebotn
>> <[email protected]> wrote:
>>> On 02/20/2012 08:55 AM, Sturla Molden wrote:
>>>> On 20.02.2012 17:42, Sturla Molden wrote:
>>>>> There are still other options than C or C++ that are worth considering.
>>>>> One would be to write NumPy in Python. E.g. we could use LLVM as a
>>>>> JIT-compiler and produce the performance critical code we need on the fly.
>>>>>
>>>>>
>>>>
>>>> LLVM and its C/C++ frontend Clang are BSD licenced. It compiles faster
>>>> than GCC and often produces better machine code. They can therefore be
>>>> used inside an array library. It would give a faster NumPy, and we could
>>>> keep most of it in Python.
>>>
>>> I think it is moot to focus on improving NumPy performance as long as in
>>> practice all NumPy operations are memory bound due to the need to take a
>>> trip through system memory for almost any operation. C/C++ is simply
>>> "good enough". JIT is when you're chasing a 2x improvement or so, but
>>> today NumPy can be 10-20x slower than a Cython loop.
>>>
>>
>> I don't follow this. Could you expand a bit more? (Specifically, I
>> wasn't aware that numpy could be 10-20x slower than a cython loop, if
>> we're talking about the base numpy library--so core operations. I'm
>
> The problem with NumPy is the temporaries needed -- if you want to compute
>
> A + B + np.sqrt(D)
>
> then, if the arrays are larger than cache size (a couple of megabytes),
> then each of those operations will first transfer the data in and out
> over the memory bus. I.e. first you compute an element of sqrt(D), then
> the result of that is put in system memory, then later the same number
> is read back in order to add it to an element in B, and so on.
>
> The compute-to-bandwidth ratio of modern CPUs is between 30:1 and
> 60:1... so in extreme cases it's cheaper to do 60 additions than to
> transfer a single number from system memory.
>
> It is much faster to only transfer an element (or small block) from each
> of A, B, and D to CPU cache, then do the entire expression, then
> transfer the result back. This is easy to code in Cython/Fortran/C and
> impossible with NumPy/Python.
>
> This is why numexpr/Theano exists.

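To make the point about temporaries concrete, here is a rough sketch of the blocked evaluation described above, written with plain NumPy calls. The function names `naive` and `blocked` and the default block size of 4096 elements are my own assumptions for illustration, and the sketch assumes 1-D floating-point arrays of equal length. It is not how numexpr is actually implemented.

```python
import numpy as np


def naive(A, B, D):
    # Plain NumPy: np.sqrt(D) and A + B each allocate a full-size
    # temporary, so large arrays make several passes over system memory.
    return A + B + np.sqrt(D)


def blocked(A, B, D, block=4096):
    # Hypothetical helper: evaluate A + B + sqrt(D) one small block at a
    # time, so the intermediate values can stay in CPU cache.
    out = np.empty_like(A)
    tmp = np.empty(block, dtype=A.dtype)  # scratch buffer reused per block
    for start in range(0, A.size, block):
        stop = min(start + block, A.size)
        t = tmp[:stop - start]  # view; shorter for the final partial block
        np.sqrt(D[start:stop], out=t)
        np.add(A[start:stop], B[start:stop], out=out[start:stop])
        np.add(out[start:stop], t, out=out[start:stop])
    return out
```

Whether this beats the naive version in practice depends on the block size and on the overhead of the Python-level loop, so treat it as an illustration of the access pattern rather than a tuned implementation. numexpr applies the same idea to a string expression, e.g. numexpr.evaluate("A + B + sqrt(D)").
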
Well, I can't speak for Theano (it is considerably more general than numexpr, and more geared towards using GPUs, right?), but this was certainly the issue that led David Cooke to create numexpr. A more in-depth explanation of this problem can be found at:

http://www.euroscipy.org/talk/1657

which includes some graphical explanations.

-- Francesc Alted

_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
