Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

2014-04-28 Thread Michael Lehn

On 11 Apr 2014, at 19:05, Sturla Molden sturla.mol...@gmail.com wrote:

 Sturla Molden sturla.mol...@gmail.com wrote:
 
 Making a totally new BLAS might seem like a crazy idea, but it might be the
 best solution in the long run. 
 
 To see if this can be done, I'll try to re-implement cblas_dgemm and then
 benchmark against MKL, Accelerate and OpenBLAS. If I can get the
 performance better than 75% of their speed, without any assembly or dark

So what percentage of their performance did you achieve so far?

Cheers,

Michael
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Dates and times and Datetime64 (again)

2014-04-28 Thread Andreas Hilboll
On 19.04.2014 09:03, Andreas Hilboll wrote:
 On 14.04.2014 20:59, Chris Barker wrote:
 On Fri, Apr 11, 2014 at 4:58 PM, Stephan Hoyer sho...@gmail.com
 mailto:sho...@gmail.com wrote:

 On Fri, Apr 11, 2014 at 3:56 PM, Charles R Harris
 charlesr.har...@gmail.com mailto:charlesr.har...@gmail.com wrote:

 Are we in a position to start looking at implementation? If so,
 it would be useful to have a collection of test cases, i.e.,
 typical uses with specified results. That should also cover
 conversion from/(to?) datetime.datetime.


 yup -- tests are always good! 

 Indeed, my personal wish-list for np.datetime64 is centered much
 more on robust conversion to/from native date objects, including
 comparison.


 A good use case. 
  

 Here are some of my particular points of frustration (apologies for
 the thread jacking!):
 - NaT should have similar behavior to NaN when used for comparisons
 (i.e., comparisons should always be False).


 makes sense.
  

 - You can't compare a datetime object to a datetime64 object.


 that would be nice to have.
  

 - datetime64 objects with high precision (e.g., ns) can't compare to
 datetime objects.


 That's a problem, but how do you think it should be handled? My thought
 is that it should round to microseconds, and then compare -- kind of
 like comparing float32 and float64...
  

 Pandas has a very nice wrapper around datetime64 arrays that solves
 most of these issues, but it would be nice to get much of that
 functionality in core numpy,


 yes -- it would -- but learning from pandas is certainly a good idea.


 import numpy as np
 from datetime import datetime

 print np.datetime64('NaT') < np.datetime64('2011-01-01') # this
 should not be true
 print datetime(2010, 1, 1) < np.datetime64('2011-01-01') # raises
 exception
 print np.datetime64('2011-01-01T00:00', 'ns') > datetime(2010, 1, 1)
 # another exception
 print np.datetime64('2011-01-01T00:00') > datetime(2010, 1, 1) #
 finally something works!


 now to get them into proper unit tests
 
 As one further suggestion, I think it would be nice if doing arithmetic
 using np.datetime64 and datetime.timedelta objects would work:
 
 np.datetime64('2011-01-01') + datetime.timedelta(1) ==
 np.datetime64('2011-01-02')
 
 And of course, though this is probably in the pipeline anyway,
 np.asarray([list_of_datetime.datetime_objects]) should work as expected.
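
(For reference, the same arithmetic spelled with np.timedelta64 already
works -- a quick sketch:

    import numpy as np
    d = np.datetime64('2011-01-01')
    print d + np.timedelta64(1, 'D') == np.datetime64('2011-01-02')

which should print True.)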

One more wish / suggestion from my side (apologies if this isn't the
place to make wishes):

Array-wide access to the individual datetime components should work, i.e.,

   datetime64array.year

should yield an array of dtype int with the years.  That would allow
boolean indexing to filter data, like

   datetime64array[datetime64array.year == 2014]

would yield all entries from 2014.
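
In the meantime, something like this gives the year array with current
numpy (a sketch; it relies on datetime64[Y] counting years since 1970):

   import numpy as np
   datetime64array = np.array(['2013-05-01', '2014-02-03'],
                              dtype='datetime64[D]')
   years = datetime64array.astype('datetime64[Y]').astype(int) + 1970
   print datetime64array[years == 2014]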

Cheers,

-- Andreas.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing

2014-04-28 Thread Ralf Gommers
On Mon, Apr 28, 2014 at 12:39 AM, Sturla Molden sturla.mol...@gmail.com wrote:

 Pauli Virtanen p...@iki.fi wrote:

  Yes, Windows is the only platform on which Fortran was problematic. OSX
  is somewhat saner in this respect.

 Oh yes, it seems there are official unofficial gfortran binaries
 available for OSX:

 http://gcc.gnu.org/wiki/GFortranBinaries#MacOS


I'd be interested to hear if those work well for you. For people who just
want to get things working, I would recommend using the gfortran
installers recommended at
http://scipy.org/scipylib/building/macosx.html#compilers-c-c-fortran-cython.
Those work for sure, and alternatives have usually proven to be problematic
in the past.

Ralf


 Cool :)


 Sturla

 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing

2014-04-28 Thread Sturla Molden
Ralf Gommers ralf.gomm...@gmail.com wrote:

 I'd be interested to hear if those work well for you. For people who just
 want to get things working, I would recommend using the gfortran
 installers recommended at
 http://scipy.org/scipylib/building/macosx.html#compilers-c-c-fortran-cython.
 Those work for sure, and alternatives have usually proven to be problematic
 in the past.

No problems thus far, but I only installed it yesterday. :-)

I am not sure gcc-4.2 is needed anymore. Apple has retired it as the
platform C compiler on OS X. We need a Fortran compiler that can be used
together with clang as the C compiler.

Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing

2014-04-28 Thread Ralf Gommers
On Mon, Apr 28, 2014 at 6:06 PM, Sturla Molden sturla.mol...@gmail.com wrote:

 Ralf Gommers ralf.gomm...@gmail.com wrote:

  I'd be interested to hear if those work well for you. For people who just
  want to get things working, I would recommend using the gfortran
  installers recommended at
  http://scipy.org/scipylib/building/macosx.html#compilers-c-c-fortran-cython.
  Those work for sure, and alternatives have usually proven to be problematic
  in the past.

 No problems thus far, but I only installed it yesterday. :-)


Sounds good. Let's give it a bit more time; once you've given it a good
workout we can note in the scipy build instructions that those gfortran
4.8.x compilers seem to work fine.

 I am not sure gcc-4.2 is needed anymore. Apple has retired it as the
 platform C compiler on OS X. We need a Fortran compiler that can be used
 together with clang as the C compiler.


Clang together with gfortran 4.2 works fine on OS X 10.9.

Ralf
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing

2014-04-28 Thread Sturla Molden
Ralf Gommers ralf.gomm...@gmail.com wrote:

 Sounds good. Let's give it a bit more time; once you've given it a good
 workout we can note in the scipy build instructions that those gfortran
 4.8.x compilers seem to work fine.

Yes, it needs to be tested properly.

The build instructions for OS X Mavericks should also mention where to
obtain Xcode (App Store) and the secret command to retrieve the command-line
utils after Xcode is installed:

$ /usr/bin/xcode-select --install

Probably it should also mention how to use alternative BLAS and LAPACK
versions (MKL and OpenBLAS), although all three are equally performant on
Mavericks (except that Accelerate is not fork-safe):

https://twitter.com/nedlom/status/437427557919891457
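
For OpenBLAS that typically means a site.cfg along these lines (a sketch;
the paths are hypothetical and depend on where OpenBLAS is installed):

[openblas]
libraries = openblas
library_dirs = /opt/OpenBLAS/lib
include_dirs = /opt/OpenBLAS/include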

Sturla

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] should rint return int?

2014-04-28 Thread Neal Becker
I notice rint returns float.  Shouldn't it return int?

Would be useful when float is no longer acceptable as an index.  I think
conversion to an index using rint is a common idiom.
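
That is, something like this sketch:

import numpy as np

x = np.array([0.2, 1.7, 3.0])
idx = np.rint(x).astype(int)   # round, then cast before indexing
print idx   # [0 2 3]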

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Dates and times and Datetime64 (again)

2014-04-28 Thread Chris Barker
On Fri, Apr 25, 2014 at 4:57 AM, Andreas Hilboll li...@hilboll.de wrote:

 Array-wide access to the individual datetime components should work, i.e.,


datetime64array.year

 should yield an array of dtype int with the years.  That would allow
 boolean indexing to filter data, like

datetime64array[datetime64array.year == 2014]

 would yield all entries from 2014.


that would be nice, yes, but datetime64 doesn't support anything like that
at all -- i.e. no access to the components, array-wide or otherwise. In this
case, you could kludge it with:

In [19]: datetimearray
Out[19]: array(['2014-02-03', '2013-03-08', '2012-03-07', '2014-04-06'],
dtype='datetime64[D]')

In [20]: datetimearray[datetimearray.astype('datetime64[Y]') ==
np.datetime64('2014')]
Out[20]: array(['2014-02-03', '2014-04-06'], dtype='datetime64[D]')

but that wouldn't work for months, for instance.
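
You can get at months with more astype gymnastics, though -- a sketch,
using the fact that datetime64[M] minus datetime64[Y] gives the month
offset within the year:

In [21]: months = (datetimearray.astype('datetime64[M]') -
   ....:           datetimearray.astype('datetime64[Y]')).astype(int) + 1

In [22]: datetimearray[months == 3]
Out[22]: array(['2013-03-08', '2012-03-07'], dtype='datetime64[D]')

It's hardly pretty.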

I think the current NEP should stick with simply fixing the timezone thing
-- no new functionality of consequence.

But:

Maybe it's time for a new NEP for what we want datetime64 to be in the
future -- maybe borrow from the blaze proposal cited earlier? Or wait and
see how that works out, then maybe port that code over to numpy?

In the meantime, a set of utilities that do the kind of things you're
looking for might make sense. You could do it as an ndarray subclass, and
add those sorts of methods, though ndarray subclasses do get messy...
-Chris




-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR             (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] should rint return int?

2014-04-28 Thread Chris Barker
On Mon, Apr 28, 2014 at 10:36 AM, Neal Becker ndbeck...@gmail.com wrote:

 I notice rint returns float.  Shouldn't it return int?


AFAICT, rint() is the same as round(), except with slightly different rules
for the halfway case. So returning a float makes sense, as round() and
ceil() and floor() all do.

(though I've always thought those should return integers, too...)

By the way, what IS the difference between rint and round?

In [37]: for val in [-2.5, -3.5, 2.5, 3.5]:
   ....:     assert np.rint(val) == np.round(val)
   ....:

In [38]:
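
As far as I can tell both round halves to even, so the visible difference
is just that np.round takes a decimals argument while rint stays a thin
wrapper around C rint:

In [39]: print np.round(2.5), np.rint(2.5)
2.0 2.0

In [40]: print np.round(2.567, 2)
2.57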

-CHB


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR             (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing

2014-04-28 Thread Chris Barker
On Sun, Apr 27, 2014 at 2:46 PM, Matthew Brett matthew.br...@gmail.com wrote:

 As you know, I'm really hoping it will be possible to make a devkit for
 Python similar to the Ruby devkits [1].


That would be great!

From a really quick glance, it looks like we could almost use the Ruby
Devkit, maybe adding a couple of add-ons...

What do they do for 64 bit?

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR             (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] should rint return int?

2014-04-28 Thread Robert Kern
On Mon, Apr 28, 2014 at 6:36 PM, Neal Becker ndbeck...@gmail.com wrote:
 I notice rint returns float.  Shouldn't it return int?

 Would be useful when float is no longer acceptable as an index.  I think
 conversion to an index using rint is a common idiom.

C's rint() does not:

  http://linux.die.net/man/3/rint

This is because there are many integers that are representable as
floats/doubles/long doubles that are well outside of the range of any
C integer type, e.g. 1e20.
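
For instance (a sketch; the exact overflow result of the cast is
platform-dependent):

>>> import numpy as np
>>> int(1e20)                 # Python ints are unbounded
100000000000000000000
>>> np.float64(1e20).astype(np.int64)   # silently overflows
-9223372036854775808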

Python 3's round() can return a Python int because Python ints are
unbounded. Ours aren't.

That said, typically the first thing anyone does with the result of
rounding is to coerce it to a native int dtype without any checking.
It would not be terrible to have a function that rounds, then coerces
to int but checks for overflow and passes that through the numpy error
mechanism to be controlled. But it shouldn't be called rint(), which
is intended to be as thin a wrapper over the C function as possible.
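
A rough sketch of what such a checked function could look like (the name
is made up, and a real version would report through the numpy error
mechanism rather than raise directly):

import numpy as np

def checked_rint(x, dtype=np.int64):
    # round first, then verify the result fits before casting
    r = np.rint(x)
    info = np.iinfo(dtype)
    if np.any((r < info.min) | (r > info.max)):
        raise OverflowError("rounded value does not fit in %s" % np.dtype(dtype))
    return r.astype(dtype)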

-- 
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] should rint return int?

2014-04-28 Thread Neal Becker
Robert Kern wrote:

 On Mon, Apr 28, 2014 at 6:36 PM, Neal Becker ndbeck...@gmail.com wrote:
 I notice rint returns float.  Shouldn't it return int?

 Would be useful when float is no longer acceptable as an index.  I think
 conversion to an index using rint is a common idiom.
 
 C's rint() does not:
 
   http://linux.die.net/man/3/rint
 
 This is because there are many integers that are representable as
 floats/doubles/long doubles that are well outside of the range of any
 C integer type, e.g. 1e20.
 
 Python 3's round() can return a Python int because Python ints are
 unbounded. Ours aren't.
 
 That said, typically the first thing anyone does with the result of
 rounding is to coerce it to a native int dtype without any checking.
 It would not be terrible to have a function that rounds, then coerces
 to int but checks for overflow and passes that through the numpy error
 mechanism to be controlled. But it shouldn't be called rint(), which
 is intended to be as thin a wrapper over the C function as possible.
 

Well I'd spell it nint, and it works like:

def nint(x):
    return int(x + 0.5) if x >= 0 else int(x - 0.5)
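
An array-friendly numpy spelling of the same half-away-from-zero rule
might be (untested):

import numpy as np

def nint(x):
    x = np.asarray(x)
    # round halves away from zero, then cast to a native int dtype
    return np.where(x >= 0, np.floor(x + 0.5), np.ceil(x - 0.5)).astype(int)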

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] should rint return int?

2014-04-28 Thread Alan G Isaac
On 4/28/2014 3:29 PM, Neal Becker wrote:
 Well I'd spell it nint, and it works like:

Wouldn't it be simpler to add a dtype argument to `rint`?
Or does that violate the simple wrapper intent?

Alan Isaac

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] should rint return int?

2014-04-28 Thread Robert Kern
On Mon, Apr 28, 2014 at 8:36 PM, Alan G Isaac alan.is...@gmail.com wrote:
 On 4/28/2014 3:29 PM, Neal Becker wrote:
 Well I'd spell it nint, and it works like:

 Wouldn't it be simpler to add a dtype argument to `rint`?
 Or does that violate the simple wrapper intent?

`np.rint()` is a ufunc.
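
Though ufuncs do accept an out array, so with an explicit unsafe cast
something like this may already do the job (untested):

import numpy as np

x = np.array([1.4, 2.6])
out = np.empty(x.shape, dtype=np.int64)
np.rint(x, out=out, casting='unsafe')   # out should now be array([1, 3])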

-- 
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing

2014-04-28 Thread Sturla Molden
On 28/04/14 18:21, Ralf Gommers wrote:

 No problems thus far, but I only installed it yesterday. :-)


 Sounds good. Let's give it a bit more time; once you've given it a good
 workout we can note in the scipy build instructions that those gfortran
 4.8.x compilers seem to work fine.


I have not looked at building SciPy yet, but I was able to build MPICH
3.0.4 from source without a problem. It worked on the first attempt
without any errors or warnings. That is more than I hoped for...

Using BLAS and LAPACK from Accelerate also worked correctly with the flags
-ff2c and -framework Accelerate. I can use it from Python (NumPy) with
ctypes and Cython. I get correct results and it does not segfault.

(It does segfault without -ff2c, but that is as expected, given that
Accelerate uses the f2c/g77 ABI.)
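
For the curious, the ctypes route is short. A sketch using cblas_ddot,
with Accelerate's framework path hard-coded:

import ctypes
import numpy as np

acc = ctypes.CDLL('/System/Library/Frameworks/Accelerate.framework/Accelerate')
acc.cblas_ddot.restype = ctypes.c_double

x = np.array([1.0, 2.0, 3.0])
ptr = x.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
print acc.cblas_ddot(3, ptr, 1, ptr, 1)   # dot(x, x) -> 14.0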

I was also able to build OpenBLAS with Clang as the C compiler and gfortran
as the Fortran compiler. It works correctly as well (both the build process
and the binaries I get).

So far it looks damn good :-)

The next step is to build NumPy and SciPy and run some tests :-)

Sturla





P.S. Here is what I did to build MPICH from source, for those interested:

$ ./configure CC=clang CXX=clang++ F77=gfortran FC=gfortran \
    --enable-fast=all,O3 --with-pm=gforker --prefix=/opt/mpich
$ make
$ sudo make install

$ export PATH=/opt/mpich/bin:$PATH # actually in ~/.bash_profile

Now testing with some hello worlds:

$ mpif77 -o hello hello.f
$ mpiexec -np 4 ./hello
  Hello world
  Hello world
  Hello world
  Hello world


$ rm hello
$ mpicc -o hello hello.c
$ mpiexec -np 4 ./hello
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4


The hello world programs looked like this:

#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
   int rank, size;
   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &rank);
   MPI_Comm_size (MPI_COMM_WORLD, &size);
   printf("Hello world from process %d of %d\n", rank, size);
   MPI_Finalize();
   return 0;
}

   program hello_world
   include 'mpif.h'
   integer ierr
   call MPI_INIT(ierr)
   print *, 'Hello world'
   call MPI_FINALIZE(ierr)
   stop
   end







___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] should rint return int?

2014-04-28 Thread Nathaniel Smith
On 28 Apr 2014 20:22, Robert Kern robert.k...@gmail.com wrote:
 C's rint() does not:

   http://linux.die.net/man/3/rint

 This is because there are many integers that are representable as
 floats/doubles/long doubles that are well outside of the range of any
 C integer type, e.g. 1e20.

By the time you have a double integer that isn't representable as an int64,
you're well into the range where all doubles are integers but not all
integers are representable as doubles. "Round to the nearest integer" is
already a pretty semantically weird operation for such values.
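
Concretely, up there every double is already an integer, and consecutive
integers start to collide:

>>> 2.0**53 == 2.0**53 + 1
True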

I'm not sure what the consequences of this are for the discussion but it
seems worth pointing out.

 Python 3's round() can return a Python int because Python ints are
 unbounded. Ours aren't.

 That said, typically the first thing anyone does with the result of
 rounding is to coerce it to a native int dtype without any checking.
 It would not be terrible to have a function that rounds, then coerces
 to int but checks for overflow and passes that through the numpy error
 mechanism to be controlled.

It would help here if we had a consistent mechanism for handling integer
representability errors :-).

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing

2014-04-28 Thread David Cournapeau
On Sun, Apr 27, 2014 at 11:50 PM, Matthew Brett matthew.br...@gmail.com wrote:

 Aha,

 On Sun, Apr 27, 2014 at 3:19 PM, Matthew Brett matthew.br...@gmail.com
 wrote:
  Hi,
 
  On Sun, Apr 27, 2014 at 3:06 PM, Carl Kleffner cmkleff...@gmail.com
 wrote:
  A possible option is to install the toolchain inside site-packages and to
  deploy it as a PyPI wheel or wininst package. The PATH to the toolchain
  could be extended during import of the package. But I have no idea what
  the best strategy is to additionally install ATLAS or other third-party
  libraries.
 
  Maybe we could provide ATLAS binaries for 32 / 64 bit as part of the
  devkit package.  It sounds like OpenBLAS will be much easier to build,
  so we could start with ATLAS binaries as a default, expecting OpenBLAS
  to be built more often with the toolchain.  I think that's how numpy
  binary installers are built at the moment - using old binary builds of
  ATLAS.
 
  I'm happy to provide the builds of ATLAS - e.g. here:
 
  https://nipy.bic.berkeley.edu/scipy_installers/atlas_builds

 I just found the official numpy binary builds of ATLAS:

 https://github.com/numpy/vendor/tree/master/binaries

 But - they are from an old version of ATLAS / Lapack, and only for 32-bit.

 David - what say we update these to latest ATLAS stable?


Fine by me (not that you need my approval!).

How easy is it to build ATLAS targeting a specific CPU these days? I
think we need to at least support nosse and sse2 and above.

David


 Cheers,

 Matthew
 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

2014-04-28 Thread Nathaniel Smith
On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn michael.l...@uni-ulm.de wrote:

 On 11 Apr 2014, at 19:05, Sturla Molden sturla.mol...@gmail.com wrote:

 Sturla Molden sturla.mol...@gmail.com wrote:

 Making a totally new BLAS might seem like a crazy idea, but it might be the
 best solution in the long run.

 To see if this can be done, I'll try to re-implement cblas_dgemm and then
 benchmark against MKL, Accelerate and OpenBLAS. If I can get the
 performance better than 75% of their speed, without any assembly or dark

 So what percentage of their performance did you achieve so far?

I finally read this paper:

   http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf

and I have to say that I'm no longer so convinced that OpenBLAS is the
right starting point. They make a compelling argument that BLIS *is*
the cleaned up, maintainable, and yet still competitive
reimplementation of GotoBLAS/OpenBLAS that we all want, and that
getting there required a qualitative reorganization of the code (i.e.,
very hard to do incrementally). But they've done it. And, I get the
impression that the stuff they're missing -- threading, cross-platform
build stuff, and runtime CPU adaptation -- is all pretty
straightforward stuff that is only missing because no-one's gotten
around to sitting down and implementing it. (In particular that paper
does include impressive threading results; it sounds like given a
decent thread pool library one could get competitive performance
pretty trivially, it's just that they haven't been bothered yet to do
thread pools properly or systematically test which of the pretty-good
approaches to threading is best. Which is important if your goal is
to write papers about BLAS libraries but irrelevant to reaching
minimal-viable-product stage.)

It would be really interesting if someone were to try hacking simple
runtime CPU detection into BLIS and see how far you could get -- right
now they do kernel selection via the C preprocessor, but hacking in
some function pointer thing instead would not be that hard I think. A
maintainable library that builds on Linux/OSX/Windows, gets
competitive performance on last-but-one generation x86-64 CPUs, and
gets better-than-reference-BLAS performance everywhere else, would be
a very very compelling product that I bet would quickly attract the
necessary attention to make it competitive on all CPUs.

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

2014-04-28 Thread Sturla Molden
On 29/04/14 01:30, Nathaniel Smith wrote:

 I finally read this paper:

 http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf

 and I have to say that I'm no longer so convinced that OpenBLAS is the
 right starting point.

I think OpenBLAS in the long run is doomed as an OSS project. Having 
huge portions of the source in assembly is not sustainable in 2014. 
OpenBLAS (like GotoBLAS2 before it) runs a high risk of becoming 
abandonware.

Sturla


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

2014-04-28 Thread Nathaniel Smith
On Tue, Apr 29, 2014 at 12:52 AM, Sturla Molden sturla.mol...@gmail.com wrote:
 On 29/04/14 01:30, Nathaniel Smith wrote:

 I finally read this paper:

 http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf

 and I have to say that I'm no longer so convinced that OpenBLAS is the
 right starting point.

 I think OpenBLAS in the long run is doomed as an OSS project. Having
 huge portions of the source in assembly is not sustainable in 2014.
 OpenBLAS (like GotoBLAS2 before it) runs a high risk of becoming
 abandonware.

Have you read the paper I linked? I really recommend it. BLIS is
apparently 95% straight-up-C, plus a slot where you stick in a tiny
CPU-specific super-optimized kernel [1]. So this localizes the nasty
stuff to one tiny function, plus most of the kernels that have been
written so far do in fact use intrinsics [2].

[1] https://code.google.com/p/blis/wiki/KernelsHowTo
[2] https://code.google.com/p/blis/wiki/HardwareSupport

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

2014-04-28 Thread Matthew Brett
Hi,

On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith n...@pobox.com wrote:
 On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn michael.l...@uni-ulm.de 
 wrote:

 On 11 Apr 2014, at 19:05, Sturla Molden sturla.mol...@gmail.com wrote:

 Sturla Molden sturla.mol...@gmail.com wrote:

 Making a totally new BLAS might seem like a crazy idea, but it might be the
 best solution in the long run.

 To see if this can be done, I'll try to re-implement cblas_dgemm and then
 benchmark against MKL, Accelerate and OpenBLAS. If I can get the
 performance better than 75% of their speed, without any assembly or dark

 So what percentage of their performance did you achieve so far?

 I finally read this paper:

http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf

 and I have to say that I'm no longer so convinced that OpenBLAS is the
 right starting point. They make a compelling argument that BLIS *is*
 the cleaned up, maintainable, and yet still competitive
 reimplementation of GotoBLAS/OpenBLAS that we all want, and that
 getting there required a qualitative reorganization of the code (i.e.,
 very hard to do incrementally). But they've done it. And, I get the
 impression that the stuff they're missing -- threading, cross-platform
 build stuff, and runtime CPU adaptation -- is all pretty
 straightforward stuff that is only missing because no-one's gotten
 around to sitting down and implementing it. (In particular that paper
 does include impressive threading results; it sounds like given a
 decent thread pool library one could get competitive performance
 pretty trivially, it's just that they haven't been bothered yet to do
 thread pools properly or systematically test which of the pretty-good
 approaches to threading is best. Which is important if your goal is
 to write papers about BLAS libraries but irrelevant to reaching
 minimal-viable-product stage.)

 It would be really interesting if someone were to try hacking simple
 runtime CPU detection into BLIS and see how far you could get -- right
 now they do kernel selection via the C preprocessor, but hacking in
 some function pointer thing instead would not be that hard I think. A
 maintainable library that builds on Linux/OSX/Windows, gets
 competitive performance on last-but-one generation x86-64 CPUs, and
 gets better-than-reference-BLAS performance everywhere else, would be
 a very very compelling product that I bet would quickly attract the
 necessary attention to make it competitive on all CPUs.

I wonder - is there anyone who might be able to do this work, if we
found funding for a couple of months to do it?

Cheers,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

2014-04-28 Thread Julian Taylor
On 29.04.2014 02:05, Matthew Brett wrote:
 Hi,
 
 On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith n...@pobox.com wrote:
 On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn michael.l...@uni-ulm.de 
 wrote:

 On 11 Apr 2014, at 19:05, Sturla Molden sturla.mol...@gmail.com wrote:

 Sturla Molden sturla.mol...@gmail.com wrote:

 Making a totally new BLAS might seem like a crazy idea, but it might be 
 the
 best solution in the long run.

 To see if this can be done, I'll try to re-implement cblas_dgemm and then
 benchmark against MKL, Accelerate and OpenBLAS. If I can get the
 performance better than 75% of their speed, without any assembly or dark

 So what percentage of their performance did you achieve so far?

 I finally read this paper:

http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf

 and I have to say that I'm no longer so convinced that OpenBLAS is the
 right starting point. They make a compelling argument that BLIS *is*
 the cleaned up, maintainable, and yet still competitive
 reimplementation of GotoBLAS/OpenBLAS that we all want, and that
 getting there required a qualitative reorganization of the code (i.e.,
 very hard to do incrementally). But they've done it. And, I get the
 impression that the stuff they're missing -- threading, cross-platform
 build stuff, and runtime CPU adaptation -- is all pretty
 straightforward stuff that is only missing because no-one's gotten
 around to sitting down and implementing it. (In particular that paper
 does include impressive threading results; it sounds like given a
 decent thread pool library one could get competitive performance
 pretty trivially, it's just that they haven't been bothered yet to do
 thread pools properly or systematically test which of the pretty-good
 approaches to threading is best. Which is important if your goal is
 to write papers about BLAS libraries but irrelevant to reaching
 minimal-viable-product stage.)

 It would be really interesting if someone were to try hacking simple
 runtime CPU detection into BLIS and see how far you could get -- right
 now they do kernel selection via the C preprocessor, but hacking in
 some function pointer thing instead would not be that hard I think. A
 maintainable library that builds on Linux/OSX/Windows, gets
 competitive performance on last-but-one generation x86-64 CPUs, and
 gets better-than-reference-BLAS performance everywhere else, would be
 a very very compelling product that I bet would quickly attract the
 necessary attention to make it competitive on all CPUs.
 
 I wonder - is there anyone who might be able to do this work, if we
 found funding for a couple of months to do it?
 

On scipy-dev an interesting BLIS-related message was posted recently:
http://mail.scipy.org/pipermail/scipy-dev/2014-April/019790.html
http://www.cs.utexas.edu/~flame/web/

It seems some of the work of integrating BLIS into a proper BLAS/LAPACK
library has already been done.
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

2014-04-28 Thread Nathaniel Smith
On Tue, Apr 29, 2014 at 1:10 AM, Julian Taylor
jtaylor.deb...@googlemail.com wrote:
 On 29.04.2014 02:05, Matthew Brett wrote:
 Hi,

 On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith n...@pobox.com wrote:
 It would be really interesting if someone were to try hacking simple
 runtime CPU detection into BLIS and see how far you could get -- right
 now they do kernel selection via the C preprocessor, but hacking in
 some function pointer thing instead would not be that hard I think. A
 maintainable library that builds on Linux/OSX/Windows, gets
 competitive performance on last-but-one generation x86-64 CPUs, and
 gets better-than-reference-BLAS performance everywhere else, would be
 a very very compelling product that I bet would quickly attract the
 necessary attention to make it competitive on all CPUs.

 I wonder - is there anyone who might be able to do this work, if we
 found funding for a couple of months to do it?

 On scipy-dev an interesting BLIS-related message was posted recently:
 http://mail.scipy.org/pipermail/scipy-dev/2014-April/019790.html
 http://www.cs.utexas.edu/~flame/web/

 It seems some of the work of integrating BLIS into a proper BLAS/LAPACK
 library has already been done.

BLIS itself ships with a BLAS-compatible interface that you can use
with reference LAPACK (just like OpenBLAS). I wouldn't be surprised if
there are various annoying Fortran/C ABI hacks remaining to be worked
out, but at least in principle BLIS is a BLAS. The problem is that
this BLAS has no threading, no runtime configuration (you have to edit a
config file and recompile to change CPU support), and no Windows build
goop. Basically the authors seem to still be thinking of a BLAS
library's target audience as being supercomputer sysadmins, not naive
end-users.

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

2014-04-28 Thread Matthew Brett
Hi,

On Mon, Apr 28, 2014 at 5:10 PM, Julian Taylor
jtaylor.deb...@googlemail.com wrote:
 On 29.04.2014 02:05, Matthew Brett wrote:
 Hi,

 On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith n...@pobox.com wrote:
 On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn michael.l...@uni-ulm.de 
 wrote:

 On 11 Apr 2014, at 19:05, Sturla Molden sturla.mol...@gmail.com wrote:

 Sturla Molden sturla.mol...@gmail.com wrote:

 Making a totally new BLAS might seem like a crazy idea, but it might be 
 the
 best solution in the long run.

 To see if this can be done, I'll try to re-implement cblas_dgemm and then
 benchmark against MKL, Accelerate and OpenBLAS. If I can get the
 performance better than 75% of their speed, without any assembly or dark

 So what percentage of their performance did you achieve so far?

 I finally read this paper:

http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf

 and I have to say that I'm no longer so convinced that OpenBLAS is the
 right starting point. They make a compelling argument that BLIS *is*
 the cleaned up, maintainable, and yet still competitive
 reimplementation of GotoBLAS/OpenBLAS that we all want, and that
 getting there required a qualitative reorganization of the code (i.e.,
 very hard to do incrementally). But they've done it. And, I get the
 impression that the stuff they're missing -- threading, cross-platform
 build stuff, and runtime CPU adaptation -- is all pretty
 straightforward stuff that is only missing because no-one's gotten
 around to sitting down and implementing it. (In particular that paper
 does include impressive threading results; it sounds like given a
 decent thread pool library one could get competitive performance
 pretty trivially, it's just that they haven't been bothered yet to do
 thread pools properly or systematically test which of the pretty-good
 approaches to threading is best. Which is important if your goal is
 to write papers about BLAS libraries but irrelevant to reaching
 minimal-viable-product stage.)

 It would be really interesting if someone were to try hacking simple
 runtime CPU detection into BLIS and see how far you could get -- right
 now they do kernel selection via the C preprocessor, but hacking in
 some function pointer thing instead would not be that hard I think. A
 maintainable library that builds on Linux/OSX/Windows, gets
 competitive performance on last-but-one generation x86-64 CPUs, and
 gets better-than-reference-BLAS performance everywhere else, would be
 a very very compelling product that I bet would quickly attract the
 necessary attention to make it competitive on all CPUs.

 I wonder - is there anyone who might be able to do this work, if we
 found funding for a couple of months to do it?


 On scipy-dev an interesting BLIS-related message was posted recently:
 http://mail.scipy.org/pipermail/scipy-dev/2014-April/019790.html
 http://www.cs.utexas.edu/~flame/web/

 It seems some of the work of integrating BLIS into a proper BLAS/LAPACK
 library has already been done.

Has anyone tried building scipy with libflame yet?

Cheers,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

2014-04-28 Thread Nathaniel Smith
On Tue, Apr 29, 2014 at 1:05 AM, Matthew Brett matthew.br...@gmail.com wrote:
 Hi,

 On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith n...@pobox.com wrote:
 On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn michael.l...@uni-ulm.de 
 wrote:

 On 11 Apr 2014, at 19:05, Sturla Molden sturla.mol...@gmail.com wrote:

 Sturla Molden sturla.mol...@gmail.com wrote:

 Making a totally new BLAS might seem like a crazy idea, but it might be 
 the
 best solution in the long run.

 To see if this can be done, I'll try to re-implement cblas_dgemm and then
 benchmark against MKL, Accelerate and OpenBLAS. If I can get the
 performance better than 75% of their speed, without any assembly or dark

 So what percentage of their performance did you achieve so far?

 I finally read this paper:

http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf

 and I have to say that I'm no longer so convinced that OpenBLAS is the
 right starting point. They make a compelling argument that BLIS *is*
 the cleaned up, maintainable, and yet still competitive
 reimplementation of GotoBLAS/OpenBLAS that we all want, and that
 getting there required a qualitative reorganization of the code (i.e.,
 very hard to do incrementally). But they've done it. And, I get the
 impression that the stuff they're missing -- threading, cross-platform
 build stuff, and runtime CPU adaptation -- is all pretty
 straightforward stuff that is only missing because no-one's gotten
 around to sitting down and implementing it. (In particular that paper
 does include impressive threading results; it sounds like given a
 decent thread pool library one could get competitive performance
 pretty trivially, it's just that they haven't been bothered yet to do
 thread pools properly or systematically test which of the pretty-good
 approaches to threading is best. Which is important if your goal is
 to write papers about BLAS libraries but irrelevant to reaching
 minimal-viable-product stage.)

 It would be really interesting if someone were to try hacking simple
 runtime CPU detection into BLIS and see how far you could get -- right
 now they do kernel selection via the C preprocessor, but hacking in
 some function pointer thing instead would not be that hard I think. A
 maintainable library that builds on Linux/OSX/Windows, gets
 competitive performance on last-but-one generation x86-64 CPUs, and
 gets better-than-reference-BLAS performance everywhere else, would be
 a very very compelling product that I bet would quickly attract the
 necessary attention to make it competitive on all CPUs.

 I wonder - is there anyone who might be able to do this work, if we
 found funding for a couple of months to do it?

Not much point in worrying about this I think until someone tries a
proof of concept. But potentially even the labs working on BLIS would
be interested in a small grant from NumFOCUS or something.

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)

2014-04-28 Thread Matthew Brett
Hi,

On Mon, Apr 28, 2014 at 5:50 PM, Nathaniel Smith n...@pobox.com wrote:
 On Tue, Apr 29, 2014 at 1:05 AM, Matthew Brett matthew.br...@gmail.com 
 wrote:
 Hi,

 On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith n...@pobox.com wrote:
 On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn michael.l...@uni-ulm.de 
 wrote:

 On 11 Apr 2014, at 19:05, Sturla Molden sturla.mol...@gmail.com wrote:

 Sturla Molden sturla.mol...@gmail.com wrote:

 Making a totally new BLAS might seem like a crazy idea, but it might be 
 the
 best solution in the long run.

 To see if this can be done, I'll try to re-implement cblas_dgemm and then
 benchmark against MKL, Accelerate and OpenBLAS. If I can get the
 performance better than 75% of their speed, without any assembly or dark

 So what percentage of their performance did you achieve so far?

 I finally read this paper:

http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf

 and I have to say that I'm no longer so convinced that OpenBLAS is the
 right starting point. They make a compelling argument that BLIS *is*
 the cleaned up, maintainable, and yet still competitive
 reimplementation of GotoBLAS/OpenBLAS that we all want, and that
 getting there required a qualitative reorganization of the code (i.e.,
 very hard to do incrementally). But they've done it. And, I get the
 impression that the stuff they're missing -- threading, cross-platform
 build stuff, and runtime CPU adaptation -- is all pretty
 straightforward stuff that is only missing because no-one's gotten
 around to sitting down and implementing it. (In particular that paper
 does include impressive threading results; it sounds like given a
 decent thread pool library one could get competitive performance
 pretty trivially, it's just that they haven't been bothered yet to do
 thread pools properly or systematically test which of the pretty-good
 approaches to threading is best. Which is important if your goal is
 to write papers about BLAS libraries but irrelevant to reaching
 minimal-viable-product stage.)

 It would be really interesting if someone were to try hacking simple
 runtime CPU detection into BLIS and see how far you could get -- right
 now they do kernel selection via the C preprocessor, but hacking in
 some function pointer thing instead would not be that hard I think. A
 maintainable library that builds on Linux/OSX/Windows, gets
 competitive performance on last-but-one generation x86-64 CPUs, and
 gets better-than-reference-BLAS performance everywhere else, would be
 a very very compelling product that I bet would quickly attract the
 necessary attention to make it competitive on all CPUs.

 I wonder - is there anyone who might be able to do this work, if we
 found funding for a couple of months to do it?

 Not much point in worrying about this I think until someone tries a
 proof of concept. But potentially even the labs working on BLIS would
 be interested in a small grant from NumFOCUS or something.

The problem is that the time and mental energy involved in the
proof-of-concept may be enough to prevent it from being done, and having
some money to pay for time and to placate employers may be useful in
overcoming that.

To be clear - not me - I will certainly help if I can, but being paid
isn't going to help me work on this.

Cheers,

Matthew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion