Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)
On 11 Apr 2014 at 19:05, Sturla Molden sturla.mol...@gmail.com wrote:

Sturla Molden sturla.mol...@gmail.com wrote: Making a totally new BLAS might seem like a crazy idea, but it might be the best solution in the long run. To see if this can be done, I'll try to re-implement cblas_dgemm and then benchmark against MKL, Accelerate and OpenBLAS. If I can get the performance better than 75% of their speed, without any assembly or dark [...]

So what percentage of their performance did you achieve so far?

Cheers, Michael
Re: [Numpy-discussion] Dates and times and Datetime64 (again)
On 19.04.2014 09:03, Andreas Hilboll wrote:

On 14.04.2014 20:59, Chris Barker wrote:

On Fri, Apr 11, 2014 at 4:58 PM, Stephan Hoyer sho...@gmail.com wrote:

On Fri, Apr 11, 2014 at 3:56 PM, Charles R Harris charlesr.har...@gmail.com wrote: Are we in a position to start looking at implementation? If so, it would be useful to have a collection of test cases, i.e., typical uses with specified results. That should also cover conversion from/(to?) datetime.datetime.

yup -- tests are always good!

Indeed, my personal wish-list for np.datetime64 is centered much more on robust conversion to/from native date objects, including comparison.

A good use case.

Here are some of my particular points of frustration (apologies for the thread jacking!):

- NaT should have similar behavior to NaN when used for comparisons (i.e., comparisons should always be False).

makes sense.

- You can't compare a datetime object to a datetime64 object.

that would be nice to have.

- datetime64 objects with high precision (e.g., ns) can't compare to datetime objects.

That's a problem, but how do you think it should be handled? My thought is that it should round to microseconds, and then compare -- kind of like comparing float32 and float64...

Pandas has a very nice wrapper around datetime64 arrays that solves most of these issues, but it would be nice to get much of that functionality into core numpy.

yes -- it would -- and learning from pandas is certainly a good idea.

    import numpy as np
    from datetime import datetime

    print np.datetime64('NaT') < np.datetime64('2011-01-01')  # this should not be True
    print datetime(2010, 1, 1) < np.datetime64('2011-01-01')  # raises an exception
    print np.datetime64('2011-01-01T00:00', 'ns') < datetime(2010, 1, 1)  # another exception
    print np.datetime64('2011-01-01T00:00') < datetime(2010, 1, 1)  # finally something works!

now to get them into proper unit tests...

As one further suggestion, I think it would be nice if arithmetic mixing np.datetime64 and datetime.timedelta objects would work:

    np.datetime64('2011-01-01') + datetime.timedelta(1) == np.datetime64('2011-01-02')

And of course, though this is probably in the works anyway, np.asarray([list_of_datetime.datetime_objects]) should work as expected.

One more wish / suggestion from my side (apologies if this isn't the place to make wishes): array-wide access to the individual datetime components should work, i.e., datetime64array.year should yield an array of dtype int with the years. That would allow boolean indexing to filter data: datetime64array[datetime64array.year == 2014] would yield all entries from 2014.

Cheers, -- Andreas.
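A sketch of how those cases might look as pytest-style unit tests. These encode the desired behavior described above, not what numpy currently does, so several of them will fail today; the test names are placeholders:

    import numpy as np
    from datetime import datetime, timedelta

    def test_nat_comparisons_are_false():
        # desired: NaT compares False against everything, like NaN
        assert not (np.datetime64('NaT') < np.datetime64('2011-01-01'))
        assert not (np.datetime64('NaT') == np.datetime64('NaT'))

    def test_datetime_vs_datetime64():
        # desired: mixed comparisons work instead of raising
        assert datetime(2010, 1, 1) < np.datetime64('2011-01-01')

    def test_high_precision_compares_to_datetime():
        # desired: ns-precision values round to microseconds, then compare
        assert np.datetime64('2011-01-01T00:00', 'ns') > datetime(2010, 1, 1)

    def test_timedelta_arithmetic():
        # desired: datetime.timedelta arithmetic on datetime64 works
        assert (np.datetime64('2011-01-01') + timedelta(days=1)
                == np.datetime64('2011-01-02'))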
Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing
On Mon, Apr 28, 2014 at 12:39 AM, Sturla Molden sturla.mol...@gmail.com wrote:

Pauli Virtanen p...@iki.fi wrote: Yes, Windows is the only platform on which Fortran was problematic. OSX is somewhat saner in this respect.

Oh yes, it seems there are official unofficial gfortran binaries available for OSX: http://gcc.gnu.org/wiki/GFortranBinaries#MacOS

I'd be interested to hear if those work well for you. For people who just want to get things working, I would recommend using the gfortran installers recommended at http://scipy.org/scipylib/building/macosx.html#compilers-c-c-fortran-cython. Those work for sure, and alternatives have usually proven to be problematic in the past.

Ralf

Cool :) Sturla
Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing
Ralf Gommers ralf.gomm...@gmail.com wrote: I'd be interested to hear if those work well for you. For people that just want to get things working, I would recommend to use the gfortran installers recommended at http://scipy.org/scipylib/building/macosx.html#compilers-c-c-fortran-cython. Those work for sure, and alternatives have usually proven to be problematic in the past.

No problems thus far, but I only installed it yesterday. :-)

I am not sure gcc-4.2 is needed anymore. Apple has retired it as the platform C compiler on OS X. We need a Fortran compiler that can be used together with clang as the C compiler.

Sturla
Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing
On Mon, Apr 28, 2014 at 6:06 PM, Sturla Molden sturla.mol...@gmail.com wrote:

Ralf Gommers ralf.gomm...@gmail.com wrote: I'd be interested to hear if those work well for you. For people that just want to get things working, I would recommend to use the gfortran installers recommended at http://scipy.org/scipylib/building/macosx.html#compilers-c-c-fortran-cython. Those work for sure, and alternatives have usually proven to be problematic in the past.

No problems thus far, but I only installed it yesterday. :-)

Sounds good. Let's give it a bit more time; once you've given it a good workout we can add to the scipy build instructions that those gfortran 4.8.x compilers seem to work fine.

I am not sure gcc-4.2 is needed anymore. Apple has retired it as the platform C compiler on OS X. We need a Fortran compiler that can be used together with clang as the C compiler.

Clang together with gfortran 4.2 works fine on OS X 10.9.

Ralf
Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing
Ralf Gommers ralf.gomm...@gmail.com wrote: Sounds good. Let's give it a bit more time; once you've given it a good workout we can add to the scipy build instructions that those gfortran 4.8.x compilers seem to work fine.

Yes, it needs to be tested properly.

The build instructions for OS X Mavericks should also mention where to obtain Xcode (the App Store) and the secret command to retrieve the command-line utilities after Xcode is installed:

    $ /usr/bin/xcode-select --install

They should probably also mention how to use alternative BLAS and LAPACK versions (MKL and OpenBLAS), although all three (Accelerate, MKL and OpenBLAS) are about equally performant on Mavericks - except that Accelerate is not fork-safe: https://twitter.com/nedlom/status/437427557919891457

Sturla
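A sketch of what the alternative-BLAS instructions might show: numpy and scipy read a site.cfg placed next to setup.py, and an [openblas] section (the section name follows numpy's site.cfg.example; the paths below assume a locally built OpenBLAS) is enough to link against it:

    [openblas]
    libraries = openblas
    library_dirs = /opt/OpenBLAS/lib
    include_dirs = /opt/OpenBLAS/include

After building, numpy.__config__.show() reports which BLAS/LAPACK actually got picked up.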
[Numpy-discussion] should rint return int?
I notice rint returns float. Shouldn't it return int? It would be useful when float is no longer acceptable as an index. I think conversion to an index using rint is a common idiom.
Re: [Numpy-discussion] Dates and times and Datetime64 (again)
On Fri, Apr 25, 2014 at 4:57 AM, Andreas Hilboll li...@hilboll.de wrote: Array-wide access to the individual datetime components should work, i.e., datetime64array.year should yield an array of dtype int with the years. That would allow boolean indexing to filter data, like datetime64array[datetime64array.year == 2014] would yield all entries from 2014.

That would be nice, yes, but datetime64 doesn't support anything like that at all -- i.e., access to the components, array-wide or not. In this case, you could kludge it with:

    In [19]: datetimearray
    Out[19]: array(['2014-02-03', '2013-03-08', '2012-03-07', '2014-04-06'], dtype='datetime64[D]')

    In [20]: datetimearray[datetimearray.astype('datetime64[Y]') == np.datetime64('2014')]
    Out[20]: array(['2014-02-03', '2014-04-06'], dtype='datetime64[D]')

but that wouldn't work for months, for instance (see the sketch below).

I think the current NEP should stick with simply fixing the timezone thing -- no new functionality of consequence. But maybe it's time for a new NEP for what we want datetime64 to be in the future -- maybe borrow from the blaze proposal cited earlier? Or wait and see how that works out, then maybe port that code over to numpy?

In the meantime, a set of utilities that do the kind of things you're looking for might make sense. You could do it as an ndarray subclass and add those sorts of methods, though ndarray subclasses do get messy.

-Chris

-- Christopher Barker, Ph.D. Oceanographer, Emergency Response Division, NOAA/NOS/ORR, (206) 526-6959 voice, 7600 Sand Point Way NE, (206) 526-6329 fax, Seattle, WA 98115, (206) 526-6317 main reception, chris.bar...@noaa.gov
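For months, a similar kludge does exist, though it takes one more step: casting to 'datetime64[M]' gives integer months since the 1970-01 epoch, so the calendar month falls out with a modulo. A sketch (variable names are made up; the semantics are the numpy 1.7+ datetime64 unit rules):

    import numpy as np

    datetimearray = np.array(['2014-02-03', '2013-03-08', '2012-03-07', '2014-04-06'],
                             dtype='datetime64[D]')

    # integer months since the 1970-01 epoch
    months_since_epoch = datetimearray.astype('datetime64[M]').astype(int)
    month = months_since_epoch % 12 + 1   # calendar month, 1..12
    year = datetimearray.astype('datetime64[Y]').astype(int) + 1970

    print(datetimearray[month == 3])      # all entries from March, any year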
Re: [Numpy-discussion] should rint return int?
On Mon, Apr 28, 2014 at 10:36 AM, Neal Becker ndbeck...@gmail.com wrote: I notice rint returns float. Shouldn't it return int?

AFAICT, rint() is the same as round(), except with slightly different rules for the halfway case. So returning a float makes sense, as round() and ceil() and floor() all do. (Though I've always thought those should return integers, too...)

By the way, what IS the difference between rint and round?

    In [37]: for val in [-2.5, -3.5, 2.5, 3.5]:
       ....:     assert np.rint(val) == np.round(val)
       ....:

    In [38]:

-CHB

-- Christopher Barker, Ph.D. Oceanographer, Emergency Response Division, NOAA/NOS/ORR, (206) 526-6959 voice, 7600 Sand Point Way NE, (206) 526-6329 fax, Seattle, WA 98115, (206) 526-6317 main reception, chris.bar...@noaa.gov
Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing
On Sun, Apr 27, 2014 at 2:46 PM, Matthew Brett matthew.br...@gmail.com wrote: As you know, I'm really hoping it will be possible to make a devkit for Python similar to the Ruby devkits [1].

That would be great! From a really quick glance, it looks like we could almost use the Ruby Devkit, maybe adding a couple of add-ons. What do they do for 64 bit?

-Chris

-- Christopher Barker, Ph.D. Oceanographer, Emergency Response Division, NOAA/NOS/ORR, (206) 526-6959 voice, 7600 Sand Point Way NE, (206) 526-6329 fax, Seattle, WA 98115, (206) 526-6317 main reception, chris.bar...@noaa.gov
Re: [Numpy-discussion] should rint return int?
On Mon, Apr 28, 2014 at 6:36 PM, Neal Becker ndbeck...@gmail.com wrote: I notice rint returns float. Shouldn't it return int? Would be useful when float is no longer acceptable as an index. I think conversion to an index using rint is a common idiom.

C's rint() does not: http://linux.die.net/man/3/rint

This is because there are many integers that are representable as floats/doubles/long doubles that are well outside of the range of any C integer type, e.g. 1e20. Python 3's round() can return a Python int because Python ints are unbounded. Ours aren't.

That said, typically the first thing anyone does with the result of rounding is to coerce it to a native int dtype without any checking. It would not be terrible to have a function that rounds, then coerces to int but checks for overflow and passes that through the numpy error mechanism to be controlled. But it shouldn't be called rint(), which is intended to be as thin a wrapper over the C function as possible.

-- Robert Kern
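A sketch of what such a round-then-coerce helper might look like (the name round_to_int is made up, and a real version would report the failure through numpy's error-state machinery rather than raising directly):

    import numpy as np

    def round_to_int(x, dtype=np.int64):
        # round to nearest (ties to even, like rint), then coerce with a range check
        r = np.rint(np.asarray(x))
        info = np.iinfo(dtype)
        if np.any(np.isnan(r)) or np.any(r < info.min) or np.any(r > info.max):
            raise OverflowError("rounded value out of range for %s" % np.dtype(dtype))
        return r.astype(dtype)

    print(round_to_int([1.5, -2.5, 3.49]))   # [ 2 -2  3]
    round_to_int(1e20)                       # raises OverflowError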
Re: [Numpy-discussion] should rint return int?
Robert Kern wrote: [...] That said, typically the first thing anyone does with the result of rounding is to coerce it to a native int dtype without any checking. It would not be terrible to have a function that rounds, then coerces to int but checks for overflow and passes that through the numpy error mechanism to be controlled. But it shouldn't be called rint(), which is intended to be as thin a wrapper over the C function as possible.

Well, I'd spell it nint, and it would work like:

    def nint(x):
        return int(x + 0.5) if x >= 0 else int(x - 0.5)
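An elementwise variant of the same rule for arrays might look like this (nint_array is a made-up name; np.where applies the same round-half-away-from-zero logic as the scalar nint above, and astype truncates toward zero exactly like int()):

    import numpy as np

    def nint_array(x):
        x = np.asarray(x)
        return np.where(x >= 0, x + 0.5, x - 0.5).astype(np.int64)

    print(nint_array([-2.5, -1.2, 1.2, 2.5]))   # [-3 -1  1  3]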
Re: [Numpy-discussion] should rint return int?
On 4/28/2014 3:29 PM, Neal Becker wrote: Well, I'd spell it nint, and it would work like:

Wouldn't it be simpler to add a dtype argument to `rint`? Or does that violate the simple-wrapper intent?

Alan Isaac
Re: [Numpy-discussion] should rint return int?
On Mon, Apr 28, 2014 at 8:36 PM, Alan G Isaac alan.is...@gmail.com wrote: Wouldn't it be simpler to add a dtype argument to `rint`? Or does that violate the simple-wrapper intent?

`np.rint()` is a ufunc.

-- Robert Kern
Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing
On 28/04/14 18:21, Ralf Gommers wrote: No problems thus far, but I only installed it yesterday. :-) Sounds good. Let's give it a bit more time; once you've given it a good workout we can add to the scipy build instructions that those gfortran 4.8.x compilers seem to work fine.

I have not looked at building SciPy yet, but I was able to build MPICH 3.0.4 from source without a problem. It worked on the first attempt, without any errors or warnings. That is more than I hoped for...

Using BLAS and LAPACK from Accelerate also worked correctly, with the flags -ff2c and -framework Accelerate. I can use it from Python (NumPy) with ctypes and Cython. I get correct results and it does not segfault. (It does segfault without -ff2c, but that is as expected, given that Accelerate has the f2c/g77 ABI.)

I was also able to build OpenBLAS with clang as C compiler and gfortran as Fortran compiler. It works correctly as well (both the build process and the binaries I get).

So far it looks damn good :-) The next step is to build NumPy and SciPy and run some tests :-)

Sturla

P.S. Here is what I did to build MPICH from source, for those interested:

    $ ./configure CC=clang CXX=clang++ F77=gfortran FC=gfortran --enable-fast=all,O3 --with-pm=gforker --prefix=/opt/mpich
    $ make
    $ sudo make install
    $ export PATH=/opt/mpich/bin:$PATH   # actually in ~/.bash_profile

Now testing with some hello worlds:

    $ mpif77 -o hello hello.f
    $ mpiexec -np 4 ./hello
     Hello world
     Hello world
     Hello world
     Hello world
    $ rm hello
    $ mpicc -o hello hello.c
    $ mpiexec -np 4 ./hello
    Hello world from process 0 of 4
    Hello world from process 1 of 4
    Hello world from process 2 of 4
    Hello world from process 3 of 4

The hello world programs looked like this:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello world from process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

          program hello_world
          include 'mpif.h'
          integer ierr
          call MPI_INIT(ierr)
          print *, 'Hello world'
          call MPI_FINALIZE(ierr)
          stop
          end
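For the curious, the OpenBLAS build with clang and gfortran is presumably just the standard Makefile compiler override (a sketch; the CC and FC overrides are documented in the OpenBLAS README, and the install PREFIX is an assumption):

    $ make CC=clang FC=gfortran
    $ make PREFIX=/opt/OpenBLAS install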
Re: [Numpy-discussion] should rint return int?
On 28 Apr 2014 20:22, Robert Kern robert.k...@gmail.com wrote: C's rint() does not: http://linux.die.net/man/3/rint This is because there are many integers that are representable as floats/doubles/long doubles that are well outside of the range of any C integer type, e.g. 1e20.

By the time you have a double holding an integer that isn't representable as an int64, you're well into the range where all doubles are integer-valued but not all integers are representable as doubles. "Round to the nearest integer" is already a pretty semantically weird operation for such values. I'm not sure what the consequences of this are for the discussion, but it seems worth pointing out.

Python 3's round() can return a Python int because Python ints are unbounded. Ours aren't. That said, typically the first thing anyone does with the result of rounding is to coerce it to a native int dtype without any checking. It would not be terrible to have a function that rounds, then coerces to int but checks for overflow and passes that through the numpy error mechanism to be controlled.

It would help here if we had a consistent mechanism for handling integer representability errors :-).

-n
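To put numbers on that range mismatch (a quick illustration; above 2**53 every double is integer-valued, and the gap between adjacent doubles keeps growing):

    import numpy as np

    print(np.iinfo(np.int64).max)   # 9223372036854775807, about 9.2e18
    print(np.spacing(1e20))         # 16384.0 -- gap between adjacent doubles at 1e20
    print(np.rint(1e20))            # 1e+20 -- already integral, rounding is a no-op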
Re: [Numpy-discussion] 64-bit windows numpy / scipy wheels for testing
On Sun, Apr 27, 2014 at 11:50 PM, Matthew Brett matthew.br...@gmail.com wrote:

Aha,

On Sun, Apr 27, 2014 at 3:19 PM, Matthew Brett matthew.br...@gmail.com wrote: Hi, On Sun, Apr 27, 2014 at 3:06 PM, Carl Kleffner cmkleff...@gmail.com wrote: A possible option is to install the toolchain inside site-packages and to deploy it as PyPI wheel or wininst packages. The PATH to the toolchain could be extended during import of the package. But I have no idea what the best strategy is to additionally install ATLAS or other third-party libraries.

Maybe we could provide ATLAS binaries for 32 / 64 bit as part of the devkit package. It sounds like OpenBLAS will be much easier to build, so we could start with ATLAS binaries as a default, expecting OpenBLAS to be built more often with the toolchain. I think that's how numpy binary installers are built at the moment - using old binary builds of ATLAS. I'm happy to provide the builds of ATLAS - e.g. here: https://nipy.bic.berkeley.edu/scipy_installers/atlas_builds

I just found the official numpy binary builds of ATLAS: https://github.com/numpy/vendor/tree/master/binaries But - they are from an old version of ATLAS / LAPACK, and only for 32-bit. David - what say we update these to the latest stable ATLAS?

Cheers, Matthew

Fine by me (not that you need my approval!). How easy is it to build ATLAS targeting a specific CPU these days? I think we need to at least support nosse and sse2 and above.

David
Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)
On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn michael.l...@uni-ulm.de wrote: [...] So what percentage on performance did you achieve so far?

I finally read this paper:

http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf

and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point. They make a compelling argument that BLIS *is* the cleaned up, maintainable, and yet still competitive reimplementation of GotoBLAS/OpenBLAS that we all want, and that getting there required a qualitative reorganization of the code (i.e., very hard to do incrementally). But they've done it. And I get the impression that the stuff they're missing -- threading, cross-platform build stuff, and runtime CPU adaptation -- is all pretty straightforward stuff that is only missing because no-one's gotten around to sitting down and implementing it.

(In particular, that paper does include impressive threading results; it sounds like given a decent thread pool library one could get competitive performance pretty trivially, it's just that they haven't been bothered yet to do thread pools properly or systematically test which of the pretty-good approaches to threading is best. Which is important if your goal is to write papers about BLAS libraries, but irrelevant to reaching minimal-viable-product stage.)

It would be really interesting if someone were to try hacking simple runtime CPU detection into BLIS and see how far you could get -- right now they do kernel selection via the C preprocessor, but hacking in some function pointer thing instead would not be that hard, I think. A maintainable library that builds on Linux/OSX/Windows, gets competitive performance on last-but-one generation x86-64 CPUs, and gets better-than-reference-BLAS performance everywhere else, would be a very, very compelling product that I bet would quickly attract the necessary attention to make it competitive on all CPUs.

-n

-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)
On 29/04/14 01:30, Nathaniel Smith wrote: I finally read this paper: http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf and I have to say that I'm no longer so convinced that OpenBLAS is the right starting point.

I think OpenBLAS in the long run is doomed as an OSS project. Having huge portions of the source in assembly is not sustainable in 2014. OpenBLAS (like GotoBLAS2 before it) runs a high risk of becoming abandonware.

Sturla
Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)
On Tue, Apr 29, 2014 at 12:52 AM, Sturla Molden sturla.mol...@gmail.com wrote: [...] I think OpenBLAS in the long run is doomed as an OSS project. Having huge portions of the source in assembly is not sustainable in 2014. OpenBLAS (like GotoBLAS2 before it) runs a high risk of becoming abandonware.

Have you read the paper I linked? I really recommend it. BLIS is apparently 95% straight-up C, plus a slot where you stick in a tiny CPU-specific super-optimized kernel [1]. So this localizes the nasty stuff to one tiny function, and most of the kernels that have been written so far do in fact use intrinsics [2].

[1] https://code.google.com/p/blis/wiki/KernelsHowTo
[2] https://code.google.com/p/blis/wiki/HardwareSupport

-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)
Hi,

On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith n...@pobox.com wrote: [...] It would be really interesting if someone were to try hacking simple runtime CPU detection into BLIS and see how far you could get -- right now they do kernel selection via the C preprocessor, but hacking in some function pointer thing instead would not be that hard I think. A maintainable library that builds on Linux/OSX/Windows, gets competitive performance on last-but-one generation x86-64 CPUs, and gets better-than-reference-BLAS performance everywhere else, would be a very very compelling product that I bet would quickly attract the necessary attention to make it competitive on all CPUs.

I wonder - is there anyone who might be able to do this work, if we found funding for a couple of months to do it?

Cheers, Matthew
Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)
On 29.04.2014 02:05, Matthew Brett wrote: [...] I wonder - is there anyone who might be able to do this work, if we found funding for a couple of months to do it?

On scipy-dev an interesting BLIS-related message was posted recently:

http://mail.scipy.org/pipermail/scipy-dev/2014-April/019790.html
http://www.cs.utexas.edu/~flame/web/

It seems some of the work of integrating BLIS into a proper BLAS/LAPACK library is already done.
Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)
On Tue, Apr 29, 2014 at 1:10 AM, Julian Taylor jtaylor.deb...@googlemail.com wrote: [...] On scipy-dev an interesting BLIS-related message was posted recently: http://mail.scipy.org/pipermail/scipy-dev/2014-April/019790.html http://www.cs.utexas.edu/~flame/web/ It seems some of the work of integrating BLIS into a proper BLAS/LAPACK library is already done.

BLIS itself ships with a BLAS-compatible interface that you can use with reference LAPACK (just like OpenBLAS). I wouldn't be surprised if there are various annoying Fortran/C ABI hacks remaining to be worked out, but at least in principle BLIS is a BLAS. The problem is that this BLAS has no threading, no runtime configuration (you have to edit a config file and recompile to change CPU support), and no Windows build goop. Basically the authors seem to still be thinking of a BLAS library's target audience as supercomputer sysadmins, not naive end-users.

-n

-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)
Hi,

On Mon, Apr 28, 2014 at 5:10 PM, Julian Taylor jtaylor.deb...@googlemail.com wrote: [...] On scipy-dev an interesting BLIS-related message was posted recently: http://mail.scipy.org/pipermail/scipy-dev/2014-April/019790.html http://www.cs.utexas.edu/~flame/web/ It seems some of the work of integrating BLIS into a proper BLAS/LAPACK library is already done.

Has anyone tried building scipy with libflame yet?

Cheers, Matthew
Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)
On Tue, Apr 29, 2014 at 1:05 AM, Matthew Brett matthew.br...@gmail.com wrote: [...] I wonder - is there anyone who might be able to do this work, if we found funding for a couple of months to do it?

Not much point in worrying about this I think until someone tries a proof of concept. But potentially even the labs working on BLIS would be interested in a small grant from NumFOCUS or something.

-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
Re: [Numpy-discussion] The BLAS problem (was: Re: Wiki page for building numerical stuff on Windows)
Hi,

On Mon, Apr 28, 2014 at 5:50 PM, Nathaniel Smith n...@pobox.com wrote: [...] Not much point in worrying about this I think until someone tries a proof of concept. But potentially even the labs working on BLIS would be interested in a small grant from NumFOCUS or something.

The problem is that the time and mental energy involved in the proof of concept may be enough to prevent it being done, and having some money to pay for time and to placate employers may be useful in overcoming that. To be clear - not me - I will certainly help if I can, but being paid isn't going to help me work on this.

Cheers, Matthew