Hi,

On Mon, Apr 28, 2014 at 4:30 PM, Nathaniel Smith <n...@pobox.com> wrote:
> On Mon, Apr 28, 2014 at 11:25 AM, Michael Lehn <michael.l...@uni-ulm.de> wrote:
>>
>> On 11 Apr 2014 at 19:05, Sturla Molden <sturla.mol...@gmail.com> wrote:
>>
>>> Sturla Molden <sturla.mol...@gmail.com> wrote:
>>>
>>>> Making a totally new BLAS might seem like a crazy idea, but it might be the
>>>> best solution in the long run.
>>>
>>> To see if this can be done, I'll try to re-implement cblas_dgemm and then
>>> benchmark against MKL, Accelerate and OpenBLAS. If I can get the
>>> performance better than 75% of their speed, without any assembly or dark
>>
>> So what percentage on performance did you achieve so far?
>
> I finally read this paper:
>
> http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev2.pdf
>
> and I have to say that I'm no longer so convinced that OpenBLAS is the
> right starting point. They make a compelling argument that BLIS *is*
> the cleaned up, maintainable, and yet still competitive
> reimplementation of GotoBLAS/OpenBLAS that we all want, and that
> getting there required a qualitative reorganization of the code (i.e.,
> very hard to do incrementally). But they've done it. And, I get the
> impression that the stuff they're missing -- threading, cross-platform
> build stuff, and runtime CPU adaptation -- is all pretty
> straightforward stuff that is only missing because no-one's gotten
> around to sitting down and implementing it. (In particular that paper
> does include impressive threading results; it sounds like given a
> decent thread pool library one could get competitive performance
> pretty trivially, it's just that they haven't been bothered yet to do
> thread pools properly or systematically test which of the pretty-good
> approaches to threading is "best". Which is important if your goal is
> to write papers about BLAS libraries but irrelevant to reaching
> minimal-viable-product stage.)
>
> It would be really interesting if someone were to try hacking simple
> runtime CPU detection into BLIS and see how far you could get -- right
> now they do kernel selection via the C preprocessor, but hacking in
> some function pointer thing instead would not be that hard I think. A
> maintainable library that builds on Linux/OSX/Windows, gets
> competitive performance on last-but-one generation x86-64 CPUs, and
> gets better-than-reference-BLAS performance everywhere else, would be
> a very very compelling product that I bet would quickly attract the
> necessary attention to make it competitive on all CPUs.

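For illustration, the "function pointer thing" suggested above might look roughly like the C sketch below. This is not BLIS code: the names `gemm_kernel_fn`, `gemm_kernel_generic`, `gemm_kernel_avx2`, `select_gemm_kernel` and `my_dgemm` are all hypothetical, the "AVX2" kernel is only a placeholder that forwards to the generic loop, and the CPU check assumes the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports` on x86. On any other compiler or architecture the sketch falls back to the generic kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature: C (m x n) += A (m x k) * B (k x n),
 * all matrices dense and row-major. */
typedef void (*gemm_kernel_fn)(size_t m, size_t n, size_t k,
                               const double *a, const double *b, double *c);

/* Portable fallback kernel: a plain triple loop. */
static void gemm_kernel_generic(size_t m, size_t n, size_t k,
                                const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t p = 0; p < k; p++) {
            double aip = a[i * k + p];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aip * b[p * n + j];
        }
    }
}

/* Placeholder for a tuned kernel. A real one would use AVX2 intrinsics
 * or assembly; here it forwards to the generic loop so the sketch
 * stays portable. */
static void gemm_kernel_avx2(size_t m, size_t n, size_t k,
                             const double *a, const double *b, double *c)
{
    gemm_kernel_generic(m, n, k, a, b, c);
}

/* Pick a kernel at run time instead of via the preprocessor at build
 * time. Assumption: GCC/Clang builtins on x86; anything else gets the
 * generic kernel. */
static gemm_kernel_fn select_gemm_kernel(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return gemm_kernel_avx2;
#endif
    return gemm_kernel_generic;
}

/* Selected lazily on first call. Not thread-safe as written; a real
 * library would initialise this once under a lock or at load time. */
static gemm_kernel_fn gemm_kernel = NULL;

void my_dgemm(size_t m, size_t n, size_t k,
              const double *a, const double *b, double *c)
{
    if (gemm_kernel == NULL)
        gemm_kernel = select_gemm_kernel();
    assert(gemm_kernel != NULL);
    gemm_kernel(m, n, k, a, b, c);
}
```

Callers only ever see `my_dgemm`; which kernel runs is decided once, on the machine the code is actually running on, so a single binary could carry several kernels.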
I wonder - is there anyone who might be able to do this work, if we
found funding for a couple of months to do it?

Cheers,

Matthew
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion