Camm Maguire wrote:
> Greetings, and thanks for your work on the blas!

Thanks, but it's really not my work; Kazushige Goto deserves the credit. I
just spent a couple of hours plugging it in.

> Have you by chance timed these routines against the atlas package blas? Atlas
> automatically tunes the blas for your particular hardware, and is open source.

No, but I'll give it a try. I suspect the Goto routines (dgemm in particular)
will be at least twice as fast, because he does some really interesting
unrolling things, and it's written in very tight assembler. You should see the
inner loop of the gemm routine: it has exactly one add, one multiply, and two
load/store instructions per clock cycle, hand-unrolled to 32 cycles, which
totally maxes out the Alpha's theoretical performance. (I'm not an assembler
programmer; I just know a little bit about the chip's operation, and read
through his code once a while back. BTW, the code is so well documented, it's
worth the read even if you have no interest whatsoever in Alpha assembly.)

It's *extremely* difficult, probably impossible, to get near this level of
performance from compiled code, even using Compaq's new FORTRAN Linux
compilers. In fact, his routine is twice as fast as the original assembler
BLAS routines shipped by Digital!! Compaq has included his routines in their
CXML library, which ships with their Linux compilers (with the author's
permission for a GPL exception, of course). There are Debian packages for the
Q compilers and libs, but they're very non-free.

At some point I'm going to check performance of this enhanced GPL BLAS package
against CXML, and also try compiling the regular BLAS with Q compilers, to see
whether the free libs (they're still free if compiled by a non-free compiler,
right?) are as fast as the non-free CXML.
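
As a rough illustration of how such a timing comparison is usually expressed, here is a minimal Python sketch. It is not taken from the Goto, ATLAS, or CXML code: the function names are hypothetical, the triple loop merely stands in for the dgemm call a real benchmark would time, and the 500 MHz clock is an illustrative assumption rather than a figure from this thread. It counts 2*m*n*k floating-point operations per multiply (one multiply plus one add per inner-loop step) and derives a theoretical peak from the "one add, one multiply per clock cycle" description above.

```python
import time

def gemm_flops(m, n, k):
    """Floating-point operations in C = A*B with A m-by-k and B k-by-n:
    one multiply and one add per inner-loop step."""
    return 2 * m * n * k

def peak_mflops(clock_mhz, flops_per_cycle=2):
    """Theoretical peak if the chip retires flops_per_cycle every clock
    (one add + one multiply = 2, as described for the inner loop)."""
    return clock_mhz * flops_per_cycle

def naive_gemm(a, b):
    """Reference triple-loop multiply on lists of lists; a stand-in for
    the BLAS dgemm call that a real benchmark would time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            aip = a[i][p]
            for j in range(n):
                c[i][j] += aip * b[p][j]
    return c

def measure_mflops(n=64):
    """Time one n-by-n multiply and convert the result to MFlops."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    start = time.perf_counter()
    naive_gemm(a, b)
    elapsed = time.perf_counter() - start
    return gemm_flops(n, n, n) / elapsed / 1e6

if __name__ == "__main__":
    print("measured: %.1f MFlops" % measure_mflops())
    # 500 MHz is a hypothetical clock speed chosen only for illustration.
    print("peak at 500 MHz: %d MFlops" % peak_mflops(500))
```

Dividing the measured figure by the peak gives the fraction of theoretical performance a routine reaches, which is the number that makes results from different libraries and chips comparable.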

I'll also at some point try to make a deb of Goto and Joachim Wesner's Free
Fast Math lib for Alpha, which is not quite as fast as CPML, but (is free and)
includes "vectorized" versions of sqrt, cos, sin which are substantially
faster than looped calls to CPML (I think he said something like 12 clocks per
argument, but I don't remember which function that was).

> And what's better, it has hooks allowing the user to provide a small number of
> kernel routines to time against the others and possibly include in the
> finished library. In fact, a few others and myself are working on that right
> now for the PIII, using the kni and prefetch x86 extensions. We're seeing
> significant gains with these instructions, and hope to contribute them to
> atlas soon.

Sounds interesting. A bit beyond my understanding, but if there's significant
vectorization, could it approach Alpha's 800 MFlops performance? How many
add/mul double-precision floating units are there on PIII?

> The interested reader can check out
>
> http://master.debian.org/~camm/kniblas.tgz
> http://master.debian.org/~camm/mblasnt.ps.gz
> http://master.debian.org/~camm/mblasger.ps.gz
> http://master.debian.org/~camm/mblast.ps.gz

Thanks very much, I'll give these a try (probably in a week or two) and report
a comparison, also vs. the non-free Q libs.

BTW, I haven't heard a reply to the legal question yet: "This software is in
the public domain" is GPL-compatible, right? If not, I'll have to take down
the patches and debs. :-(

Zeen,
--
Adam Powell                                 http://lyre.mit.edu/~powell/
Thomas B. King Assistant Professor of Materials Engineering
77 Massachusetts Ave. Rm. 4-117             Phone (617) 452-2086
Cambridge, MA 02139 USA                     Fax   (617) 253-5418

