The short answer is the LINPACK benchmark. For years computers, especially supercomputers, have been benchmarked on how fast they solve a dense linear system, a calculation dominated by the LU factorization. As a result, the code actually used for it in LAPACK and the various accelerated BLAS implementations is highly tuned.
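For comparison, a hand-written LU factorization usually looks something like the textbook triple loop below. This is only a sketch in plain Python (chosen for illustration, not the poster's actual code), and the function name `lu_partial_pivot` is made up for this example. Tuned LAPACK routines compute the same factorization but use blocked algorithms that hand most of the work to optimized matrix-matrix BLAS kernels, which is where much of the speed difference tends to come from.

```python
def lu_partial_pivot(a):
    """Textbook LU factorization with partial pivoting (hypothetical helper).

    Returns (perm, lu). Row i of the permuted matrix is a[perm[i]].
    lu packs both factors: the strict lower triangle holds L (its unit
    diagonal is implied) and the upper triangle holds U.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy, leave the input untouched
    perm = list(range(n))
    for k in range(n):
        # Choose the row with the largest entry in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        # Eliminate column k below the diagonal, one scalar at a time.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]
    return perm, lu
```

The innermost loop touches one scalar at a time and makes no attempt to reuse cache, so even a compiled version of this loop nest would be expected to trail a tuned library on large matrices.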
