First of all, awesome work. It's great to see that it's possible to match or even exceed the performance of hand-crafted assembly implementations with generic code.

I would suggest adding more information on how the Eigen results were obtained. Unlike OpenBLAS, Eigen's performance often varies by compiler, and it varies greatly depending on which preprocessor macros are defined. In particular, EIGEN_NO_DEBUG is not defined by default (unless NDEBUG is), and leaving Eigen's assertions enabled reduces performance; EIGEN_FAST_MATH can often increase performance; and EIGEN_STACK_ALLOCATION_LIMIT matters greatly for performance on very small matrices (where MKL and especially OpenBLAS are very inefficient). It's been a while since I've used Eigen, so I may have forgotten one or two.
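
To make the suggestion concrete, here is a rough sketch of a preamble that a benchmark source file could use to pin these macros and report them alongside the timings. This is not how the post's Eigen numbers were produced (I don't know that). The function name describe_eigen_config, the HAVE_EIGEN helper macro and the 131072-byte limit are made up for illustration, and the __has_include guard (C++17) is only there so the file still compiles on a machine without Eigen installed.

```cpp
// Define the Eigen macros BEFORE the first Eigen include, otherwise they
// have no effect on the headers. The values here are illustrative only.
#define EIGEN_NO_DEBUG                       // turn off Eigen's assertions
#define EIGEN_STACK_ALLOCATION_LIMIT 131072  // bytes; hypothetical choice

#include <sstream>
#include <string>

// Pull in Eigen only if the headers are available (assumption: C++17).
#if defined(__has_include)
#  if __has_include(<Eigen/Core>)
#    include <Eigen/Core>
#    define HAVE_EIGEN 1  // hypothetical helper macro, not part of Eigen
#  endif
#endif

// Build a short text report of the settings that affect Eigen benchmarks,
// suitable for printing next to the measured timings.
std::string describe_eigen_config() {
    std::ostringstream out;
#ifdef HAVE_EIGEN
    out << "Eigen headers: found\n";
#else
    out << "Eigen headers: not found\n";
#endif
#ifdef __VERSION__
    out << "Compiler: " << __VERSION__ << '\n';  // GCC/Clang version string
#endif
#ifdef EIGEN_NO_DEBUG
    out << "EIGEN_NO_DEBUG: defined\n";
#else
    out << "EIGEN_NO_DEBUG: not defined\n";
#endif
#ifdef EIGEN_FAST_MATH
    out << "EIGEN_FAST_MATH: " << EIGEN_FAST_MATH << '\n';
#else
    out << "EIGEN_FAST_MATH: not defined\n";
#endif
    out << "EIGEN_STACK_ALLOCATION_LIMIT: " << EIGEN_STACK_ALLOCATION_LIMIT << '\n';
    return out.str();
}
```

Printing the result of describe_eigen_config() at the start of a benchmark run, together with the compiler flags used, would let readers reproduce the Eigen numbers or at least see which configuration they correspond to.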

It may also be worth noting in the blog post that these are all single-threaded comparisons and that multithreaded implementations are on the way. This is obvious to anyone who has followed the development of Mir, but a general audience on Reddit will likely point it out as a deficiency unless it is stated upfront.
