I will write the GEMM and GEMV families of BLAS for Phobos.

 - code without assembler
 - code based on SIMD instructions
 - DMD/LDC/GDC support
 - kernel-based architecture like OpenBLAS
 - 85-100% of OpenBLAS FLOPS (taking OpenBLAS as 100%)
 - tiny generic code base compared with OpenBLAS
 - ability to define user kernels
 - allocator support. GEMM requires small internal allocations.
 - @nogc nothrow pure template functions (depends on allocator)
 - optional multithreading
 - ability to work with `Slice` multidimensional arrays when the stride between elements in a vector is greater than 1. In common BLAS, the matrix stride between rows or between columns always equals 1.
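
To make the stride point concrete, here is a minimal sketch of a GEMV that takes an explicit stride for every operand. It is only an illustration of the idea: the name `gemvSketch`, its parameter list, and the use of raw pointers instead of `Slice` are my assumptions for this example, not the planned API, and it is a plain reference loop without any kernels or SIMD.

```d
// Hypothetical sketch, not the planned API: y = alpha * A * x + beta * y.
// Every operand carries explicit strides, measured in elements.
void gemvSketch(T)(
    size_t m, size_t n, T alpha,
    const(T)* a, size_t aRowStride, size_t aColStride,
    const(T)* x, size_t xStride,
    T beta, T* y, size_t yStride) @nogc nothrow pure
{
    foreach (i; 0 .. m)
    {
        T acc = 0;
        // Neither matrix stride is required to be 1 here.
        foreach (j; 0 .. n)
            acc += a[i * aRowStride + j * aColStride] * x[j * xStride];
        y[i * yStride] = alpha * acc + beta * y[i * yStride];
    }
}

unittest
{
    double[6] a = [1, 2, 3, 4, 5, 6]; // 2x3 matrix, row-major
    double[6] x = [1, 0, 1, 0, 1, 0]; // logical vector [1, 1, 1], stride 2
    double[2] y = [10, 20];
    gemvSketch!double(2, 3, 1.0, a.ptr, 3, 1, x.ptr, 2, 0.0, y.ptr, 1);
    assert(y[0] == 6 && y[1] == 15);
}
```

The `unittest` block reads `x` with stride 2, which is the case a common BLAS matrix layout cannot express for both matrix dimensions at once.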

Implementation details:
LDC all       : very generic D/LLVM IR kernels. AVX, AVX2, AVX-512 and NEON are supported out of the box.
DMD/GDC x86   : kernels for  8 XMM registers based on core.simd
DMD/GDC x86_64: kernels for 16 XMM registers based on core.simd
DMD/GDC other : generic kernels without SIMD instructions. AVX, AVX2 and AVX-512 support can be added in the future.
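
The split above could be expressed with compile-time `version` blocks, roughly as sketched here. The version identifiers `LDC`, `X86_64` and `X86` are standard D; the enum name `kernelFamily` and its string values are made up for illustration and say nothing about how the real kernels will be organized.

```d
// Hypothetical sketch: pick a kernel family at compile time.
// `kernelFamily` is an illustrative name, not part of any library.
version (LDC)
{
    enum kernelFamily = "generic D/LLVM IR kernels";
}
else version (X86_64)
{
    enum kernelFamily = "core.simd kernels for 16 XMM registers";
}
else version (X86)
{
    enum kernelFamily = "core.simd kernels for 8 XMM registers";
}
else
{
    enum kernelFamily = "generic kernels without SIMD instructions";
}

// Exactly one branch is compiled in, so the choice costs nothing at run time.
static assert(kernelFamily.length > 0);
```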

[1] Anatomy of High-Performance Matrix Multiplication: http://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
[2] OpenBLAS: https://github.com/xianyi/OpenBLAS

