I will write GEMM and GEMV families of BLAS for Phobos.
- no assembler code
- code based on SIMD instructions
- DMD/LDC/GDC support
- kernel-based architecture, like OpenBLAS
- 85-100% of OpenBLAS FLOPS (OpenBLAS = 100%)
- small, generic code base compared with OpenBLAS
- ability to define user kernels
- allocators support. GEMM requires small internal allocations.
- @nogc nothrow pure template functions (depends on allocator)
- optional multithreading
- ability to work with `Slice` multidimensional arrays when the
stride between elements in a vector is greater than 1. In conventional
BLAS, the matrix stride along either rows or columns is always 1.
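
To make the stride point concrete, here is a minimal sketch of a GEMV (y = alpha*A*x + beta*y) over a matrix whose row and column strides are both arbitrary. `StridedMatrix` and this `gemv` signature are hypothetical names made up for illustration; they are not the `Slice` type and not the planned Phobos API, and the loop is a naive reference version, not an optimized kernel.

```d
// Hypothetical strided matrix view (illustration only, not `Slice`).
struct StridedMatrix(T)
{
    T* ptr;                       // pointer to element (0, 0)
    size_t rows, cols;            // matrix dimensions
    size_t rowStride, colStride;  // distance, in elements, between neighbours

    // Element (i, j) lives at ptr[i * rowStride + j * colStride].
    ref T opIndex(size_t i, size_t j) @nogc nothrow pure
    {
        return ptr[i * rowStride + j * colStride];
    }
}

// Naive reference GEMV: y = alpha * A * x + beta * y.
// x and y are strided vectors: logical element k is x[k * incx].
void gemv(T)(T alpha, StridedMatrix!T a, const(T)[] x, size_t incx,
             T beta, T[] y, size_t incy) @nogc nothrow pure
{
    foreach (i; 0 .. a.rows)
    {
        T acc = 0;
        foreach (j; 0 .. a.cols)
            acc += a[i, j] * x[j * incx];
        y[i * incy] = alpha * acc + beta * y[i * incy];
    }
}

unittest
{
    // A = [[1, 2, 3], [4, 5, 6]], stored with row stride 6 and column stride 2.
    double[] buf = [1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0];
    auto a = StridedMatrix!double(buf.ptr, 2, 3, 6, 2);
    double[] x = [1, 9, 1, 9, 1]; // logical x = [1, 1, 1], incx = 2
    double[] y = [10, 0, 20];     // logical y = [10, 20], incy = 2
    gemv(1.0, a, x, 2, 1.0, y, 2);
    assert(y == [16.0, 0.0, 35.0]);
}
```

A classic BLAS interface can describe only one of the two matrix strides (the leading dimension) and assumes the other is 1, so a view like the one above would have to be copied first.
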
LDC, all targets: very generic D/LLVM IR kernels. AVX/AVX2/AVX-512/NEON
support works out of the box.
DMD/GDC x86    : kernels for 8 XMM registers, based on core.simd
DMD/GDC x86_64 : kernels for 16 XMM registers, based on core.simd
DMD/GDC other  : generic kernels without SIMD instructions.
AVX/AVX2/AVX-512 support for DMD/GDC can be added in the future.
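
As a rough illustration of what a "generic kernel without SIMD instructions" could look like in a kernel-based design, here is a sketch of a register-blocked micro-kernel that updates an MR x NR tile of C from packed panels of A and B. The name `microKernel`, its parameters and the packing layout are assumptions made for this example; they do not describe the actual planned kernels.

```d
// Hypothetical generic micro-kernel (illustration only).
// Computes C[0 .. MR, 0 .. NR] += A_panel * B_panel, where
//   a holds k columns of MR elements each: a[p * MR + i] = A[i][p]
//   b holds k rows of NR elements each:    b[p * NR + j] = B[p][j]
//   c is row-major with leading dimension ldc.
void microKernel(size_t MR, size_t NR, T)(size_t k, const(T)* a,
    const(T)* b, T* c, size_t ldc) @nogc nothrow pure
{
    // Accumulators; with small MR and NR a compiler can keep them in registers.
    T[MR * NR] acc = 0;

    foreach (p; 0 .. k)
        foreach (i; 0 .. MR)
            foreach (j; 0 .. NR)
                acc[i * NR + j] += a[p * MR + i] * b[p * NR + j];

    // Write the finished tile back to C.
    foreach (i; 0 .. MR)
        foreach (j; 0 .. NR)
            c[i * ldc + j] += acc[i * NR + j];
}

unittest
{
    double[] a = [1, 3, 2, 4]; // A = [[1, 2], [3, 4]], packed by columns
    double[] b = [5, 6, 7, 8]; // B = [[5, 6], [7, 8]], packed by rows
    double[] c = [1, 1, 1, 1]; // C starts as all ones
    microKernel!(2, 2)(2, a.ptr, b.ptr, c.ptr, 2);
    assert(c == [20.0, 23.0, 44.0, 51.0]);
}
```

Because MR and NR are compile-time parameters, one template of this kind can be instantiated for different tile sizes, and a user-defined kernel could be swapped in with the same signature.
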
 Anatomy of High-Performance Matrix Multiplication:
 OpenBLAS https://github.com/xianyi/OpenBLAS
Happy New Year!