https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51119
--- Comment #20 from Joost VandeVondele <Joost.VandeVondele at mat dot ethz.ch> ---
(In reply to Jerry DeLisle from comment #19)
> If I can get something working I am thinking something like
> -fexternal-blas-n, if -n not given then default to current libblas
> behaviour. This way users have some control. With GPUs, it is not unusual to
> have hundreds of cores. We can also, at run time, see if the opencl is
> already initialized which may mean used elsewhere so don't mess with it.

Hiding this behind a -fexternal-blas-n switch might be an option. Including GPUs seems even a tad more tricky. We have a paper on GPU (small) matrix multiplication:
http://dbcsr.cp2k.org/_media/gpu_book_chapter_submitted.pdf

BTW, another interesting project is the libxsmm library, which is aimed more at small (<128) matrices; see https://github.com/hfp/libxsmm

Not sure if this info is useful in this context, but it might provide inspiration.