First, please see this ARM CPU, for example: http://img.deusm.com/eetimes/AMD01.png
There is one L1 cache per core, one L2 cache per 2 cores, and one L3 cache for all 8 cores. For a multithreaded BLAS (and D BLAS is going to be the best BLAS ever) it is critical to take into account that each L2 cache is shared by 2 cores.
At the same time, it is impossible to collect this information for every ARM CPU configuration. Is there any hope of getting it at runtime?
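
For what it's worth, one route I am aware of on Linux is the sysfs cache directory, /sys/devices/system/cpu/cpuN/cache/indexM/, whose level and shared_cpu_list files say which cores share a given cache. Below is a minimal sketch, assuming a Linux kernel that actually populates these entries (as far as I understand, that is not guaranteed on every ARM board). The helper names read_text, parse_cpu_list and cache_topology are made up for illustration, and Python is used only to keep the sketch short; the same files could be read from D.

```python
import glob
import os


def read_text(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-1,4' into [0, 1, 4]."""
    cpus = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cache_topology(cpu=0, root="/sys/devices/system/cpu"):
    """List the caches visible to one CPU, with the CPUs sharing each one."""
    caches = []
    pattern = os.path.join(root, "cpu%d" % cpu, "cache", "index*")
    for index_dir in sorted(glob.glob(pattern)):
        level = read_text(os.path.join(index_dir, "level"))
        shared = read_text(os.path.join(index_dir, "shared_cpu_list"))
        if level is None or shared is None:
            continue  # entry not populated on this kernel/platform
        caches.append({
            "level": int(level),
            "type": read_text(os.path.join(index_dir, "type")),  # Data, Instruction or Unified
            "size": read_text(os.path.join(index_dir, "size")),  # e.g. "1024K", may be None
            "shared_cpus": parse_cpu_list(shared),
        })
    return caches


if __name__ == "__main__":
    # An empty result means the kernel exposes no cache information for cpu0.
    for cache in cache_topology(0):
        print(cache)
```

If the list comes back empty, the information is simply not exported on that system, so a BLAS would still need a fallback (a conservative default, or a user-supplied setting).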
Best regards, Ilya
