Recent Apple ARM CPUs include a hardware coprocessor (AMX) for matrix
multiplication. It is not officially accessible to user code (though it
has been partly reverse-engineered); the supported route to it is
Apple's BLAS implementation in the Accelerate framework. The attached
trivial patch makes J use that rather than its own routines for large
matrix multiplication on darwin/arm. The performance delta is quite
good. Before:
   a=. ?1e3 2e3$0
   b=. ?2e3 3e3$0
   100 timex 'a +/ . * b'
0.103497
After:
   100 timex 'a +/ . * b'
0.0274741
   0.103497%0.0274741
3.76708
Nearly 4x faster!
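The patch itself is attached rather than shown here. As a rough
illustration only (the function names, the threshold value, and the
dispatch structure below are mine, not the patch's), the idea is to
hand large products to the platform's cblas_dgemm — which on darwin/arm
is Accelerate, and so reaches the AMX unit — while keeping a native
loop for small ones:

```c
#include <stddef.h>

#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>   /* provides cblas_dgemm */
#define HAVE_BLAS 1
#endif

/* Native fallback: C = A (m x k) times B (k x n), row-major. */
static void matmul_naive(size_t m, size_t k, size_t n,
                         const double *a, const double *b, double *c) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t p = 0; p < k; p++)
                s += a[i*k + p] * b[p*n + j];
            c[i*n + j] = s;
        }
}

/* Illustrative threshold (m*k*n), NOT the patch's actual value:
   only hand off to BLAS when the product is big enough to amortize
   any setup/warmup cost. */
#define GEMM_THRESHOLD 10000

void matmul(size_t m, size_t k, size_t n,
            const double *a, const double *b, double *c) {
#if defined(HAVE_BLAS)
    if (m * k * n >= GEMM_THRESHOLD) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    (int)m, (int)n, (int)k,
                    1.0, a, (int)k, b, (int)n, 0.0, c, (int)n);
        return;
    }
#endif
    matmul_naive(m, k, n, a, b, c);
}
```

On non-Apple platforms this sketch compiles with only the native path,
which is also why a threshold knob belongs in the dispatch rather than
in the kernel itself.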
There seems to be a warmup period (big buffers go brrr...), so the size
threshold at which J hands off to gemm should perhaps be tuned. I did
not take detailed measurements. (Fine print: benchmarks taken on a
14-inch MacBook Pro with an M1 Pro.)
Also of note: on desktop (Zen 2), NumPy is 3x faster than J. I tried
swapping out J's matrix-multiply microkernel for the newest one from
BLIS and got only a modest boost, so the problem is not in the
microkernel. I think NumPy is using OpenBLAS. (On ARM, J and NumPy are
reasonably close, and the hardware accelerator smokes both.)
-E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm