Recent Apple ARM CPUs include a hardware coprocessor (AMX) for matrix
multiplication. It is not officially accessible to user code (though it
has been partly reverse-engineered); the supported route to it is
Apple's BLAS implementation in the Accelerate framework. The attached
trivial patch makes J use that rather than its own routines for large
matrix multiplication on darwin/arm. The performance delta is quite
good. Before:
   a=. ?1e3 2e3$0
   b=. ?2e3 3e3$0
   100 timex 'a +/ . * b'
0.103497
After:
   100 timex 'a +/ . * b'
0.0274741
   0.103497%0.0274741
3.76708
Nearly 4x faster!
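The patch itself is attached rather than shown here. As a rough
illustration only (the function names, the threshold value, and the
dispatch structure below are mine, not the patch's), the idea is to
hand large products to the platform's cblas_dgemm — which on darwin/arm
is Accelerate, and so reaches the AMX unit — while keeping a native
loop for small ones:

```c
#include <stddef.h>

#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>   /* provides cblas_dgemm */
#define HAVE_BLAS 1
#endif

/* Native fallback: C = A (m x k) times B (k x n), row-major. */
static void matmul_naive(size_t m, size_t k, size_t n,
                         const double *a, const double *b, double *c) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t p = 0; p < k; p++)
                s += a[i*k + p] * b[p*n + j];
            c[i*n + j] = s;
        }
}

/* Illustrative threshold (m*k*n), NOT the patch's actual value:
   only hand off to BLAS when the product is big enough to amortize
   any setup/warmup cost. */
#define GEMM_THRESHOLD 10000

void matmul(size_t m, size_t k, size_t n,
            const double *a, const double *b, double *c) {
#if defined(HAVE_BLAS)
    if (m * k * n >= GEMM_THRESHOLD) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    (int)m, (int)n, (int)k,
                    1.0, a, (int)k, b, (int)n, 0.0, c, (int)n);
        return;
    }
#endif
    matmul_naive(m, k, n, a, b, c);
}
```

On non-Apple platforms this sketch compiles with only the native path,
which is also why a threshold knob belongs in the dispatch rather than
in the kernel itself.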
There seems to be a warmup period (big buffers go brrr...), so the size
threshold at which J hands off to gemm should perhaps be tuned. I did
not take detailed measurements. (Fine print: benchmarks taken on a
14-inch MacBook Pro with an M1 Pro.)
Also of note: on desktop (Zen 2), NumPy is 3x faster than J. I tried
swapping out J's matrix-multiply microkernel for the newest one from
BLIS and got only a modest boost, so the problem is not in the
microkernel. I think NumPy is using OpenBLAS. (On ARM, J and NumPy are
reasonably close, and the hardware accelerator smokes both.)
-E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm