Dear Programming in J, I made another test of numerical calculation in J, this time looking at multiplying two matrices using (+/ .*) and here is what I have found. It seems to me that J with (+/ .*) has acceptable speed only for matrices of order about 128 or below, after which order it quickly falls behind other standard numerical software such as python with numpy, and Octave. I also wrote a naive C program for matrix multiplication; for orders 256, 1024, ..., 8192 J tracks as 2 to 4 faster than the naive C program (which does not do SIMD or mind caching much).
Numpy and Octave are able to use multiple threads and/or cores just by calling ordinary 'matmul', and they are about 7 to 25 times as fast as J in my experiment. As a primitive in J the command (+/ .*) could be just as fast as in any competent numerical program available in C for matrix multiplication. Even if you do not want multithreading in J, it seems to me that (+/ .*) has roughly 1/4 or 1/8 the speed of what should be possible for a single threaded program. It seems especially troubling that it becomes just as slow as a plain vanilla naive C program for larger sizes of the matrices. I am not sure why J does not seem to use BLAS or LAPACK for matrix multiplication. Yours sincerely, Imre Patyi ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Here is the summary of timings. n time, C time, J time, python time, Octave (time, J)/(time, C) (time, J)/(time, python) (time, J)/(time, Octave) 256 0.0780 0.0073 0.0010 0.0007 0.0936 7.3047 9.8987 512 0.2680 0.0671 0.0100 0.0050 0.2505 6.7137 13.4195 1024 1.8400 0.7293 0.0479 0.0380 0.3964 15.2255 19.1919 2048 14.0430 6.0432 0.2663 0.2851 0.4303 22.6938 21.1960 4096 109.8290 54.4634 2.2739 2.1620 0.4959 23.9513 25.1917 8192 874.8430 435.2600 17.1282 17.2197 0.4975 25.4120 25.2769 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ File: example-of-matmul.ijs f=: 3 : 0 N=.y a=.2 o. ((4 : '(1234*x)+(5678*y)')"0 0)/~ (i.N) NB.smoutput(i.5){(i.5){a NB.smoutput'' t=.timex'b=:a(+/ .*)a' NB.smoutput(i.5){(i.5){b NB.t;(60 60#:t) t ) NB. Sample run. NB. ,.f"0]2^>:i.13 NB. 0.0135541 NB. 3.5e_6 NB. 2.9e_6 NB. 4e_6 NB. 1.77e_5 NB. 0.0001052 NB. 0.0008633 NB. 0.0072972 NB. 0.0671373 NB. 0.729313 NB. 6.04315 NB. 54.4634 NB. 435.26 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ File: example-with-numpy.py import numpy, time def f(n): i=numpy.array(numpy.arange(n).reshape((1,n))) a=numpy.cos(numpy.array(1234*i+5678*i.T)) #print(a.shape) t0=time.time() b=numpy.matmul(a,a) return time.time()-t0 for i in range(1,1+13): print(f(2**i)) r""" Sample run. C:>py "example-with-numpy.py" 0.0020143985748291016 0.0 0.0 0.0 0.0 0.0009746551513671875 0.0 0.0009989738464355469 0.009999990463256836 0.04790067672729492 0.26629042625427246 2.273921251296997 17.128154277801514 """ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ File: The command I used in Octave. >> for n=2.^(1:13) ; i=(0:n-1) ; a=cos(1234*i'+5678*i) ; tic,b=a*a;toc, end Elapsed time is 1.3113e-05 seconds. Elapsed time is 1.90735e-05 seconds. Elapsed time is 1.38283e-05 seconds. Elapsed time is 1.3113e-05 seconds. Elapsed time is 2.09808e-05 seconds. Elapsed time is 4.88758e-05 seconds. Elapsed time is 0.000244141 seconds. Elapsed time is 0.00073719 seconds. Elapsed time is 0.00500298 seconds. Elapsed time is 0.0380011 seconds. Elapsed time is 0.285108 seconds. Elapsed time is 2.16196 seconds. Elapsed time is 17.2197 seconds. >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ File: example-of-naive-matmul.c #include <stdlib.h> #include <stdio.h> #include <math.h> int main(int argc, char **argv){ int N ; if(argc==0){ N=8192 ; } else { N=atoi(argv[1]) ; } double *a=(double*)calloc(N*N,sizeof(double)); double *aT=(double*)calloc(N*N,sizeof(double)); for(int i=0 ; i<N ; i++){ for(int j =0 ; j<N ; j++){ a[i+N*j]=aT[j+N*i]=cos(1234*i+5678*j) ; } } double *b=(double*)calloc(N*N,sizeof(double)); for(int i=0 ; i<N ; i++){ for(int j=0 ; j<N ; j++){ double bij=0.0 ; for(int k=0 ; k<N ; k++){ bij += aT[k+N*i]*a[k+N*j] ; } b[i+N*j]=bij ; } } printf("\n") ; /* for(int i=0 ; i<5 ; i++){ for(int j=0 ; j<5 ; j++){ printf("%f\t",a[i+N*j]) ; } printf("\n") ; } printf("\n") ; for(int i=0 ; i<5 ; i++){ for(int j=0 ; j<5 ; j++){ printf("%f\t",b[i+N*j]) ; } printf("\n") ; } */ } /* Sample run. $ cc -o example-of-naive-matmul{,.c} -O3 $ for i in {1..13}; do n=`echo 2^$i|bc`; echo $n ; time ./example-of-naive-matmul $n ; done 2 real 0m0.038s user 0m0.015s sys 0m0.000s 4 real 0m0.045s user 0m0.000s sys 0m0.030s 8 real 0m0.047s user 0m0.030s sys 0m0.000s 16 real 0m0.046s user 0m0.046s sys 0m0.015s 32 real 0m0.051s user 0m0.015s sys 0m0.000s 64 real 0m0.046s user 0m0.000s sys 0m0.030s 128 real 0m0.045s user 0m0.000s sys 0m0.046s 256 real 0m0.078s user 0m0.015s sys 0m0.030s 512 real 0m0.268s user 0m0.218s sys 0m0.030s 1024 real 0m1.840s user 0m1.811s sys 0m0.030s 2048 real 0m14.043s user 0m13.937s sys 0m0.062s 4096 real 1m49.829s user 1m49.578s sys 0m0.125s 8192 real 14m34.843s user 14m33.046s sys 0m0.874s */ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ I ran all of the above on a lower midrange laptop with Windows 10, i5, 8GB RAM, 2 cores, 4 threads; I used J805, Anaconda python 3.5, Octave 5.2.0. ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm