Dear Programming in J,

I made another test of numerical calculation in J,
this time looking at multiplying two matrices using
(+/ .*)  and here is what I have found.  It seems to
me that J with (+/ .*) has acceptable speed only for
matrices of order about 128 or below, after which order it
quickly falls behind other standard numerical software such
as python with numpy, and Octave.  I also wrote a naive C
program for matrix multiplication; for orders 256, 1024,
..., 8192 J tracks as 2 to 4 faster than the naive C program
(which does not do SIMD or mind caching much).

Numpy and Octave are able to use multiple threads and/or cores
just by calling ordinary 'matmul', and they are about 7 to
25 times as fast as J in my experiment.  As a primitive in J
the command (+/ .*) could be just as fast as in any competent
numerical program available in C for matrix multiplication.
Even if you do not want multithreading in J, it seems to
me that (+/ .*) has roughly 1/4 or 1/8 the speed of what should
be possible for a single threaded program.  It seems especially
troubling that it becomes just as slow as a plain vanilla
naive C program for larger sizes of the matrices.  I am not sure
why J does not seem to use BLAS or LAPACK for matrix multiplication.

Yours sincerely,
Imre Patyi

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Here is the summary of timings.

n time, C time, J time, python time, Octave (time, J)/(time, C) (time,
J)/(time, python) (time, J)/(time, Octave)
256 0.0780 0.0073 0.0010 0.0007 0.0936 7.3047 9.8987
512 0.2680 0.0671 0.0100 0.0050 0.2505 6.7137 13.4195
1024 1.8400 0.7293 0.0479 0.0380 0.3964 15.2255 19.1919
2048 14.0430 6.0432 0.2663 0.2851 0.4303 22.6938 21.1960
4096 109.8290 54.4634 2.2739 2.1620 0.4959 23.9513 25.1917
8192 874.8430 435.2600 17.1282 17.2197 0.4975 25.4120 25.2769

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
File: example-of-matmul.ijs

f=: 3 : 0
N=.y
a=.2 o. ((4 : '(1234*x)+(5678*y)')"0 0)/~ (i.N)
NB.smoutput(i.5){(i.5){a
NB.smoutput''
t=.timex'b=:a(+/ .*)a'
NB.smoutput(i.5){(i.5){b
NB.t;(60 60#:t)
t
)

NB. Sample run.
NB.   ,.f"0]2^>:i.13
NB. 0.0135541
NB.   3.5e_6
NB.   2.9e_6
NB.     4e_6
NB.  1.77e_5
NB. 0.0001052
NB. 0.0008633
NB. 0.0072972
NB. 0.0671373
NB. 0.729313
NB.  6.04315
NB.  54.4634
NB.   435.26


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
File: example-with-numpy.py

import numpy, time
def f(n):
    i=numpy.array(numpy.arange(n).reshape((1,n)))
    a=numpy.cos(numpy.array(1234*i+5678*i.T))
    #print(a.shape)
    t0=time.time()
    b=numpy.matmul(a,a)
    return time.time()-t0

for i in range(1,1+13):
    print(f(2**i))


r"""     Sample run.
C:>py "example-with-numpy.py"
0.0020143985748291016
0.0
0.0
0.0
0.0
0.0009746551513671875
0.0
0.0009989738464355469
0.009999990463256836
0.04790067672729492
0.26629042625427246
2.273921251296997
17.128154277801514
"""

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
File:  The command I used in Octave.

>> for n=2.^(1:13) ; i=(0:n-1) ; a=cos(1234*i'+5678*i) ; tic,b=a*a;toc, end
Elapsed time is 1.3113e-05 seconds.
Elapsed time is 1.90735e-05 seconds.
Elapsed time is 1.38283e-05 seconds.
Elapsed time is 1.3113e-05 seconds.
Elapsed time is 2.09808e-05 seconds.
Elapsed time is 4.88758e-05 seconds.
Elapsed time is 0.000244141 seconds.
Elapsed time is 0.00073719 seconds.
Elapsed time is 0.00500298 seconds.
Elapsed time is 0.0380011 seconds.
Elapsed time is 0.285108 seconds.
Elapsed time is 2.16196 seconds.
Elapsed time is 17.2197 seconds.
>>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
File: example-of-naive-matmul.c

#include <stdlib.h>
#include <stdio.h>
#include <math.h>

int
main(int argc, char **argv){

int N ;
if(argc==0){
N=8192 ;
} else {
N=atoi(argv[1]) ;
}

double *a=(double*)calloc(N*N,sizeof(double));
double *aT=(double*)calloc(N*N,sizeof(double));
for(int i=0 ; i<N ; i++){
for(int j =0 ; j<N ; j++){
a[i+N*j]=aT[j+N*i]=cos(1234*i+5678*j) ;
}
}

double *b=(double*)calloc(N*N,sizeof(double));
for(int i=0 ; i<N ; i++){
for(int j=0 ; j<N ; j++){
double bij=0.0 ;
for(int k=0 ; k<N ; k++){
bij += aT[k+N*i]*a[k+N*j] ;
}
b[i+N*j]=bij ;
}
}
printf("\n") ;
/*
for(int i=0 ; i<5 ; i++){
for(int j=0 ; j<5 ; j++){
printf("%f\t",a[i+N*j]) ;
}
printf("\n") ;
}
printf("\n") ;
for(int i=0 ; i<5 ; i++){
for(int j=0 ; j<5 ; j++){
printf("%f\t",b[i+N*j]) ;
}
printf("\n") ;
}
*/
}

/* Sample run.
$ cc -o example-of-naive-matmul{,.c} -O3
$ for i in {1..13}; do n=`echo 2^$i|bc`; echo $n ; time
./example-of-naive-matmul $n ; done
2


real    0m0.038s
user    0m0.015s
sys     0m0.000s
4


real    0m0.045s
user    0m0.000s
sys     0m0.030s
8


real    0m0.047s
user    0m0.030s
sys     0m0.000s
16


real    0m0.046s
user    0m0.046s
sys     0m0.015s
32


real    0m0.051s
user    0m0.015s
sys     0m0.000s
64


real    0m0.046s
user    0m0.000s
sys     0m0.030s
128


real    0m0.045s
user    0m0.000s
sys     0m0.046s
256


real    0m0.078s
user    0m0.015s
sys     0m0.030s
512


real    0m0.268s
user    0m0.218s
sys     0m0.030s
1024


real    0m1.840s
user    0m1.811s
sys     0m0.030s
2048


real    0m14.043s
user    0m13.937s
sys     0m0.062s
4096


real    1m49.829s
user    1m49.578s
sys     0m0.125s
8192


real    14m34.843s
user    14m33.046s
sys     0m0.874s

*/

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I ran all of the above on a lower midrange laptop with Windows 10,
i5, 8GB RAM, 2 cores, 4 threads; I used J805, Anaconda python 3.5,
Octave 5.2.0.
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to