Hi SystemML developers, I have been experimenting with SystemML lately. I found some interesting numbers when evaluating these two expressions: *t(M%*%N)* and *t(N)%*% t(M)* (The size of matrix M is 20Kx100 and N is 100x20K). The pipeline *t(N)%*% t(M) *drastically outperforms* t(M%*%N)*.
- *t(M%*%N)* - The multiplication took ~2.53 secs - The transpose took ~ 1.41 (The size of the intermediate *M%*%N *is 20Kx20K) - *t(N)%*% t(M) * - The multiplication took ~0.002 secs (given that the size of t(N) is 20Kx100 and t(M) is 100x20K ) - The total transpose time for both matrices is 0.023 secs So, I'm interested to know why the matrix multiplication in this pipeline *t(N)%*% t(M)* is faster than *t(M%*%N)*? and in general why *t(N)%*% t(M)* is faster than *t(M%*%N)*? *Experiments Configurations**: * - I'm using systemml-1.2.0 (Binary tgz) and working on a single node. - CPU E5-2640 v4 @ 2.40GHz, 40Cores - RAM : 123GB *t(N)%*%t(M) DML Script* | time = externalFunction(Integer i) return (Double B) | implemented in (classname="org.apache.sysml.udf.lib.TimeWrapper", exectype="mem"); | times = matrix(0.0, rows = 5, cols = 1) | M = rand(rows = 20000, cols = 100, pdf = 'uniform') | N = rand(rows = 100, cols = 20000, pdf = 'uniform') | for (ix in 1:5) { | start = time(1) | print("StartTime: "+start) | res = t(N)%*%t(M) | print(as.scalar(res[1,1])) | stop = time(1) | print("EndTime:- "+stop) | pipTime= (stop - start) / 1000 | print ("Time:"+ pipTime) | times[ix,1] = pipTime | } | times = t(times) *t(M%*%N) DML Script* | time = externalFunction(Integer i) return (Double B) | implemented in (classname="org.apache.sysml.udf.lib.TimeWrapper", exectype="mem"); | times = matrix(0.0, rows = 5, cols = 1) | M = rand(rows = 20000, cols = 100, pdf = 'uniform') | N = rand(rows = 100, cols = 20000, pdf = 'uniform') | for (ix in 1:5) { | start = time(1) | print("StartTime: "+start) | res = t(M%*%N) | print(as.scalar(res[1,1])) | end = time(1) | print("EndTime: "+end) | pipTime = (end - start)/1000 | print ("Time:"+ pipTime) | times[ix,1] = pipTime | } | times = t(times) *t(N)%*%t(M) RUN TIME EXPLAIN * # Memory Budget local/remote = 63716MB/140MB/140MB # Degree of Parallelism (vcores) local/remote = 40/1/1 PROGRAM ( size CP/MR = 58/0 ) --FUNCTIONS ----FUNCTION CALL GRAPH ------MAIN PROGRAM --------.defaultNS::time ----EXTERNAL FUNCTION .defaultNS::time --MAIN PROGRAM ----GENERIC (lines 4-6) [recompile=false] ------CP createvar _mVar0 scratch_space//_p32870_192.168.250.59//_t0/temp0 true MATRIX binaryblock 5 1 1000 1000 0 copy ------CP rand 5.SCALAR.INT.true 1.SCALAR.INT.true 1000 1000 0.0 0.0 1.0 -1 uniform 1.0 40 _mVar0.MATRIX.DOUBLE ------CP createvar _mVar1 scratch_space//_p32870_192.168.250.59//_t0/temp1 true MATRIX binaryblock 20000 100 1000 1000 2000000 copy ------CP rand 20000.SCALAR.INT.true 100.SCALAR.INT.true 1000 1000 0.0 1.0 1.0 -1 uniform 1.0 40 _mVar1.MATRIX.DOUBLE ------CP createvar _mVar2 scratch_space//_p32870_192.168.250.59//_t0/temp2 true MATRIX binaryblock 100 20000 1000 1000 2000000 copy ------CP rand 100.SCALAR.INT.true 20000.SCALAR.INT.true 1000 1000 0.0 1.0 1.0 -1 uniform 1.0 40 _mVar2.MATRIX.DOUBLE ------CP cpvar _mVar0 times ------CP cpvar _mVar1 M ------CP cpvar _mVar2 N ------CP rmvar _mVar0 _mVar1 _mVar2 ----FOR (lines 0-0) [in-place=[times]] ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT ------CP assignvar 5.SCALAR.INT.true __pred.SCALAR.INT ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT ------GENERIC (lines 8-8) [recompile=false] --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true start ------GENERIC (lines 9-12) [recompile=false] --------CP + StartTime: .SCALAR.STRING.true start.SCALAR.DOUBLE.false _Var3.SCALAR.STRING --------CP createvar _mVar4 scratch_space//_p32870_192.168.250.59//_t0/temp3 true MATRIX binarycell 20000 100 -1 -1 2000000 copy --------CP r' N.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE 40 --------CP createvar _mVar5 scratch_space//_p32870_192.168.250.59//_t0/temp4 true MATRIX binarycell 1 100 -1 -1 -1 copy --------CP rightIndex _mVar4.MATRIX.DOUBLE 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true 100.SCALAR.INT.true _mVar5.MATRIX.DOUBLE --------CP rmvar _mVar4 --------CP createvar _mVar6 scratch_space//_p32870_192.168.250.59//_t0/temp5 true MATRIX binarycell 100 20000 -1 -1 2000000 copy --------CP r' M.MATRIX.DOUBLE _mVar6.MATRIX.DOUBLE 40 --------CP createvar _mVar7 scratch_space//_p32870_192.168.250.59//_t0/temp6 true MATRIX binarycell 100 1 -1 -1 -1 copy --------CP rightIndex _mVar6.MATRIX.DOUBLE 1.SCALAR.INT.true 100.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar7.MATRIX.DOUBLE --------CP rmvar _mVar6 --------CP createvar _mVar8 scratch_space//_p32870_192.168.250.59//_t0/temp7 true MATRIX binarycell 1 1 -1 -1 -1 copy --------CP ba+* _mVar5.MATRIX.DOUBLE _mVar7.MATRIX.DOUBLE _mVar8.MATRIX.DOUBLE 40 --------CP rmvar _mVar5 _mVar7 --------CP createvar _mVar9 scratch_space//_p32870_192.168.250.59//_t0/temp8 true MATRIX binarycell 1 1 -1 -1 -1 copy --------CP rightIndex _mVar8.MATRIX.DOUBLE 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar9.MATRIX.DOUBLE --------CP rmvar _mVar8 --------CP castdts _mVar9.MATRIX.DOUBLE.false _Var10.SCALAR.STRING --------CP rmvar _mVar9 --------CP print _Var3.SCALAR.STRING.false _Var11.SCALAR.STRING --------CP rmvar _Var3 --------CP print _Var10.SCALAR.STRING.false _Var12.SCALAR.STRING --------CP rmvar _Var10 --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true stop --------CP rmvar _Var11 _Var12 ------GENERIC (lines 13-16) [recompile=false] --------CP + EndTime:- .SCALAR.STRING.true stop.SCALAR.DOUBLE.false _Var13.SCALAR.STRING --------CP - stop.SCALAR.DOUBLE.false start.SCALAR.DOUBLE.false _Var14.SCALAR.DOUBLE --------CP / _Var14.SCALAR.DOUBLE.false 1000.SCALAR.INT.true _Var15.SCALAR.DOUBLE --------CP rmvar _Var14 --------CP + Time:.SCALAR.STRING.true _Var15.SCALAR.DOUBLE.false _Var16.SCALAR.STRING --------CP createvar _mVar17 scratch_space//_p32870_192.168.250.59//_t0/temp9 true MATRIX binarycell 5 1 -1 -1 -1 copy --------CP leftIndex times.MATRIX.DOUBLE _Var15.SCALAR.DOUBLE.false ix.SCALAR.INT.false ix.SCALAR.INT.false 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar17.MATRIX.DOUBLE --------CP rmvar _Var15 --------CP print _Var13.SCALAR.STRING.false _Var18.SCALAR.STRING --------CP rmvar _Var13 --------CP print _Var16.SCALAR.STRING.false _Var19.SCALAR.STRING --------CP rmvar _Var16 times _Var18 _Var19 --------CP cpvar _mVar17 times --------CP rmvar _mVar17 start stop ----GENERIC (lines 18-19) [recompile=false] ------CP createvar _mVar20 scratch_space//_p32870_192.168.250.59//_t0/temp10 true MATRIX binarycell 1 5 -1 -1 -1 copy ------CP r' times.MATRIX.DOUBLE _mVar20.MATRIX.DOUBLE 40 ------CP rmvar times ------CP cpvar _mVar20 times ------CP rmvar _mVar20 times SystemML Statistics: Total elapsed time: 1.178 sec. Total compilation time: 0.591 sec. Total execution time: 0.587 sec. Number of compiled MR Jobs: 0. Number of executed MR Jobs: 0. Cache hits (Mem, WB, FS, HDFS): 47/0/0/0. Cache writes (WB, FS, HDFS): 12/0/0. Cache times (ACQr/m, RLS, EXP): 0.001/0.001/0.002/0.000 sec. HOP DAGs recompiled (PRED, SB): 0/0. HOP DAGs recompile time: 0.000 sec. Total JIT compile time: 29.332 sec. Total JVM GC count: 8. Total JVM GC time: 0.38 sec. Heavy hitter instructions: # Instruction Time(s) Count 1 rand 0.307 3 2 r' 0.256 11 3 ba+* 0.010 5 4 time 0.004 10 5 rmvar 0.003 67 6 rightIndex 0.002 15 7 print 0.001 20 8 createvar 0.001 39 9 leftIndex 0.000 5 10 + 0.000 15 *t(M%*%N) RUN TIME EXPLAIN * # Memory Budget local/remote = 63716MB/140MB/140MB # Degree of Parallelism (vcores) local/remote = 40/1/1 PROGRAM ( size CP/MR = 50/0 ) --FUNCTIONS ----FUNCTION CALL GRAPH ------MAIN PROGRAM --------.defaultNS::time ----EXTERNAL FUNCTION .defaultNS::time --MAIN PROGRAM ----GENERIC (lines 4-6) [recompile=false] ------CP createvar _mVar0 scratch_space//_p32616_192.168.250.59//_t0/temp0 true MATRIX binaryblock 5 1 1000 1000 0 copy ------CP rand 5.SCALAR.INT.true 1.SCALAR.INT.true 1000 1000 0.0 0.0 1.0 -1 uniform 1.0 40 _mVar0.MATRIX.DOUBLE ------CP createvar _mVar1 scratch_space//_p32616_192.168.250.59//_t0/temp1 true MATRIX binaryblock 20000 100 1000 1000 2000000 copy ------CP rand 20000.SCALAR.INT.true 100.SCALAR.INT.true 1000 1000 0.0 1.0 1.0 -1 uniform 1.0 40 _mVar1.MATRIX.DOUBLE ------CP createvar _mVar2 scratch_space//_p32616_192.168.250.59//_t0/temp2 true MATRIX binaryblock 100 20000 1000 1000 2000000 copy ------CP rand 100.SCALAR.INT.true 20000.SCALAR.INT.true 1000 1000 0.0 1.0 1.0 -1 uniform 1.0 40 _mVar2.MATRIX.DOUBLE ------CP cpvar _mVar0 times ------CP cpvar _mVar1 M ------CP cpvar _mVar2 N ------CP rmvar _mVar0 _mVar1 _mVar2 ----FOR (lines 0-0) [in-place=[times]] ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT ------CP assignvar 5.SCALAR.INT.true __pred.SCALAR.INT ------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT ------GENERIC (lines 8-8) [recompile=false] --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true start ------GENERIC (lines 9-12) [recompile=false] --------CP + StartTime: .SCALAR.STRING.true start.SCALAR.DOUBLE.false _Var3.SCALAR.STRING --------CP createvar _mVar4 scratch_space//_p32616_192.168.250.59//_t0/temp3 true MATRIX binarycell 20000 20000 -1 -1 -1 copy --------CP ba+* M.MATRIX.DOUBLE N.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE 40 --------CP createvar _mVar5 scratch_space//_p32616_192.168.250.59//_t0/temp4 true MATRIX binarycell 20000 20000 -1 -1 -1 copy --------CP r' _mVar4.MATRIX.DOUBLE _mVar5.MATRIX.DOUBLE 40 --------CP rmvar _mVar4 --------CP createvar _mVar6 scratch_space//_p32616_192.168.250.59//_t0/temp5 true MATRIX binarycell 1 1 -1 -1 -1 copy --------CP rightIndex _mVar5.MATRIX.DOUBLE 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar6.MATRIX.DOUBLE --------CP rmvar _mVar5 --------CP castdts _mVar6.MATRIX.DOUBLE.false _Var7.SCALAR.STRING --------CP rmvar _mVar6 --------CP print _Var3.SCALAR.STRING.false _Var8.SCALAR.STRING --------CP rmvar _Var3 --------CP print _Var7.SCALAR.STRING.false _Var9.SCALAR.STRING --------CP rmvar _Var7 --------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true end --------CP rmvar _Var8 _Var9 ------GENERIC (lines 13-16) [recompile=false] --------CP + EndTime: .SCALAR.STRING.true end.SCALAR.DOUBLE.false _Var10.SCALAR.STRING --------CP - end.SCALAR.DOUBLE.false start.SCALAR.DOUBLE.false _Var11.SCALAR.DOUBLE --------CP / _Var11.SCALAR.DOUBLE.false 1000.SCALAR.INT.true _Var12.SCALAR.DOUBLE --------CP rmvar _Var11 --------CP + Time:.SCALAR.STRING.true _Var12.SCALAR.DOUBLE.false _Var13.SCALAR.STRING --------CP createvar _mVar14 scratch_space//_p32616_192.168.250.59//_t0/temp6 true MATRIX binarycell 5 1 -1 -1 -1 copy --------CP leftIndex times.MATRIX.DOUBLE _Var12.SCALAR.DOUBLE.false ix.SCALAR.INT.false ix.SCALAR.INT.false 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar14.MATRIX.DOUBLE --------CP rmvar _Var12 --------CP print _Var10.SCALAR.STRING.false _Var15.SCALAR.STRING --------CP rmvar _Var10 --------CP print _Var13.SCALAR.STRING.false _Var16.SCALAR.STRING --------CP rmvar _Var13 times _Var15 _Var16 --------CP cpvar _mVar14 times --------CP rmvar _mVar14 start end ----GENERIC (lines 18-19) [recompile=false] ------CP createvar _mVar17 scratch_space//_p32616_192.168.250.59//_t0/temp7 true MATRIX binarycell 1 5 -1 -1 -1 copy ------CP r' times.MATRIX.DOUBLE _mVar17.MATRIX.DOUBLE 40 ------CP rmvar times ------CP cpvar _mVar17 times ------CP rmvar _mVar17 times SystemML Statistics: Total elapsed time: 22.074 sec. Total compilation time: 0.567 sec. Total execution time: 21.507 sec. Number of compiled MR Jobs: 0. Number of executed MR Jobs: 0. Cache hits (Mem, WB, FS, HDFS): 32/0/0/0. Cache writes (WB, FS, HDFS): 12/0/0. Cache times (ACQr/m, RLS, EXP): 0.000/0.001/0.003/0.000 sec. HOP DAGs recompiled (PRED, SB): 0/0. HOP DAGs recompile time: 0.000 sec. Total JIT compile time: 33.155 sec. Total JVM GC count: 7. Total JVM GC time: 0.279 sec. Heavy hitter instructions: # Instruction Time(s) Count 1 ba+* 12.661 5 2 r' 8.496 6 3 rand 0.334 3 4 time 0.006 10 5 rmvar 0.003 57 6 rightIndex 0.002 5 7 print 0.001 20 8 createvar 0.001 24 9 + 0.000 15 10 leftIndex 0.000 5 Thanks, Rana