Hi SystemML developers,

I have been experimenting with SystemML lately. I found some interesting
numbers when evaluating these two expressions: *t(M%*%N)* and *t(N)%*% t(M)*
(The size of matrix M is 20Kx100 and N is 100x20K). The pipeline *t(N)%*%
t(M) *drastically outperforms* t(M%*%N)*.

   - *t(M%*%N)*
   - The multiplication took ~2.53 secs
      - The transpose took ~ 1.41 (The size of the intermediate *M%*%N *is
      20Kx20K)
   - *t(N)%*% t(M) *
   - The multiplication took ~0.002 secs (given that the size of t(N) is
      20Kx100 and t(M) is 100x20K )
      - The total transpose time for both matrices is 0.023 secs

So, I'm interested to know why the matrix multiplication in this
pipeline *t(N)%*%
t(M)* is faster than *t(M%*%N)*? and in general why *t(N)%*% t(M)* is
faster than *t(M%*%N)*?

*Experiments Configurations**: *

   - I'm using  systemml-1.2.0 (Binary tgz) and working on a single node.
   - CPU E5-2640 v4 @ 2.40GHz, 40Cores
   - RAM : 123GB

   *t(N)%*%t(M) DML Script*
    | time = externalFunction(Integer i) return (Double B)
    |       implemented in
(classname="org.apache.sysml.udf.lib.TimeWrapper", exectype="mem");
    |  times = matrix(0.0, rows = 5, cols = 1)
    |  M = rand(rows = 20000, cols = 100, pdf = 'uniform')
    |  N = rand(rows = 100, cols = 20000, pdf = 'uniform')
    | for (ix in 1:5) {
    |   start = time(1)
    |   print("StartTime: "+start)
    |   res = t(N)%*%t(M)
    |   print(as.scalar(res[1,1]))
    |   stop = time(1)
    |   print("EndTime:- "+stop)
    |   pipTime= (stop - start) / 1000
    |   print ("Time:"+ pipTime)
    |   times[ix,1] = pipTime
    | }
    | times = t(times)

    *t(M%*%N) DML Script*
    | time = externalFunction(Integer i) return (Double B)
    |       implemented in
(classname="org.apache.sysml.udf.lib.TimeWrapper", exectype="mem");
    |  times = matrix(0.0, rows = 5, cols = 1)
    |  M = rand(rows = 20000, cols = 100, pdf = 'uniform')
    |  N = rand(rows = 100, cols = 20000, pdf = 'uniform')
    | for (ix in 1:5) {
    |   start = time(1)
    |   print("StartTime: "+start)
    |   res = t(M%*%N)
    |   print(as.scalar(res[1,1]))
    |   end = time(1)
    |   print("EndTime: "+end)
    |   pipTime = (end - start)/1000
    |   print ("Time:"+ pipTime)
    |   times[ix,1] = pipTime
    | }
    | times = t(times)

*t(N)%*%t(M) RUN TIME EXPLAIN *
# Memory Budget local/remote = 63716MB/140MB/140MB
# Degree of Parallelism (vcores) local/remote = 40/1/1
PROGRAM ( size CP/MR = 58/0 )
--FUNCTIONS
----FUNCTION CALL GRAPH
------MAIN PROGRAM
--------.defaultNS::time
----EXTERNAL FUNCTION .defaultNS::time
--MAIN PROGRAM
----GENERIC (lines 4-6) [recompile=false]
------CP createvar _mVar0 scratch_space//_p32870_192.168.250.59//_t0/temp0
true MATRIX binaryblock 5 1 1000 1000 0 copy
------CP rand 5.SCALAR.INT.true 1.SCALAR.INT.true 1000 1000 0.0 0.0 1.0 -1
uniform 1.0 40 _mVar0.MATRIX.DOUBLE
------CP createvar _mVar1 scratch_space//_p32870_192.168.250.59//_t0/temp1
true MATRIX binaryblock 20000 100 1000 1000 2000000 copy
------CP rand 20000.SCALAR.INT.true 100.SCALAR.INT.true 1000 1000 0.0 1.0
1.0 -1 uniform 1.0 40 _mVar1.MATRIX.DOUBLE
------CP createvar _mVar2 scratch_space//_p32870_192.168.250.59//_t0/temp2
true MATRIX binaryblock 100 20000 1000 1000 2000000 copy
------CP rand 100.SCALAR.INT.true 20000.SCALAR.INT.true 1000 1000 0.0 1.0
1.0 -1 uniform 1.0 40 _mVar2.MATRIX.DOUBLE
------CP cpvar _mVar0 times
------CP cpvar _mVar1 M
------CP cpvar _mVar2 N
------CP rmvar _mVar0 _mVar1 _mVar2
----FOR (lines 0-0) [in-place=[times]]
------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
------CP assignvar 5.SCALAR.INT.true __pred.SCALAR.INT
------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
------GENERIC (lines 8-8) [recompile=false]
--------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true start
------GENERIC (lines 9-12) [recompile=false]
--------CP + StartTime: .SCALAR.STRING.true start.SCALAR.DOUBLE.false
_Var3.SCALAR.STRING
--------CP createvar _mVar4
scratch_space//_p32870_192.168.250.59//_t0/temp3 true MATRIX binarycell
20000 100 -1 -1 2000000 copy
--------CP r' N.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE 40
--------CP createvar _mVar5
scratch_space//_p32870_192.168.250.59//_t0/temp4 true MATRIX binarycell 1
100 -1 -1 -1 copy
--------CP rightIndex _mVar4.MATRIX.DOUBLE 1.SCALAR.INT.true
1.SCALAR.INT.true 1.SCALAR.INT.true 100.SCALAR.INT.true _mVar5.MATRIX.DOUBLE
--------CP rmvar _mVar4
--------CP createvar _mVar6
scratch_space//_p32870_192.168.250.59//_t0/temp5 true MATRIX binarycell 100
20000 -1 -1 2000000 copy
--------CP r' M.MATRIX.DOUBLE _mVar6.MATRIX.DOUBLE 40
--------CP createvar _mVar7
scratch_space//_p32870_192.168.250.59//_t0/temp6 true MATRIX binarycell 100
1 -1 -1 -1 copy
--------CP rightIndex _mVar6.MATRIX.DOUBLE 1.SCALAR.INT.true
100.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar7.MATRIX.DOUBLE
--------CP rmvar _mVar6
--------CP createvar _mVar8
scratch_space//_p32870_192.168.250.59//_t0/temp7 true MATRIX binarycell 1 1
-1 -1 -1 copy
--------CP ba+* _mVar5.MATRIX.DOUBLE _mVar7.MATRIX.DOUBLE
_mVar8.MATRIX.DOUBLE 40
--------CP rmvar _mVar5 _mVar7
--------CP createvar _mVar9
scratch_space//_p32870_192.168.250.59//_t0/temp8 true MATRIX binarycell 1 1
-1 -1 -1 copy
--------CP rightIndex _mVar8.MATRIX.DOUBLE 1.SCALAR.INT.true
1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar9.MATRIX.DOUBLE
--------CP rmvar _mVar8
--------CP castdts _mVar9.MATRIX.DOUBLE.false _Var10.SCALAR.STRING
--------CP rmvar _mVar9
--------CP print _Var3.SCALAR.STRING.false _Var11.SCALAR.STRING
--------CP rmvar _Var3
--------CP print _Var10.SCALAR.STRING.false _Var12.SCALAR.STRING
--------CP rmvar _Var10
--------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true stop
--------CP rmvar _Var11 _Var12
------GENERIC (lines 13-16) [recompile=false]
--------CP + EndTime:- .SCALAR.STRING.true stop.SCALAR.DOUBLE.false
_Var13.SCALAR.STRING
--------CP - stop.SCALAR.DOUBLE.false start.SCALAR.DOUBLE.false
_Var14.SCALAR.DOUBLE
--------CP / _Var14.SCALAR.DOUBLE.false 1000.SCALAR.INT.true
_Var15.SCALAR.DOUBLE
--------CP rmvar _Var14
--------CP + Time:.SCALAR.STRING.true _Var15.SCALAR.DOUBLE.false
_Var16.SCALAR.STRING
--------CP createvar _mVar17
scratch_space//_p32870_192.168.250.59//_t0/temp9 true MATRIX binarycell 5 1
-1 -1 -1 copy
--------CP leftIndex times.MATRIX.DOUBLE _Var15.SCALAR.DOUBLE.false
ix.SCALAR.INT.false ix.SCALAR.INT.false 1.SCALAR.INT.true 1.SCALAR.INT.true
_mVar17.MATRIX.DOUBLE
--------CP rmvar _Var15
--------CP print _Var13.SCALAR.STRING.false _Var18.SCALAR.STRING
--------CP rmvar _Var13
--------CP print _Var16.SCALAR.STRING.false _Var19.SCALAR.STRING
--------CP rmvar _Var16 times _Var18 _Var19
--------CP cpvar _mVar17 times
--------CP rmvar _mVar17 start stop
----GENERIC (lines 18-19) [recompile=false]
------CP createvar _mVar20
scratch_space//_p32870_192.168.250.59//_t0/temp10 true MATRIX binarycell 1
5 -1 -1 -1 copy
------CP r' times.MATRIX.DOUBLE _mVar20.MATRIX.DOUBLE 40
------CP rmvar times
------CP cpvar _mVar20 times
------CP rmvar _mVar20 times

SystemML Statistics:
Total elapsed time: 1.178 sec.
Total compilation time: 0.591 sec.
Total execution time: 0.587 sec.
Number of compiled MR Jobs: 0.
Number of executed MR Jobs: 0.
Cache hits (Mem, WB, FS, HDFS): 47/0/0/0.
Cache writes (WB, FS, HDFS): 12/0/0.
Cache times (ACQr/m, RLS, EXP): 0.001/0.001/0.002/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0.000 sec.
Total JIT compile time: 29.332 sec.
Total JVM GC count: 8.
Total JVM GC time: 0.38 sec.
Heavy hitter instructions:
  #  Instruction  Time(s)  Count
  1  rand           0.307      3
  2  r'             0.256     11
  3  ba+*           0.010      5
  4  time           0.004     10
  5  rmvar          0.003     67
  6  rightIndex     0.002     15
  7  print          0.001     20
  8  createvar      0.001     39
  9  leftIndex      0.000      5
 10  +              0.000     15

*t(M%*%N) RUN TIME EXPLAIN *
# Memory Budget local/remote = 63716MB/140MB/140MB
# Degree of Parallelism (vcores) local/remote = 40/1/1
PROGRAM ( size CP/MR = 50/0 )
--FUNCTIONS
----FUNCTION CALL GRAPH
------MAIN PROGRAM
--------.defaultNS::time
----EXTERNAL FUNCTION .defaultNS::time
--MAIN PROGRAM
----GENERIC (lines 4-6) [recompile=false]
------CP createvar _mVar0 scratch_space//_p32616_192.168.250.59//_t0/temp0
true MATRIX binaryblock 5 1 1000 1000 0 copy
------CP rand 5.SCALAR.INT.true 1.SCALAR.INT.true 1000 1000 0.0 0.0 1.0 -1
uniform 1.0 40 _mVar0.MATRIX.DOUBLE
------CP createvar _mVar1 scratch_space//_p32616_192.168.250.59//_t0/temp1
true MATRIX binaryblock 20000 100 1000 1000 2000000 copy
------CP rand 20000.SCALAR.INT.true 100.SCALAR.INT.true 1000 1000 0.0 1.0
1.0 -1 uniform 1.0 40 _mVar1.MATRIX.DOUBLE
------CP createvar _mVar2 scratch_space//_p32616_192.168.250.59//_t0/temp2
true MATRIX binaryblock 100 20000 1000 1000 2000000 copy
------CP rand 100.SCALAR.INT.true 20000.SCALAR.INT.true 1000 1000 0.0 1.0
1.0 -1 uniform 1.0 40 _mVar2.MATRIX.DOUBLE
------CP cpvar _mVar0 times
------CP cpvar _mVar1 M
------CP cpvar _mVar2 N
------CP rmvar _mVar0 _mVar1 _mVar2
----FOR (lines 0-0) [in-place=[times]]
------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
------CP assignvar 5.SCALAR.INT.true __pred.SCALAR.INT
------CP assignvar 1.SCALAR.INT.true __pred.SCALAR.INT
------GENERIC (lines 8-8) [recompile=false]
--------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true start
------GENERIC (lines 9-12) [recompile=false]
--------CP + StartTime: .SCALAR.STRING.true start.SCALAR.DOUBLE.false
_Var3.SCALAR.STRING
--------CP createvar _mVar4
scratch_space//_p32616_192.168.250.59//_t0/temp3 true MATRIX binarycell
20000 20000 -1 -1 -1 copy
--------CP ba+* M.MATRIX.DOUBLE N.MATRIX.DOUBLE _mVar4.MATRIX.DOUBLE 40
--------CP createvar _mVar5
scratch_space//_p32616_192.168.250.59//_t0/temp4 true MATRIX binarycell
20000 20000 -1 -1 -1 copy
--------CP r' _mVar4.MATRIX.DOUBLE _mVar5.MATRIX.DOUBLE 40
--------CP rmvar _mVar4
--------CP createvar _mVar6
scratch_space//_p32616_192.168.250.59//_t0/temp5 true MATRIX binarycell 1 1
-1 -1 -1 copy
--------CP rightIndex _mVar5.MATRIX.DOUBLE 1.SCALAR.INT.true
1.SCALAR.INT.true 1.SCALAR.INT.true 1.SCALAR.INT.true _mVar6.MATRIX.DOUBLE
--------CP rmvar _mVar5
--------CP castdts _mVar6.MATRIX.DOUBLE.false _Var7.SCALAR.STRING
--------CP rmvar _mVar6
--------CP print _Var3.SCALAR.STRING.false _Var8.SCALAR.STRING
--------CP rmvar _Var3
--------CP print _Var7.SCALAR.STRING.false _Var9.SCALAR.STRING
--------CP rmvar _Var7
--------CP extfunct .defaultNS time 1 1 i=1.SCALAR.INT.true end
--------CP rmvar _Var8 _Var9
------GENERIC (lines 13-16) [recompile=false]
--------CP + EndTime: .SCALAR.STRING.true end.SCALAR.DOUBLE.false
_Var10.SCALAR.STRING
--------CP - end.SCALAR.DOUBLE.false start.SCALAR.DOUBLE.false
_Var11.SCALAR.DOUBLE
--------CP / _Var11.SCALAR.DOUBLE.false 1000.SCALAR.INT.true
_Var12.SCALAR.DOUBLE
--------CP rmvar _Var11
--------CP + Time:.SCALAR.STRING.true _Var12.SCALAR.DOUBLE.false
_Var13.SCALAR.STRING
--------CP createvar _mVar14
scratch_space//_p32616_192.168.250.59//_t0/temp6 true MATRIX binarycell 5 1
-1 -1 -1 copy
--------CP leftIndex times.MATRIX.DOUBLE _Var12.SCALAR.DOUBLE.false
ix.SCALAR.INT.false ix.SCALAR.INT.false 1.SCALAR.INT.true 1.SCALAR.INT.true
_mVar14.MATRIX.DOUBLE
--------CP rmvar _Var12
--------CP print _Var10.SCALAR.STRING.false _Var15.SCALAR.STRING
--------CP rmvar _Var10
--------CP print _Var13.SCALAR.STRING.false _Var16.SCALAR.STRING
--------CP rmvar _Var13 times _Var15 _Var16
--------CP cpvar _mVar14 times
--------CP rmvar _mVar14 start end
----GENERIC (lines 18-19) [recompile=false]
------CP createvar _mVar17 scratch_space//_p32616_192.168.250.59//_t0/temp7
true MATRIX binarycell 1 5 -1 -1 -1 copy
------CP r' times.MATRIX.DOUBLE _mVar17.MATRIX.DOUBLE 40
------CP rmvar times
------CP cpvar _mVar17 times
------CP rmvar _mVar17 times

SystemML Statistics:
Total elapsed time: 22.074 sec.
Total compilation time: 0.567 sec.
Total execution time: 21.507 sec.
Number of compiled MR Jobs: 0.
Number of executed MR Jobs: 0.
Cache hits (Mem, WB, FS, HDFS): 32/0/0/0.
Cache writes (WB, FS, HDFS): 12/0/0.
Cache times (ACQr/m, RLS, EXP): 0.000/0.001/0.003/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0.000 sec.
Total JIT compile time: 33.155 sec.
Total JVM GC count: 7.
Total JVM GC time: 0.279 sec.
Heavy hitter instructions:
  #  Instruction  Time(s)  Count
  1  ba+*          12.661      5
  2  r'             8.496      6
  3  rand           0.334      3
  4  time           0.006     10
  5  rmvar          0.003     57
  6  rightIndex     0.002      5
  7  print          0.001     20
  8  createvar      0.001     24
  9  +              0.000     15
 10  leftIndex      0.000      5




Thanks,
Rana

Reply via email to