[ https://issues.apache.org/jira/browse/SYSTEMML-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246306#comment-15246306 ]

Mike Dusenberry edited comment on SYSTEMML-633 at 4/18/16 6:51 PM:
-------------------------------------------------------------------

To continue the thread on the original slow left-indexing, I'd like to discuss in more detail the briefly mentioned experiment of moving {{util::im2col(...)}} to an external Java function.

I've attached the full {{Im2colWrapper.java}} class, which can be placed in the {{org.apache.sysml.udf.lib}} package and compiled with the rest of SystemML.  Below is a snippet of the main portion of the code:
{code}
Matrix imgM = (Matrix) getFunctionInput(0);
double[][] img = imgM.getMatrixAsDoubleArray();
int C = (int) imgM.getNumRows();
int Hin = Integer.parseInt(((Scalar) getFunctionInput(1)).getValue());
int Win = Integer.parseInt(((Scalar) getFunctionInput(2)).getValue());
int Hf = Integer.parseInt(((Scalar) getFunctionInput(3)).getValue());
int Wf = Integer.parseInt(((Scalar) getFunctionInput(4)).getValue());
int strideh = Integer.parseInt(((Scalar) getFunctionInput(5)).getValue());
int stridew = Integer.parseInt(((Scalar) getFunctionInput(6)).getValue());

int Hout = (Hin - Hf) / strideh + 1;
int Wout = (Win - Wf) / stridew + 1;

double[][] img_cols = new double[C*Hf*Wf][Hout*Wout];
for (int c = 0; c < C; c++) {  // all channels
    for (int hf = 0; hf < Hf; hf++) {  // all filter rows
        for (int wf = 0; wf < Wf; wf++) {  // all filter columns
            for (int hout = 0; hout < Hout; hout++) {  // all output rows
                for (int wout = 0; wout < Wout; wout++) {  // all output columns
                    int out_row = c*Hf*Wf + hf*Wf + wf;  // destination row in img_cols
                    int out_col = hout*Wout + wout;      // destination column in img_cols
                    int hin = hout*strideh + hf;         // source row within the image plane
                    int win = wout*stridew + wf;         // source column within the image plane
                    int in_col = hin*Win + win;          // flattened index into row c of img
                    img_cols[out_row][out_col] = img[c][in_col];
                }
            }
        }
    }
}
{code}

As you can see, it's a very simple, nested for-loop implementation without much regard for performance (interestingly, it's also quite similar to the performant implementations in popular deep learning libraries, such as [Caffe's im2col|https://github.com/BVLC/caffe/blob/master/src/caffe/util/im2col.cpp#L19]).  If we compile SystemML with the above Java class and replace {{util::im2col(...)}} with an external function declaration,
{code}
im2col = externalFunction(matrix[double] img, int Hin, int Win, int Hf, int Wf, int strideh, int stridew)
    return (matrix[double] img_cols)
    implemented in (classname="org.apache.sysml.udf.lib.Im2colWrapper", exectype="mem")
{code},
we find that this yields a major speedup:
{code}
16/04/18 10:48:52 INFO api.DMLScript: SystemML Statistics:
Total elapsed time:             1.561 sec.
Total compilation time:         0.757 sec.
Total execution time:           0.804 sec.
Number of compiled MR Jobs:     0.
Number of executed MR Jobs:     0.
Cache hits (Mem, WB, FS, HDFS): 14091/0/0/0.
Cache writes (WB, FS, HDFS):    10052/0/0.
Cache times (ACQr/m, RLS, EXP): 0.021/0.021/1.096/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time:        0.004 sec.
Functions recompiled:           65.
Functions recompile time:       0.480 sec.
ParFor loops optimized:         1.
ParFor optimize time:           0.026 sec.
ParFor initialize time:         0.024 sec.
ParFor result merge time:       0.020 sec.
ParFor total update in-place:   0/0/3904
Total JIT compile time:         2.468 sec.
Total JVM GC count:             8.
Total JVM GC time:              0.062 sec.
Heavy hitter instructions (name, time, count):
-- 1)   leftIndex       2.121 sec       3904
-- 2)   forward         0.694 sec       1
-- 3)   ba+*    0.416 sec       64
-- 4)   pad_image_t17   0.396 sec       8
-- 5)   pad_image_t20   0.380 sec       9
-- 6)   pad_image_t18   0.365 sec       8
-- 7)   pad_image_t23   0.357 sec       7
-- 8)   pad_image_t22   0.348 sec       8
-- 9)   pad_image_t19   0.328 sec       8
-- 10)  pad_image_t21   0.328 sec       8
-- 11)  pad_image_t24   0.322 sec       8
-- 12)  rand    0.207 sec       1988
-- 13)  rshape  0.197 sec       3968
-- 14)  rangeReIndex    0.153 sec       1984
-- 15)  im2colext_t24   0.129 sec       8
-- 16)  im2colext_t22   0.121 sec       8
-- 17)  im2colext_t21   0.114 sec       8
-- 18)  im2colext_t23   0.108 sec       7
-- 19)  im2colext_t20   0.100 sec       9
-- 20)  +       0.098 sec       11783
-- 21)  im2colext_t18   0.096 sec       8
-- 22)  im2colext_t17   0.092 sec       8
-- 23)  im2colext_t19   0.089 sec       8
-- 24)  rmvar   0.066 sec       36107
-- 25)  createvar       0.046 sec       11973
-- 26)  uak+    0.019 sec       1
-- 27)  ncol    0.015 sec       3969
-- 28)  *       0.013 sec       6178
-- 29)  cpvar   0.003 sec       2116
-- 30)  nrow    0.000 sec       67
-- 31)  print   0.000 sec       5
-- 32)  assignvar       0.000 sec       75
-- 33)  sqrt    0.000 sec       1
-- 34)  castvti         0.000 sec       2
-- 35)  /       0.000 sec       2
-- 36)  -       0.000 sec       2
{code}.

The given example is not quite large enough to exhibit the full *100x* speedup originally reported, but larger examples do reach such speedup factors.  Nonetheless, it is a major improvement.
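
For a sense of scale, here is a small DML sanity-check sketch (using the parameters from the test script quoted in the issue description below) of how much data each {{im2col}} call produces per image; the {{Hout}}/{{Wout}} formula is the same one used above.
{code}
# Shape check only (not a benchmark); parameters match the speed-633.dml
# test script quoted in the issue description below.
C = 30
Hin = 28
Win = 28
Hf = 3
Wf = 3
stride = 1
pad = 1

Hout = as.integer((Hin + 2*pad - Hf) / stride + 1)  # (28 + 2 - 3)/1 + 1 = 28
Wout = as.integer((Win + 2*pad - Wf) / stride + 1)  # 28

nr = C * Hf * Wf    # 30*3*3 = 270 rows
nc = Hout * Wout    # 28*28  = 784 columns
print("img_cols per image: " + nr + " x " + nc)
{code}
With 64 images in the parfor loop, that is 64 such 270 x 784 matrices per forward pass, which gives an idea of how larger inputs widen the gap.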

We can assess this further by setting {{pad = 0}} in the example script to disable padding, and then editing {{conv::forward(...)}} as
{code}
...
  # Convolution - im2col implementation
  parfor (n in 1:N, log=DEBUG) {  # all examples
    Xn = matrix(X[n,], rows=C, cols=Hin*Win)  # reshape

    # Pad image
    #Xn_padded = util::pad_image(Xn, Hin, Win, padh, padw)  # shape (C, (Hin+2*padh)*(Win+2*padw))
    Xn_padded = Xn

    # Extract local image patches into columns with im2col, of shape (C*Hf*Wf, Hout*Wout)
    Xn_padded_cols = util::im2col(Xn_padded, Hin+2*padh, Win+2*padw, Hf, Wf, strideh, stridew)

    # Convolve patches with filters
    outn = W %*% Xn_padded_cols + b  # shape (F, Hout*Wout)
    out[n,] = matrix(outn, rows=1, cols=F*Hout*Wout)  # reshape
  }
...
{code},
which then gives the following performance
{code}
16/04/18 11:31:02 INFO api.DMLScript: SystemML Statistics:
Total elapsed time:             1.054 sec.
Total compilation time:         0.669 sec.
Total execution time:           0.385 sec.
Number of compiled MR Jobs:     0.
Number of executed MR Jobs:     0.
Cache hits (Mem, WB, FS, HDFS): 651/0/0/0.
Cache writes (WB, FS, HDFS):    452/0/0.
Cache times (ACQr/m, RLS, EXP): 0.003/0.002/0.325/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time:        0.001 sec.
Functions recompiled:           1.
Functions recompile time:       0.009 sec.
ParFor loops optimized:         1.
ParFor optimize time:           0.017 sec.
ParFor initialize time:         0.017 sec.
ParFor result merge time:       0.018 sec.
ParFor total update in-place:   0/0/64
Total JIT compile time:         1.313 sec.
Total JVM GC count:             6.
Total JVM GC time:              0.053 sec.
Heavy hitter instructions (name, time, count):
-- 1)   leftIndex       0.514 sec       64
-- 2)   ba+*    0.433 sec       64
-- 3)   forward         0.286 sec       1
-- 4)   im2colext_t13   0.095 sec       9
-- 5)   +       0.088 sec       201
-- 6)   im2colext_t10   0.083 sec       7
-- 7)   im2colext_t9    0.073 sec       9
-- 8)   im2colext_t16   0.067 sec       9
-- 9)   im2colext_t12   0.064 sec       8
-- 10)  rand    0.064 sec       4
-- 11)  im2colext_t11   0.058 sec       7
-- 12)  im2colext_t14   0.057 sec       7
-- 13)  im2colext_t15   0.057 sec       8
-- 14)  rangeReIndex    0.043 sec       64
-- 15)  rshape  0.023 sec       128
-- 16)  rmvar   0.018 sec       1200
-- 17)  uak+    0.012 sec       1
-- 18)  createvar       0.004 sec       389
-- 19)  *       0.003 sec       325
-- 20)  cpvar   0.002 sec       132
-- 21)  ncol    0.001 sec       129
-- 22)  nrow    0.000 sec       3
-- 23)  print   0.000 sec       5
-- 24)  sqrt    0.000 sec       1
-- 25)  assignvar       0.000 sec       11
-- 26)  castvti         0.000 sec       2
-- 27)  /       0.000 sec       2
-- 28)  -       0.000 sec       2
{code}.

Essentially, the single left-indexing operation at the end of the parfor loop in the modified {{conv::forward(...)}} takes more time than the matrix multiplication just above it.  Larger examples make this gap even more pronounced.
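
To make that comparison concrete, here is a minimal, self-contained DML sketch (not the attached script; the random {{Xn_cols}} is only a stand-in for the im2col output, with sizes derived from the test script's parameters). The matrix multiply shows up as {{ba+*}} in the statistics, while the row-wise assignment compiles to the {{leftIndex}} instruction that tops the list above.
{code}
# Minimal sketch (stand-in data) isolating the two instructions compared above:
# the matrix multiply (ba+*) and the row-wise left-indexing (leftIndex).
N = 64    # num examples
F = 20    # num filters
K = 270   # C*Hf*Wf
P = 784   # Hout*Wout

W = rand(rows=F, cols=K)
out = matrix(0, rows=N, cols=F*P)

parfor (n in 1:N) {
  Xn_cols = rand(rows=K, cols=P)            # stand-in for the im2col output
  outn = W %*% Xn_cols                      # ba+* (matrix multiply)
  out[n,] = matrix(outn, rows=1, cols=F*P)  # leftIndex (the heavy hitter)
}
print(sum(out))
{code}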

Now, let's compare the equivalent DML implementation of {{util::im2col(...)}} against the above Java external function:
{code}
im2col = function(matrix[double] img, int Hin, int Win, int Hf, int Wf, int strideh, int stridew)
    return (matrix[double] img_cols) {
  /*
   * Rearrange local image regions (patches) into columns.
   *
   * Assumes image has already been padded as necessary.
   *
   * Inputs:
   *  - img: Input image, of shape (C, Hin*Win), where C is the number
   *      of input channels (depth).
   *  - Hin: Input height, including padding.
   *  - Win: Input width, including padding.
   *  - Hf: Filter height.
   *  - Wf: Filter width.
   *  - strideh: Stride over height.
   *  - stridew: Stride over width.
   *
   * Outputs:
   *  - img_cols: Local spatial regions (patches) of the image stretched
   *      out into columns, of shape (C*Hf*Wf, Hout*Wout).
   */
  C = nrow(img)
  Hout = as.integer((Hin - Hf) / strideh + 1)
  Wout = as.integer((Win - Wf) / stridew + 1)

  img_cols = matrix(0, rows=C*Hf*Wf, cols=Hout*Wout)  # zeros
  for (c in 1:C) {  # all channels
    for (hf in 1:Hf) {  # all filter rows
      for (wf in 1:Wf) {  # all filter columns
        for (hout in 1:Hout) {  # all output rows
          for (wout in 1:Wout) {  # all output columns
            out_row = (c-1)*Hf*Wf + (hf-1)*Wf + wf
            out_col = (hout-1)*Wout + wout
            hin = (hout-1) * strideh + hf
            win = (wout-1) * stridew + wf
            in_row = c
            in_col = (hin-1)*Win + win
            img_cols[out_row, out_col] = img[c, in_col]
          }
        }
      }
    }
  }
}
{code}
{code}
// performance to be added once the code finishes running
{code}

Overall, the point of this line of investigation is that, as a user and given the premise of SystemML, I would expect a simple for-loop implementation in DML to perform roughly on par with the same loop written in another language, and, given that, I would expect a further speedup when moving to a parfor implementation.  In this case in particular, the implementations in the other popular deep learning libraries keep things simple with a similar for-loop approach and still achieve high performance, so a user would be all the more inclined to hold those expectations of SystemML.
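
As a concrete illustration of that expectation, here is a minimal DML sketch (hypothetical sizes, not taken from the attached scripts) of the two variants a user would naturally write and compare: the same row-wise computation under a plain {{for}} loop and under {{parfor}}. The premise is that the {{parfor}} version should be at least as fast, which is exactly where the left-indexing bottleneck currently undermines that expectation.
{code}
# Hypothetical comparison: identical loop bodies under for vs. parfor,
# each writing one row per iteration via left-indexing.
N = 64
D = 100
X = rand(rows=N, cols=D)
W = rand(rows=D, cols=D)

out1 = matrix(0, rows=N, cols=D)
for (n in 1:N) {
  out1[n,] = X[n,] %*% W    # serial loop with left-indexing
}

out2 = matrix(0, rows=N, cols=D)
parfor (n in 1:N) {
  out2[n,] = X[n,] %*% W    # same body; expected to be at least as fast
}

print(sum(out1))
print(sum(out2))
{code}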


> Improve Left-Indexing Performance with (Nested) Parfor Loops
> ------------------------------------------------------------
>
>                 Key: SYSTEMML-633
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-633
>             Project: SystemML
>          Issue Type: Improvement
>          Components: ParFor
>            Reporter: Mike Dusenberry
>         Attachments: Im2colWrapper.java, log.txt
>
>
> In the experimental deep learning DML library I've been building 
> ([https://github.com/dusenberrymw/systemml-nn|https://github.com/dusenberrymw/systemml-nn]),
>  I've experienced severe bottlenecks due to *left-indexing* in parfor loops.  
> Here, I will highlight a few particular instances with simplified examples, 
> but the same issue is shared across many areas of the library, particularly 
> in the convolution and max pooling layers, and is exacerbated in real 
> use-cases.
> *Quick note* on setup for any of the below experiments.  Please grab a copy 
> of the above repo (particularly the {{nn}} directory), and run any 
> experiments with the {{nn}} package available at the base directory of the 
> experiment.
> Scenario: *Convolution*
> * In the library above, the forward pass of the convolution function 
> ([{{conv::forward(...)}} | 
> https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L8]
>  in {{nn/layers/conv.dml}}) essentially accepts a matrix {{X}} of images, a 
> matrix of weights {{W}}, and several other parameters corresponding to image 
> sizes, filter sizes, etc.  It then loops through the images with a {{parfor}} 
> loop, and for each image it pads the image with {{util::pad_image}}, extracts 
> "patches" of the image into columns of a matrix in a sliding fashion across 
> the image with {{util::im2col}}, performs a matrix multiplication between the 
> matrix of patch columns and the weight matrix, and then saves the result into 
> a matrix defined outside of the parfor loop using left-indexing.
> * Left-indexing has been identified as the bottleneck by a wide margin.
> * Left-indexing is used in the main {{conv::forward(...)}} function in the 
> [last line in the parfor 
> loop|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L61],
>  in the 
> [{{util::pad_image(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L196]
>  function used by {{conv::forward(...)}}, as well as in the 
> [{{util::im2col(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L96]
>  function used by {{conv::forward(...)}}.
> * Test script (assuming the {{nn}} package is available):
> ** {{speed-633.dml}} {code}
> source("nn/layers/conv.dml") as conv
> source("nn/util.dml") as util
> # Generate data
> N = 64  # num examples
> C = 30  # num channels
> Hin = 28  # input height
> Win = 28  # input width
> F = 20  # num filters
> Hf = 3  # filter height
> Wf = 3  # filter width
> stride = 1
> pad = 1
> X = rand(rows=N, cols=C*Hin*Win)
> # Create layer
> [W, b] = conv::init(F, C, Hf, Wf)
> # Forward
> [out, Hout, Wout] = conv::forward(X, W, b, C, Hin, Win, Hf, Wf, stride, stride, pad, pad)
> print("Out: " + nrow(out) + "x" + ncol(out))
> print("Hout: " + Hout)
> print("Wout: " + Wout)
> print("")
> print(sum(out))
> {code}
> * Invocation:
> ** {{java -jar $SYSTEMML_HOME/target/systemml-0.10.0-incubating-SNAPSHOT-standalone.jar -f speed-633.dml -stats -explain -exec singlenode}}
> * Stats output (modified to output up to 100 instructions):
> ** {code}
> ...
> Total elapsed time:   26.834 sec.
> Total compilation time:   0.529 sec.
> Total execution time:   26.304 sec.
> Number of compiled MR Jobs: 0.
> Number of executed MR Jobs: 0.
> Cache hits (Mem, WB, FS, HDFS): 9196235/0/0/0.
> Cache writes (WB, FS, HDFS):  3070724/0/0.
> Cache times (ACQr/m, RLS, EXP): 1.474/1.120/26.998/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/0.
> HOP DAGs recompile time:  0.268 sec.
> Functions recompiled:   129.
> Functions recompile time: 0.841 sec.
> ParFor loops optimized:   1.
> ParFor optimize time:   0.032 sec.
> ParFor initialize time:   0.015 sec.
> ParFor result merge time: 0.028 sec.
> ParFor total update in-place: 0/0/1559360
> Total JIT compile time:   14.235 sec.
> Total JVM GC count:   94.
> Total JVM GC time:    0.366 sec.
> Heavy hitter instructions (name, time, count):
> -- 1)   leftIndex   41.670 sec  1559360
> -- 2)   forward   26.212 sec  1
> -- 3)   im2col_t45  25.919 sec  8
> -- 4)   im2col_t41  25.850 sec  8
> -- 5)   im2col_t48  25.831 sec  8
> -- 6)   im2col_t43  25.752 sec  8
> -- 7)   im2col_t44  25.736 sec  8
> -- 8)   im2col_t42  25.695 sec  8
> -- 9)   im2col_t47  25.691 sec  8
> -- 10)  im2col_t46  25.519 sec  8
> -- 11)  rangeReIndex  13.392 sec  3012544
> -- 12)  rshape  8.197 sec   3064704
> -- 13)  rmvar   4.988 sec   20363504
> -- 14)  createvar   4.954 sec   7688965
> -- 15)  ncol  1.148 sec   3014529
> -- 16)  -   0.961 sec   3112834
> -- 17)  +   0.878 sec   3124617
> -- 18)  rand  0.839 sec   52228
> -- 19)  *   0.480 sec   1764229
> -- 20)  cpvar   0.366 sec   1607812
> -- 21)  ba+*  0.257 sec   64
> -- 22)  pad_image_t42   0.187 sec   8
> -- 23)  pad_image_t47   0.181 sec   8
> -- 24)  pad_image_t44   0.168 sec   8
> -- 25)  pad_image_t46   0.164 sec   8
> -- 26)  pad_image_t43   0.156 sec   8
> -- 27)  pad_image_t48   0.153 sec   8
> -- 28)  pad_image_t45   0.152 sec   8
> -- 29)  pad_image_t41   0.152 sec   8
> -- 30)  nrow  0.036 sec   50307
> -- 31)  assignvar   0.016 sec   52235
> -- 32)  uak+  0.015 sec   1
> -- 33)  castvti   0.000 sec   130
> -- 34)  print   0.000 sec   5
> -- 35)  /   0.000 sec   130
> -- 36)  sqrt  0.000 sec   1
> {code}
> ** *Full log file attached* (including a {{log=DEBUG}} modification to the parfor loop in {{conv::forward(...)}}).
> ** Note again that {{forward}}, {{im2col}}, and {{pad_image}} all involve 
> left-indexing.
> * Other notes:
> ** Further experiments involved replacing the {{util::im2col(...)}} function 
> with an external Java function using a basic, nested for-loop approach with 
> no regard for optimization.  Compared with the fastest parfor DML version, I 
> experienced at least a *100x* speed improvement.  When compared to the same 
> naive for-loop approach in DML, the speedup was even greater.
> ** Even with this external version of {{im2col}}, and with padding disabled, 
> the left-indexing within the parfor loop of {{conv::forward(...)}} still 
> dominated the execution time, acting as the major bottleneck.
> ** For all described experiments, logging indicated that parfor update in 
> place was *not* applied. 


