[ https://issues.apache.org/jira/browse/SYSTEMML-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246306#comment-15246306 ]
Mike Dusenberry edited comment on SYSTEMML-633 at 4/18/16 6:51 PM: ------------------------------------------------------------------- To continue on the thread of the original slow left-indexing, I'd like to discuss more details of the briefly mentioned experiment with moving {{util::im2col(...)}} to an external Java function. I've attached the full {{Im2colWrapper.java}} class which can be placed in the {{org.apache.sysml.udf.lib}} package and compiled with the rest of SystemML. Below is a snippet of the main portion of the code: {code} Matrix imgM = (Matrix) getFunctionInput(0); double[][] img = imgM.getMatrixAsDoubleArray(); int C = (int) imgM.getNumRows(); int Hin = Integer.parseInt(((Scalar)getFunctionInput(1)).getValue()); int Win = Integer.parseInt(((Scalar)getFunctionInput(2)).getValue()); int Hf = Integer.parseInt(((Scalar)getFunctionInput(3)).getValue()); int Wf = Integer.parseInt(((Scalar)getFunctionInput(4)).getValue()); int strideh = Integer.parseInt(((Scalar)getFunctionInput(5)).getValue()); int stridew = Integer.parseInt(((Scalar) getFunctionInput(6)).getValue()); int Hout = (Hin - Hf) / strideh + 1; int Wout = (Win - Wf) / stridew + 1; double[][] img_cols = new double[C*Hf*Wf][Hout*Wout]; for (int c=0; c<C; c++) { // all channels for (int hf=0; hf<Hf; hf++) { // all filter rows for (int wf=0; wf<Wf; wf++) { // all filter columns for (int hout=0; hout<Hout; hout++) { // all output rows for (int wout=0; wout<Wout; wout++) { // all output columns int out_row = c*Hf*Wf + hf*Wf + wf; int out_col = hout*Wout + wout; int hin = hout*strideh + hf; int win = wout*stridew + wf; int in_col = hin*Win + win; img_cols[out_row][out_col] = img[c][in_col]; } } } } } {code} As you can see, it's a very simple, nested for-loop implementation, without much regard for performance (interestingly, it's also quite similar to existing performant implementations in the popular deep learning libraries, such as [Caffe's im2col|https://github.com/BVLC/caffe/blob/master/src/caffe/util/im2col.cpp#L19]). If we compile with the above Java class, and replace {{util::im2col(...)}} with an external function declaration, {code} im2col = externalFunction(matrix[double] img, int Hin, int Win, int Hf, int Wf, int strideh, int stridew) return (matrix[double] img_cols) implemented in (classname="org.apache.sysml.udf.lib.Im2colWrapper",exectype="mem") {code}, we will find that we can achieve a major speedup: {code} 16/04/18 10:48:52 INFO api.DMLScript: SystemML Statistics: Total elapsed time: 1.561 sec. Total compilation time: 0.757 sec. Total execution time: 0.804 sec. Number of compiled MR Jobs: 0. Number of executed MR Jobs: 0. Cache hits (Mem, WB, FS, HDFS): 14091/0/0/0. Cache writes (WB, FS, HDFS): 10052/0/0. Cache times (ACQr/m, RLS, EXP): 0.021/0.021/1.096/0.000 sec. HOP DAGs recompiled (PRED, SB): 0/0. HOP DAGs recompile time: 0.004 sec. Functions recompiled: 65. Functions recompile time: 0.480 sec. ParFor loops optimized: 1. ParFor optimize time: 0.026 sec. ParFor initialize time: 0.024 sec. ParFor result merge time: 0.020 sec. ParFor total update in-place: 0/0/3904 Total JIT compile time: 2.468 sec. Total JVM GC count: 8. Total JVM GC time: 0.062 sec. Heavy hitter instructions (name, time, count): -- 1) leftIndex 2.121 sec 3904 -- 2) forward 0.694 sec 1 -- 3) ba+* 0.416 sec 64 -- 4) pad_image_t17 0.396 sec 8 -- 5) pad_image_t20 0.380 sec 9 -- 6) pad_image_t18 0.365 sec 8 -- 7) pad_image_t23 0.357 sec 7 -- 8) pad_image_t22 0.348 sec 8 -- 9) pad_image_t19 0.328 sec 8 -- 10) pad_image_t21 0.328 sec 8 -- 11) pad_image_t24 0.322 sec 8 -- 12) rand 0.207 sec 1988 -- 13) rshape 0.197 sec 3968 -- 14) rangeReIndex 0.153 sec 1984 -- 15) im2colext_t24 0.129 sec 8 -- 16) im2colext_t22 0.121 sec 8 -- 17) im2colext_t21 0.114 sec 8 -- 18) im2colext_t23 0.108 sec 7 -- 19) im2colext_t20 0.100 sec 9 -- 20) + 0.098 sec 11783 -- 21) im2colext_t18 0.096 sec 8 -- 22) im2colext_t17 0.092 sec 8 -- 23) im2colext_t19 0.089 sec 8 -- 24) rmvar 0.066 sec 36107 -- 25) createvar 0.046 sec 11973 -- 26) uak+ 0.019 sec 1 -- 27) ncol 0.015 sec 3969 -- 28) * 0.013 sec 6178 -- 29) cpvar 0.003 sec 2116 -- 30) nrow 0.000 sec 67 -- 31) print 0.000 sec 5 -- 32) assignvar 0.000 sec 75 -- 33) sqrt 0.000 sec 1 -- 34) castvti 0.000 sec 2 -- 35) / 0.000 sec 2 -- 36) - 0.000 sec 2 {code}. The given example is not quite large enough to allow for the full *100x* speedup originally given, but larger examples do enjoy such speedup factors. Nonetheless, it is a major change. We can further assess this by disabling padding via setting {{pad = 0}} in the example script, and then editing {{conv::forward(...)}} as {code} ... # Convolution - im2col implementation parfor (n in 1:N, log=DEBUG) { # all examples Xn = matrix(X[n,], rows=C, cols=Hin*Win) # reshape # Pad image #Xn_padded = util::pad_image(Xn, Hin, Win, padh, padw) # shape (C, (Hin+2*padh)*(Win+2*padw)) Xn_padded = Xn # Extract local image patches into columns with im2col, of shape (C*Hf*Wf, Hout*Wout) Xn_padded_cols = util::im2col(Xn_padded, Hin+2*padh, Win+2*padw, Hf, Wf, strideh, stridew) # Convolve patches with filters outn = W %*% Xn_padded_cols + b # shape (F, Hout*Wout) out[n,] = matrix(outn, rows=1, cols=F*Hout*Wout) # reshape } ... {code}, which then gives the following performance {code} 16/04/18 11:31:02 INFO api.DMLScript: SystemML Statistics: Total elapsed time: 1.054 sec. Total compilation time: 0.669 sec. Total execution time: 0.385 sec. Number of compiled MR Jobs: 0. Number of executed MR Jobs: 0. Cache hits (Mem, WB, FS, HDFS): 651/0/0/0. Cache writes (WB, FS, HDFS): 452/0/0. Cache times (ACQr/m, RLS, EXP): 0.003/0.002/0.325/0.000 sec. HOP DAGs recompiled (PRED, SB): 0/0. HOP DAGs recompile time: 0.001 sec. Functions recompiled: 1. Functions recompile time: 0.009 sec. ParFor loops optimized: 1. ParFor optimize time: 0.017 sec. ParFor initialize time: 0.017 sec. ParFor result merge time: 0.018 sec. ParFor total update in-place: 0/0/64 Total JIT compile time: 1.313 sec. Total JVM GC count: 6. Total JVM GC time: 0.053 sec. Heavy hitter instructions (name, time, count): -- 1) leftIndex 0.514 sec 64 -- 2) ba+* 0.433 sec 64 -- 3) forward 0.286 sec 1 -- 4) im2colext_t13 0.095 sec 9 -- 5) + 0.088 sec 201 -- 6) im2colext_t10 0.083 sec 7 -- 7) im2colext_t9 0.073 sec 9 -- 8) im2colext_t16 0.067 sec 9 -- 9) im2colext_t12 0.064 sec 8 -- 10) rand 0.064 sec 4 -- 11) im2colext_t11 0.058 sec 7 -- 12) im2colext_t14 0.057 sec 7 -- 13) im2colext_t15 0.057 sec 8 -- 14) rangeReIndex 0.043 sec 64 -- 15) rshape 0.023 sec 128 -- 16) rmvar 0.018 sec 1200 -- 17) uak+ 0.012 sec 1 -- 18) createvar 0.004 sec 389 -- 19) * 0.003 sec 325 -- 20) cpvar 0.002 sec 132 -- 21) ncol 0.001 sec 129 -- 22) nrow 0.000 sec 3 -- 23) print 0.000 sec 5 -- 24) sqrt 0.000 sec 1 -- 25) assignvar 0.000 sec 11 -- 26) castvti 0.000 sec 2 -- 27) / 0.000 sec 2 -- 28) - 0.000 sec 2 {code}. Essentially, the single left-indexing operation at the end of the parfor loop in the modified {{conv::forward(...)}} takes more time than the matrix multiplication above it. Larger examples also tease this out further. Now, let's try using the equivalent DML implementation of {{util::im2col(...)}} as compared with the above Java external function: {code} im2col = function(matrix[double] img, int Hin, int Win, int Hf, int Wf, int strideh, int stridew) return (matrix[double] img_cols) { /* * Rearrange local image regions (patches) into columns. * * Assumes image has already been padded as necessary. * * Inputs: * - img: Input image, of shape (C, Hin*Win), where C is the number * of input channels (depth). * - Hin: Input height, including padding. * - Win: Input width, including padding. * - Hf: Filter height. * - Wf: Filter width. * - strideh: Stride over height. * - stridew: Stride over width. * * Outputs: * - img_cols: Local spatial regions (patches) of the image stretched * out into columns, of shape (C*Hf*Wf, Hout*Wout). */ C = nrow(img) Hout = as.integer((Hin - Hf) / strideh + 1) Wout = as.integer((Win - Wf) / stridew + 1) img_cols = matrix(0, rows=C*Hf*Wf, cols=Hout*Wout) # zeros for (c in 1:C) { # all channels for (hf in 1:Hf) { # all filter rows for (wf in 1:Wf) { # all filter columns for (hout in 1:Hout) { # all output rows for (wout in 1:Wout) { # all output columns out_row = (c-1)*Hf*Wf + (hf-1)*Wf + wf out_col = (hout-1)*Wout + wout hin = (hout-1) * strideh + hf win = (wout-1) * stridew + wf in_row = c in_col = (hin-1)*Win + win img_cols[out_row, out_col] = img[c, in_col] } } } } } } {code} {code} // performance to be added once the code finishes running {code} Overall, the point of this thread of focus is that as a user and given the premise of SystemML, I might expect to get roughly equal performance for a simple for-loop implementation in DML as compared to another language, and given that, I would expect increased performance when moving to a parfor implementation. Particularly in this case, the implementations in the other deep learning libraries keep it simple and take a similar for-loop approach and are able to maintain high performance, so a user might be even more inclined to have the previous thoughts with SystemML. was (Author: mwdus...@us.ibm.com): To continue on the thread of the original slow left-indexing, I'd like to discuss more details of the briefly mentioned experiment with moving {{util::im2col(...)}} to an external Java function. I've attached the full {{Im2colWrapper.java}} class which can be placed in the {{org.apache.sysml.udf.lib}} package and compiled with the rest of SystemML. Below is a snippet of the main portion of the code: {code} Matrix imgM = (Matrix) getFunctionInput(0); double[][] img = imgM.getMatrixAsDoubleArray(); int C = (int) imgM.getNumRows(); int Hin = Integer.parseInt(((Scalar)getFunctionInput(1)).getValue()); int Win = Integer.parseInt(((Scalar)getFunctionInput(2)).getValue()); int Hf = Integer.parseInt(((Scalar)getFunctionInput(3)).getValue()); int Wf = Integer.parseInt(((Scalar)getFunctionInput(4)).getValue()); int strideh = Integer.parseInt(((Scalar)getFunctionInput(5)).getValue()); int stridew = Integer.parseInt(((Scalar) getFunctionInput(6)).getValue()); int Hout = (Hin - Hf) / strideh + 1; int Wout = (Win - Wf) / stridew + 1; double[][] img_cols = new double[C*Hf*Wf][Hout*Wout]; for (int c=0; c<C; c++) { // all channels for (int hf=0; hf<Hf; hf++) { // all filter rows for (int wf=0; wf<Wf; wf++) { // all filter columns for (int hout=0; hout<Hout; hout++) { // all output rows for (int wout=0; wout<Wout; wout++) { // all output columns int out_row = c*Hf*Wf + hf*Wf + wf; int out_col = hout*Wout + wout; int hin = hout*strideh + hf; int win = wout*stridew + wf; int in_col = hin*Win + win; img_cols[out_row][out_col] = img[c][in_col]; } } } } } {code} As you can see, it's a very simple, nested for-loop implementation, without much regard for performance (interestingly, it's also quite similar to existing performant implementations in the popular deep learning libraries, such as [Caffe's im2col|https://github.com/BVLC/caffe/blob/master/src/caffe/util/im2col.cpp#L19]). If we compile with the above Java class, and replace {{util::im2col(...)}} with an external function declaration, {code} im2col = externalFunction(matrix[double] img, int Hin, int Win, int Hf, int Wf, int strideh, int stridew) return (matrix[double] img_cols) implemented in (classname="org.apache.sysml.udf.lib.Im2colWrapper",exectype="mem") {code}, we will find that we can achieve a major speedup: {code} 16/04/18 10:48:52 INFO api.DMLScript: SystemML Statistics: Total elapsed time: 1.561 sec. Total compilation time: 0.757 sec. Total execution time: 0.804 sec. Number of compiled MR Jobs: 0. Number of executed MR Jobs: 0. Cache hits (Mem, WB, FS, HDFS): 14091/0/0/0. Cache writes (WB, FS, HDFS): 10052/0/0. Cache times (ACQr/m, RLS, EXP): 0.021/0.021/1.096/0.000 sec. HOP DAGs recompiled (PRED, SB): 0/0. HOP DAGs recompile time: 0.004 sec. Functions recompiled: 65. Functions recompile time: 0.480 sec. ParFor loops optimized: 1. ParFor optimize time: 0.026 sec. ParFor initialize time: 0.024 sec. ParFor result merge time: 0.020 sec. ParFor total update in-place: 0/0/3904 Total JIT compile time: 2.468 sec. Total JVM GC count: 8. Total JVM GC time: 0.062 sec. Heavy hitter instructions (name, time, count): -- 1) leftIndex 2.121 sec 3904 -- 2) forward 0.694 sec 1 -- 3) ba+* 0.416 sec 64 -- 4) pad_image_t17 0.396 sec 8 -- 5) pad_image_t20 0.380 sec 9 -- 6) pad_image_t18 0.365 sec 8 -- 7) pad_image_t23 0.357 sec 7 -- 8) pad_image_t22 0.348 sec 8 -- 9) pad_image_t19 0.328 sec 8 -- 10) pad_image_t21 0.328 sec 8 -- 11) pad_image_t24 0.322 sec 8 -- 12) rand 0.207 sec 1988 -- 13) rshape 0.197 sec 3968 -- 14) rangeReIndex 0.153 sec 1984 -- 15) im2colext_t24 0.129 sec 8 -- 16) im2colext_t22 0.121 sec 8 -- 17) im2colext_t21 0.114 sec 8 -- 18) im2colext_t23 0.108 sec 7 -- 19) im2colext_t20 0.100 sec 9 -- 20) + 0.098 sec 11783 -- 21) im2colext_t18 0.096 sec 8 -- 22) im2colext_t17 0.092 sec 8 -- 23) im2colext_t19 0.089 sec 8 -- 24) rmvar 0.066 sec 36107 -- 25) createvar 0.046 sec 11973 -- 26) uak+ 0.019 sec 1 -- 27) ncol 0.015 sec 3969 -- 28) * 0.013 sec 6178 -- 29) cpvar 0.003 sec 2116 -- 30) nrow 0.000 sec 67 -- 31) print 0.000 sec 5 -- 32) assignvar 0.000 sec 75 -- 33) sqrt 0.000 sec 1 -- 34) castvti 0.000 sec 2 -- 35) / 0.000 sec 2 -- 36) - 0.000 sec 2 {code}. The given example is not quite large enough to allow for the full *100x* speedup originally given, but larger examples do enjoy such speedup factors. Nonetheless, it is a major change. We can further assess this by disabling padding via setting {{pad = 0}} in the example script, and then editing {{conv::forward(...)}} as {code} ... # Convolution - im2col implementation parfor (n in 1:N, log=DEBUG) { # all examples Xn = matrix(X[n,], rows=C, cols=Hin*Win) # reshape # Pad image #Xn_padded = util::pad_image(Xn, Hin, Win, padh, padw) # shape (C, (Hin+2*padh)*(Win+2*padw)) Xn_padded = Xn # Extract local image patches into columns with im2col, of shape (C*Hf*Wf, Hout*Wout) Xn_padded_cols = util::im2col(Xn_padded, Hin+2*padh, Win+2*padw, Hf, Wf, strideh, stridew) # Convolve patches with filters outn = W %*% Xn_padded_cols + b # shape (F, Hout*Wout) out[n,] = matrix(outn, rows=1, cols=F*Hout*Wout) # reshape } ... {code}, which then gives the following performance {code} 16/04/18 11:31:02 INFO api.DMLScript: SystemML Statistics: Total elapsed time: 1.054 sec. Total compilation time: 0.669 sec. Total execution time: 0.385 sec. Number of compiled MR Jobs: 0. Number of executed MR Jobs: 0. Cache hits (Mem, WB, FS, HDFS): 651/0/0/0. Cache writes (WB, FS, HDFS): 452/0/0. Cache times (ACQr/m, RLS, EXP): 0.003/0.002/0.325/0.000 sec. HOP DAGs recompiled (PRED, SB): 0/0. HOP DAGs recompile time: 0.001 sec. Functions recompiled: 1. Functions recompile time: 0.009 sec. ParFor loops optimized: 1. ParFor optimize time: 0.017 sec. ParFor initialize time: 0.017 sec. ParFor result merge time: 0.018 sec. ParFor total update in-place: 0/0/64 Total JIT compile time: 1.313 sec. Total JVM GC count: 6. Total JVM GC time: 0.053 sec. Heavy hitter instructions (name, time, count): -- 1) leftIndex 0.514 sec 64 -- 2) ba+* 0.433 sec 64 -- 3) forward 0.286 sec 1 -- 4) im2colext_t13 0.095 sec 9 -- 5) + 0.088 sec 201 -- 6) im2colext_t10 0.083 sec 7 -- 7) im2colext_t9 0.073 sec 9 -- 8) im2colext_t16 0.067 sec 9 -- 9) im2colext_t12 0.064 sec 8 -- 10) rand 0.064 sec 4 -- 11) im2colext_t11 0.058 sec 7 -- 12) im2colext_t14 0.057 sec 7 -- 13) im2colext_t15 0.057 sec 8 -- 14) rangeReIndex 0.043 sec 64 -- 15) rshape 0.023 sec 128 -- 16) rmvar 0.018 sec 1200 -- 17) uak+ 0.012 sec 1 -- 18) createvar 0.004 sec 389 -- 19) * 0.003 sec 325 -- 20) cpvar 0.002 sec 132 -- 21) ncol 0.001 sec 129 -- 22) nrow 0.000 sec 3 -- 23) print 0.000 sec 5 -- 24) sqrt 0.000 sec 1 -- 25) assignvar 0.000 sec 11 -- 26) castvti 0.000 sec 2 -- 27) / 0.000 sec 2 -- 28) - 0.000 sec 2 {code}. Essentially, the single left-indexing operation at the end of the parfor loop in the modified {{conv::forward(...)}} takes more time than the matrix multiplication above it. Larger examples also tease this out further. Now, let's try using the equivalent DML implementation of {{util::im2col(...)}} as compared with the above Java external function: {code} im2col = function(matrix[double] img, int Hin, int Win, int Hf, int Wf, int strideh, int stridew) return (matrix[double] img_cols) { /* * Rearrange local image regions (patches) into columns. * * Assumes image has already been padded as necessary. * * Inputs: * - img: Input image, of shape (C, Hin*Win), where C is the number * of input channels (depth). * - Hin: Input height, including padding. * - Win: Input width, including padding. * - Hf: Filter height. * - Wf: Filter width. * - strideh: Stride over height. * - stridew: Stride over width. * * Outputs: * - img_cols: Local spatial regions (patches) of the image stretched * out into columns, of shape (C*Hf*Wf, Hout*Wout). */ C = nrow(img) Hout = as.integer((Hin - Hf) / strideh + 1) Wout = as.integer((Win - Wf) / stridew + 1) img_cols = matrix(0, rows=C*Hf*Wf, cols=Hout*Wout) # zeros for (c in 1:C) { # all channels for (hf in 1:Hf) { # all filter rows for (wf in 1:Wf) { # all filter columns for (hout in 1:Hout) { # all output rows for (wout in 1:Wout) { # all output columns out_row = (c-1)*Hf*Wf + (hf-1)*Wf + wf out_col = (hout-1)*Wout + wout hin = (hout-1) * strideh + hf win = (wout-1) * stridew + wf in_row = c in_col = (hin-1)*Win + win img_cols[out_row, out_col] = img[c, in_col] } } } } } } {code} // performance to be added once the code finishes running {code} Overall, the point of this thread of focus is that as a user and given the premise of SystemML, I might expect to get roughly equal performance for a simple for-loop implementation in DML as compared to another language, and given that, I would expect increased performance when moving to a parfor implementation. Particularly in this case, the implementations in the other deep learning libraries keep it simple and take a similar for-loop approach and are able to maintain high performance, so a user might be even more inclined to have the previous thoughts with SystemML. > Improve Left-Indexing Performance with (Nested) Parfor Loops > ------------------------------------------------------------ > > Key: SYSTEMML-633 > URL: https://issues.apache.org/jira/browse/SYSTEMML-633 > Project: SystemML > Issue Type: Improvement > Components: ParFor > Reporter: Mike Dusenberry > Attachments: Im2colWrapper.java, log.txt > > > In the experimental deep learning DML library I've been building > ([https://github.com/dusenberrymw/systemml-nn|https://github.com/dusenberrymw/systemml-nn]), > I've experienced severe bottlenecks due to *left-indexing* in parfor loops. > Here, I will highlight a few particular instances with simplified examples, > but the same issue is shared across many areas of the library, particularly > in the convolution and max pooling layers, and is exaggerated in real > use-cases. > *Quick note* on setup for any of the below experiments. Please grab a copy > of the above repo (particularly the {{nn}} directory), and run any > experiments with the {{nn}} package available at the base directory of the > experiment. > Scenario: *Convolution* > * In the library above, the forward pass of the convolution function > ([{{conv::forward(...)}} | > https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L8] > in {{nn/layers/conv.dml}}) essentially accepts a matrix {{X}} of images, a > matrix of weights {{W}}, and several other parameters corresponding to image > sizes, filter sizes, etc. It then loops through the images with a {{parfor}} > loop, and for each image it pads the image with {{util::pad_image}}, extracts > "patches" of the image into columns of a matrix in a sliding fashion across > the image with {{util::im2col}}, performs a matrix multiplication between the > matrix of patch columns and the weight matrix, and then saves the result into > a matrix defined outside of the parfor loop using left-indexing. > * Left-indexing has been identified as the bottleneck by a wide margin. > * Left-indexing is used in the main {{conv::forward(...)}} function in the > [last line in the parfor > loop|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L61], > in the > [{{util::pad_image(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L196] > function used by {{conv::forward(...)}}, as well as in the > [{{util::im2col(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L96] > function used by {{conv::forward(...)}}. > * Test script (assuming the {{nn}} package is available): > ** {{speed-633.dml}} {code} > source("nn/layers/conv.dml") as conv > source("nn/util.dml") as util > # Generate data > N = 64 # num examples > C = 30 # num channels > Hin = 28 # input height > Win = 28 # input width > F = 20 # num filters > Hf = 3 # filter height > Wf = 3 # filter width > stride = 1 > pad = 1 > X = rand(rows=N, cols=C*Hin*Win) > # Create layer > [W, b] = conv::init(F, C, Hf, Wf) > # Forward > [out, Hout, Wout] = conv::forward(X, W, b, C, Hin, Win, Hf, Wf, stride, > stride, pad, pad) > print("Out: " + nrow(out) + "x" + ncol(out)) > print("Hout: " + Hout) > print("Wout: " + Wout) > print("") > print(sum(out)) > {code} > * Invocation: > ** {{java -jar > $SYSTEMML_HOME/target/systemml-0.10.0-incubating-SNAPSHOT-standalone.jar -f > speed-633.dml -stats -explain -exec singlenode}} > * Stats output (modified to output up to 100 instructions): > ** {code} > ... > Total elapsed time: 26.834 sec. > Total compilation time: 0.529 sec. > Total execution time: 26.304 sec. > Number of compiled MR Jobs: 0. > Number of executed MR Jobs: 0. > Cache hits (Mem, WB, FS, HDFS): 9196235/0/0/0. > Cache writes (WB, FS, HDFS): 3070724/0/0. > Cache times (ACQr/m, RLS, EXP): 1.474/1.120/26.998/0.000 sec. > HOP DAGs recompiled (PRED, SB): 0/0. > HOP DAGs recompile time: 0.268 sec. > Functions recompiled: 129. > Functions recompile time: 0.841 sec. > ParFor loops optimized: 1. > ParFor optimize time: 0.032 sec. > ParFor initialize time: 0.015 sec. > ParFor result merge time: 0.028 sec. > ParFor total update in-place: 0/0/1559360 > Total JIT compile time: 14.235 sec. > Total JVM GC count: 94. > Total JVM GC time: 0.366 sec. > Heavy hitter instructions (name, time, count): > -- 1) leftIndex 41.670 sec 1559360 > -- 2) forward 26.212 sec 1 > -- 3) im2col_t45 25.919 sec 8 > -- 4) im2col_t41 25.850 sec 8 > -- 5) im2col_t48 25.831 sec 8 > -- 6) im2col_t43 25.752 sec 8 > -- 7) im2col_t44 25.736 sec 8 > -- 8) im2col_t42 25.695 sec 8 > -- 9) im2col_t47 25.691 sec 8 > -- 10) im2col_t46 25.519 sec 8 > -- 11) rangeReIndex 13.392 sec 3012544 > -- 12) rshape 8.197 sec 3064704 > -- 13) rmvar 4.988 sec 20363504 > -- 14) createvar 4.954 sec 7688965 > -- 15) ncol 1.148 sec 3014529 > -- 16) - 0.961 sec 3112834 > -- 17) + 0.878 sec 3124617 > -- 18) rand 0.839 sec 52228 > -- 19) * 0.480 sec 1764229 > -- 20) cpvar 0.366 sec 1607812 > -- 21) ba+* 0.257 sec 64 > -- 22) pad_image_t42 0.187 sec 8 > -- 23) pad_image_t47 0.181 sec 8 > -- 24) pad_image_t44 0.168 sec 8 > -- 25) pad_image_t46 0.164 sec 8 > -- 26) pad_image_t43 0.156 sec 8 > -- 27) pad_image_t48 0.153 sec 8 > -- 28) pad_image_t45 0.152 sec 8 > -- 29) pad_image_t41 0.152 sec 8 > -- 30) nrow 0.036 sec 50307 > -- 31) assignvar 0.016 sec 52235 > -- 32) uak+ 0.015 sec 1 > -- 33) castvti 0.000 sec 130 > -- 34) print 0.000 sec 5 > -- 35) / 0.000 sec 130 > -- 36) sqrt 0.000 sec 1 > {code} > ** *Full log file attached* (including a {{log=DEBUG}} modification to the > parfor loop in {{conv::forward(...)}}. > ** Note again that {{forward}}, {{im2col}}, and {{pad_image}} all involve > left-indexing. > * Other notes: > ** Further experiments involved replacing the {{util::im2col(...)}} function > with an external Java function using a basic, nested for-loop approach with > no regard for optimization. Compared with the fastest parfor DML version, I > experienced at least a *100x* speed improvement. When compared to the same > naive for-loop approach in DML, the speedup was even greater. > ** Even with this external version of {{im2col}}, and with padding disabled, > the left-indexing within the parfor loop of {{conv::forward(...)}} still > dominated the execution time, acting as the major bottleneck. > ** For all described experiments, logging indicated that parfor update in > place was *not* applied. -- This message was sent by Atlassian JIRA (v6.3.4#6332)