[ https://issues.apache.org/jira/browse/SYSTEMML-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251013#comment-15251013 ]
Mike Dusenberry commented on SYSTEMML-633:
------------------------------------------

For another experiment, I created a script that approximates the idea of the {{conv::forward(...)}} function, particularly trying to isolate the matrix multiplication and the left-indexing in the parfor loop, similar to the previous experiment. In this case, I used no functions, and update-in-place was successfully applied to the parfor loop, thus removing left-indexing as a bottleneck.

{code}
# Generate data
N = 64  # num examples
C = 30  # num channels
Hin = 28  # input height
Win = 28  # input width
F = 20  # num filters
Hf = 3  # filter height
Wf = 3  # filter width
stride = 1; strideh = stride; stridew = stride
pad = 0; padh = pad; padw = pad
max_iterations = 1
batch_size = 64

# Create layer
W = rand(rows=F, cols=C*Hf*Wf, pdf="normal") * sqrt(2.0/(C*Hf*Wf))
b = matrix(0, rows=F, cols=1)

Hout = as.integer((Hin + 2 * padh - Hf) / strideh + 1)
Wout = as.integer((Win + 2 * padw - Wf) / stridew + 1)
X = rand(rows=N, cols=C*Hf*Wf*Hout*Wout)

foo = 0
for (iter in 1:max_iterations) {
  beg = ((iter-1) * batch_size) %% N + 1
  end = min(N, beg + batch_size - 1)
  X_batch = X[beg:end,]

  # Create output volume
  out = matrix(0, rows=N, cols=F*Hout*Wout)

  # Forward
  parfor (n in 1:nrow(X_batch), log=DEBUG) {  # all examples
    # Extract local image patches into columns with im2col, of shape (C*Hf*Wf, Hout*Wout)
    Xn_padded_cols = matrix(X_batch[n,], rows=C*Hf*Wf, cols=Hout*Wout)
    # Convolve patches with filters
    outn = W %*% Xn_padded_cols + b  # shape (F, Hout*Wout)
    out[n,] = matrix(outn, rows=1, cols=F*Hout*Wout)  # reshape
  }

  foo = foo + min(0.1, sum(out))
}
print(foo)
{code}

{code}
16/04/20 17:25:02 INFO api.DMLScript: SystemML Statistics:
Total elapsed time: 0.993 sec.
Total compilation time: 0.278 sec.
Total execution time: 0.715 sec.
Number of compiled MR Jobs: 0.
Number of executed MR Jobs: 0.
Cache hits (Mem, WB, FS, HDFS): 460/0/0/0.
Cache writes (WB, FS, HDFS): 261/0/0.
Cache times (ACQr/m, RLS, EXP): 0.002/0.002/0.025/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0.000 sec.
ParFor loops optimized: 1.
ParFor optimize time: 0.017 sec.
ParFor initialize time: 0.014 sec.
ParFor result merge time: 0.021 sec.
ParFor total update in-place: 0/64/64
Total JIT compile time: 1.164 sec.
Total JVM GC count: 4.
Total JVM GC time: 0.023 sec.
Heavy hitter instructions (name, time, count):
-- 1) ba+* 0.392 sec 64
-- 2) rand 0.322 sec 4
-- 3) leftIndex 0.286 sec 64
-- 4) rangeReIndex 0.234 sec 65
-- 5) rshape 0.031 sec 128
-- 6) uak+ 0.013 sec 1
-- 7) rmvar 0.010 sec 538
-- 8) ncol 0.009 sec 129
-- 9) createvar 0.003 sec 326
-- 10) * 0.002 sec 5
{code}

Increasing {{max_iterations}} to {{100}} and the number of examples {{N}} to {{640}} continues this trend:

{code}
16/04/20 17:28:30 INFO api.DMLScript: SystemML Statistics:
Total elapsed time: 17.928 sec.
Total compilation time: 0.276 sec.
Total execution time: 17.652 sec.
Number of compiled MR Jobs: 0.
Number of executed MR Jobs: 0.
Cache hits (Mem, WB, FS, HDFS): 45901/0/0/0.
Cache writes (WB, FS, HDFS): 25802/1/0.
Cache times (ACQr/m, RLS, EXP): 0.035/0.022/2.894/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0.019 sec.
ParFor loops optimized: 100.
ParFor optimize time: 0.210 sec.
ParFor initialize time: 0.120 sec.
ParFor result merge time: 1.090 sec.
ParFor total update in-place: 0/6400/6400
Total JIT compile time: 10.927 sec.
Total JVM GC count: 55.
Total JVM GC time: 1.356 sec.
Heavy hitter instructions (name, time, count):
-- 1) ba+* 26.353 sec 6400
-- 2) rangeReIndex 16.213 sec 6500
-- 3) leftIndex 10.236 sec 6400
-- 4) rand 4.576 sec 103
-- 5) uak+ 0.509 sec 100
-- 6) rshape 0.210 sec 12800
-- 7) createvar 0.083 sec 32204
-- 8) rmvar 0.080 sec 52711
-- 9) ncol 0.037 sec 12900
-- 10) cpvar 0.009 sec 6603
{code}

> Improve Left-Indexing Performance with (Nested) Parfor Loops
> ------------------------------------------------------------
>
>         Key: SYSTEMML-633
>         URL: https://issues.apache.org/jira/browse/SYSTEMML-633
>     Project: SystemML
>  Issue Type: Improvement
>  Components: ParFor
>    Reporter: Mike Dusenberry
> Attachments: Im2colWrapper.java, log.txt, systemml-nn.zip
>
> In the experimental deep learning DML library I've been building
> ([https://github.com/dusenberrymw/systemml-nn|https://github.com/dusenberrymw/systemml-nn]),
> I've experienced severe bottlenecks due to *left-indexing* in parfor loops.
> Here, I will highlight a few particular instances with simplified examples,
> but the same issue is shared across many areas of the library, particularly
> in the convolution and max pooling layers, and is exacerbated in real
> use cases.
>
> *Quick note* on setup for any of the experiments below: please grab a copy
> of the above repo (particularly the {{nn}} directory), and run any
> experiment with the {{nn}} package available at the base directory of the
> experiment.
>
> Scenario: *Convolution*
> * In the library above, the forward pass of the convolution function
> ([{{conv::forward(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L8]
> in {{nn/layers/conv.dml}}) essentially accepts a matrix {{X}} of images, a
> matrix of weights {{W}}, and several other parameters corresponding to image
> sizes, filter sizes, etc. It then loops through the images with a {{parfor}}
> loop, and for each image it pads the image with {{util::pad_image}}, extracts
> "patches" of the image into columns of a matrix in a sliding fashion across
> the image with {{util::im2col}}, performs a matrix multiplication between the
> matrix of patch columns and the weight matrix, and then saves the result into
> a matrix defined outside of the parfor loop using left-indexing.
> * Left-indexing has been identified as the bottleneck by a wide margin.
> * Left-indexing is used in the main {{conv::forward(...)}} function in the
> [last line of the parfor loop|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L61],
> in the
> [{{util::pad_image(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L196]
> function used by {{conv::forward(...)}}, as well as in the
> [{{util::im2col(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L96]
> function used by {{conv::forward(...)}}.
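As a rough illustration of the im2col scheme described above: each image is unrolled into a matrix of patch columns so the convolution becomes a single matrix multiply per image, whose result is then written back into the output matrix (the left-indexing step). The following NumPy sketch is illustrative only, not SystemML code, and its names ({{im2col}}, sizes, etc.) are hypothetical stand-ins for the DML functions:

```python
import numpy as np

def im2col(img, C, Hin, Win, Hf, Wf, stride):
    """Extract sliding (C*Hf*Wf)-sized patches of one image into columns.

    img: flattened image of shape (C*Hin*Win,).
    Returns a matrix of shape (C*Hf*Wf, Hout*Wout).
    Naive loop version, mirroring the idea of util::im2col (names hypothetical).
    """
    x = img.reshape(C, Hin, Win)
    Hout = (Hin - Hf) // stride + 1
    Wout = (Win - Wf) // stride + 1
    cols = np.empty((C * Hf * Wf, Hout * Wout))
    for i in range(Hout):
        for j in range(Wout):
            patch = x[:, i*stride:i*stride+Hf, j*stride:j*stride+Wf]
            cols[:, i * Wout + j] = patch.ravel()
    return cols

# Toy sizes (no padding, for brevity)
N, C, Hin, Win, F, Hf, Wf, stride = 2, 3, 5, 5, 4, 3, 3, 1
Hout = (Hin - Hf) // stride + 1
Wout = (Win - Wf) // stride + 1

rng = np.random.default_rng(0)
X = rng.standard_normal((N, C * Hin * Win))   # N flattened images
W = rng.standard_normal((F, C * Hf * Wf))     # filters as rows
b = np.zeros((F, 1))

out = np.zeros((N, F * Hout * Wout))
for n in range(N):                            # the parfor loop in the DML version
    cols = im2col(X[n], C, Hin, Win, Hf, Wf, stride)  # (C*Hf*Wf, Hout*Wout)
    outn = W @ cols + b                       # convolution as matmul: (F, Hout*Wout)
    out[n, :] = outn.ravel()                  # the left-indexing write-back
```

The per-example row write in the last line is the operation that, in the DML version, triggers the left-indexing (and ideally the parfor update-in-place) behavior under discussion.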
> * Test script (assuming the {{nn}} package is available):
> ** {{speed-633.dml}}
> {code}
> source("nn/layers/conv.dml") as conv
> source("nn/util.dml") as util
>
> # Generate data
> N = 64  # num examples
> C = 30  # num channels
> Hin = 28  # input height
> Win = 28  # input width
> F = 20  # num filters
> Hf = 3  # filter height
> Wf = 3  # filter width
> stride = 1
> pad = 1
> X = rand(rows=N, cols=C*Hin*Win)
>
> # Create layer
> [W, b] = conv::init(F, C, Hf, Wf)
>
> # Forward
> [out, Hout, Wout] = conv::forward(X, W, b, C, Hin, Win, Hf, Wf, stride,
>                                   stride, pad, pad)
>
> print("Out: " + nrow(out) + "x" + ncol(out))
> print("Hout: " + Hout)
> print("Wout: " + Wout)
> print("")
> print(sum(out))
> {code}
> * Invocation:
> ** {{java -jar $SYSTEMML_HOME/target/systemml-0.10.0-incubating-SNAPSHOT-standalone.jar -f speed-633.dml -stats -explain -exec singlenode}}
> * Stats output (modified to output up to 100 instructions):
> ** {code}
> ...
> Total elapsed time: 26.834 sec.
> Total compilation time: 0.529 sec.
> Total execution time: 26.304 sec.
> Number of compiled MR Jobs: 0.
> Number of executed MR Jobs: 0.
> Cache hits (Mem, WB, FS, HDFS): 9196235/0/0/0.
> Cache writes (WB, FS, HDFS): 3070724/0/0.
> Cache times (ACQr/m, RLS, EXP): 1.474/1.120/26.998/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/0.
> HOP DAGs recompile time: 0.268 sec.
> Functions recompiled: 129.
> Functions recompile time: 0.841 sec.
> ParFor loops optimized: 1.
> ParFor optimize time: 0.032 sec.
> ParFor initialize time: 0.015 sec.
> ParFor result merge time: 0.028 sec.
> ParFor total update in-place: 0/0/1559360
> Total JIT compile time: 14.235 sec.
> Total JVM GC count: 94.
> Total JVM GC time: 0.366 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) leftIndex 41.670 sec 1559360
> -- 2) forward 26.212 sec 1
> -- 3) im2col_t45 25.919 sec 8
> -- 4) im2col_t41 25.850 sec 8
> -- 5) im2col_t48 25.831 sec 8
> -- 6) im2col_t43 25.752 sec 8
> -- 7) im2col_t44 25.736 sec 8
> -- 8) im2col_t42 25.695 sec 8
> -- 9) im2col_t47 25.691 sec 8
> -- 10) im2col_t46 25.519 sec 8
> -- 11) rangeReIndex 13.392 sec 3012544
> -- 12) rshape 8.197 sec 3064704
> -- 13) rmvar 4.988 sec 20363504
> -- 14) createvar 4.954 sec 7688965
> -- 15) ncol 1.148 sec 3014529
> -- 16) - 0.961 sec 3112834
> -- 17) + 0.878 sec 3124617
> -- 18) rand 0.839 sec 52228
> -- 19) * 0.480 sec 1764229
> -- 20) cpvar 0.366 sec 1607812
> -- 21) ba+* 0.257 sec 64
> -- 22) pad_image_t42 0.187 sec 8
> -- 23) pad_image_t47 0.181 sec 8
> -- 24) pad_image_t44 0.168 sec 8
> -- 25) pad_image_t46 0.164 sec 8
> -- 26) pad_image_t43 0.156 sec 8
> -- 27) pad_image_t48 0.153 sec 8
> -- 28) pad_image_t45 0.152 sec 8
> -- 29) pad_image_t41 0.152 sec 8
> -- 30) nrow 0.036 sec 50307
> -- 31) assignvar 0.016 sec 52235
> -- 32) uak+ 0.015 sec 1
> -- 33) castvti 0.000 sec 130
> -- 34) print 0.000 sec 5
> -- 35) / 0.000 sec 130
> -- 36) sqrt 0.000 sec 1
> {code}
> ** *Full log file attached* (including a {{log=DEBUG}} modification to the
> parfor loop in {{conv::forward(...)}}).
> ** Note again that {{forward}}, {{im2col}}, and {{pad_image}} all involve
> left-indexing.
> * Other notes:
> ** Further experiments involved replacing the {{util::im2col(...)}} function
> with an external Java function using a basic, nested for-loop approach with
> no regard for optimization. Compared with the fastest parfor DML version, I
> experienced at least a *100x* speed improvement. When compared to the same
> naive for-loop approach in DML, the speedup was even greater.
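One restructuring that sidesteps per-example left-indexing entirely is to fold the per-image matrix multiplies into a single batched tensor contraction, so no row-by-row write-back is needed. The sketch below demonstrates the algebraic equivalence in NumPy; it is illustrative only, not SystemML code, and says nothing about whether the SystemML optimizer could apply such a rewrite automatically:

```python
import numpy as np

# Toy sizes: K = C*Hf*Wf (patch length), P = Hout*Wout (patches per image)
N, K, P, F = 4, 27, 16, 5
rng = np.random.default_rng(1)
X = rng.standard_normal((N, K * P))   # each row already holds im2col'ed patches
W = rng.standard_normal((F, K))
b = np.zeros((F, 1))

# Loop formulation: one matmul plus one left-indexed row write per example
out_loop = np.zeros((N, F * P))
for n in range(N):
    out_loop[n, :] = (W @ X[n].reshape(K, P) + b).ravel()

# Batched formulation: a single tensor contraction, no per-row writes
out_batch = (np.einsum('fk,nkp->nfp', W, X.reshape(N, K, P)) + b).reshape(N, F * P)

assert np.allclose(out_loop, out_batch)
```

Both produce identical results; the batched form simply trades N small matmuls and N indexed writes for one contraction over the whole batch.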
> ** Even with this external version of {{im2col}}, and with padding disabled,
> the left-indexing within the parfor loop of {{conv::forward(...)}} still
> dominated the execution time, acting as the major bottleneck.
> ** For all described experiments, logging indicated that parfor update
> in-place was *not* applied.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)