[ https://issues.apache.org/jira/browse/SYSTEMML-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15338798#comment-15338798 ]
Matthias Boehm commented on SYSTEMML-633:
-----------------------------------------

OK, here are the major findings from a more detailed look into this issue:

(1) Update-in-place: The new basic loop update-in-place rewrite was not applied here because its pattern was unnecessarily restrictive. I have now extended the conditions for this rewrite to cover safe operations such as nrow/ncol, as well as statement blocks without access to the update-in-place variables. In addition, parfor local execution now also supports these kinds of (not pinned) update-in-place variables.

(2) Row-major left-indexing in im2col: To really benefit from update-in-place, I recommend changing the im2col function to something like the following snippet, which uses row-major updates on img_cols. This avoids repeated shifting in the presence of CSR representations (which are necessary to avoid serialization per update). The row-major update is always beneficial, as it also avoids writing multiple times to the same cache line for dense representations.

{code}
im2col = function(matrix[double] img, int Hin, int Win, int Hf, int Wf,
                  int strideh, int stridew)
    return (matrix[double] img_cols) {
  C = nrow(img)
  Hout = as.integer((Hin - Hf) / strideh + 1)
  Wout = as.integer((Win - Wf) / stridew + 1)
  #img_cols = matrix(0, rows=C*Hf*Wf, cols=Hout*Wout)  # zeros
  img_cols = matrix(0, rows=Hout*Wout, cols=C*Hf*Wf)  # zeros

  parfor (hout in 1:Hout, check=0) {  # all output rows
    hin = (hout-1) * strideh + 1
    parfor (wout in 1:Wout, check=0) {  # all output columns
      win = (wout-1) * stridew + 1
      # Extract a local patch of the input image corresponding
      # spatially to the filter sizes.
      img_patch = matrix(0, rows=C, cols=Hf*Wf)  # zeros
      parfor (c in 1:C) {  # all channels
        img_slice = matrix(img[c,], rows=Hin, cols=Win)  # reshape
        img_patch[c,] = matrix(img_slice[hin:hin+Hf-1, win:win+Wf-1], rows=1, cols=Hf*Wf)
      }
      #img_cols[,(hout-1)*Wout + wout] = matrix(img_patch, rows=C*Hf*Wf, cols=1)  # reshape
      img_cols[(hout-1)*Wout + wout,] = t(matrix(img_patch, rows=C*Hf*Wf, cols=1))  # reshape
    }
  }
  img_cols = t(img_cols);
}
{code}

(3) Statistics: Due to the very fine-grained accesses, the maintenance of statistics actually became a major bottleneck (thread contention due to multi-threaded updates). Accordingly, I recommend trying things out with -stats enabled, but disabling statistics maintenance for the final experiments.

Points (2) and (3) can potentially be handled automatically in the future via additional rewrites and thread-local statistics maintenance, but for now they need to be done by hand.

> Improve Left-Indexing Performance with (Nested) Parfor Loops in UDFs
> --------------------------------------------------------------------
>
>                 Key: SYSTEMML-633
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-633
>             Project: SystemML
>          Issue Type: Improvement
>          Components: ParFor
>            Reporter: Mike Dusenberry
>            Priority: Blocker
>         Attachments: Im2colWrapper.java, log.txt, log.txt, log_06.11.16.txt, perf-dml.dml, perf-tests.tar.gz, perf-tf.py, perf.sh, run.sh, systemml-nn-05.16.16.zip, systemml-nn.zip, time.txt, time_06.11.16.txt
>
> In the experimental deep learning DML library I've been building ([https://github.com/dusenberrymw/systemml-nn|https://github.com/dusenberrymw/systemml-nn]), I've experienced severe bottlenecks due to *left-indexing* in parfor loops. Here, I will highlight a few particular instances with simplified examples, but the same issue is shared across many areas of the library, particularly in the convolution and max-pooling layers, and is exacerbated in real use cases.
>
> *Quick note* on setup for any of the below experiments.
> Please grab a copy of the above repo (particularly the {{nn}} directory), and run any experiments with the {{nn}} package available at the base directory of the experiment.
>
> Scenario: *Convolution*
> * In the library above, the forward pass of the convolution function ([{{conv::forward(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L8] in {{nn/layers/conv.dml}}) essentially accepts a matrix {{X}} of images, a matrix of weights {{W}}, and several other parameters corresponding to image sizes, filter sizes, etc. It then loops through the images with a {{parfor}} loop; for each image, it pads the image with {{util::pad_image}}, extracts "patches" of the image into columns of a matrix in a sliding fashion across the image with {{util::im2col}}, performs a matrix multiplication between the matrix of patch columns and the weight matrix, and then saves the result into a matrix defined outside of the parfor loop using left-indexing.
> * Left-indexing has been identified as the bottleneck by a wide margin.
> * Left-indexing is used in the main {{conv::forward(...)}} function in the [last line in the parfor loop|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L61], in the [{{util::pad_image(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L196] function used by {{conv::forward(...)}}, as well as in the [{{util::im2col(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L96] function used by {{conv::forward(...)}}.
> * Test script (assuming the {{nn}} package is available):
> ** {{speed-633.dml}}
> {code}
> source("nn/layers/conv.dml") as conv
> source("nn/util.dml") as util
>
> # Generate data
> N = 64    # num examples
> C = 30    # num channels
> Hin = 28  # input height
> Win = 28  # input width
> F = 20    # num filters
> Hf = 3    # filter height
> Wf = 3    # filter width
> stride = 1
> pad = 1
> X = rand(rows=N, cols=C*Hin*Win)
>
> # Create layer
> [W, b] = conv::init(F, C, Hf, Wf)
>
> # Forward
> [out, Hout, Wout] = conv::forward(X, W, b, C, Hin, Win, Hf, Wf, stride, stride, pad, pad)
>
> print("Out: " + nrow(out) + "x" + ncol(out))
> print("Hout: " + Hout)
> print("Wout: " + Wout)
> print("")
> print(sum(out))
> {code}
> * Invocation:
> ** {{java -jar $SYSTEMML_HOME/target/systemml-0.10.0-incubating-SNAPSHOT-standalone.jar -f speed-633.dml -stats -explain -exec singlenode}}
> * Stats output (modified to output up to 100 instructions):
> ** {code}
> ...
> Total elapsed time: 26.834 sec.
> Total compilation time: 0.529 sec.
> Total execution time: 26.304 sec.
> Number of compiled MR Jobs: 0.
> Number of executed MR Jobs: 0.
> Cache hits (Mem, WB, FS, HDFS): 9196235/0/0/0.
> Cache writes (WB, FS, HDFS): 3070724/0/0.
> Cache times (ACQr/m, RLS, EXP): 1.474/1.120/26.998/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/0.
> HOP DAGs recompile time: 0.268 sec.
> Functions recompiled: 129.
> Functions recompile time: 0.841 sec.
> ParFor loops optimized: 1.
> ParFor optimize time: 0.032 sec.
> ParFor initialize time: 0.015 sec.
> ParFor result merge time: 0.028 sec.
> ParFor total update in-place: 0/0/1559360
> Total JIT compile time: 14.235 sec.
> Total JVM GC count: 94.
> Total JVM GC time: 0.366 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) leftIndex 41.670 sec 1559360
> -- 2) forward 26.212 sec 1
> -- 3) im2col_t45 25.919 sec 8
> -- 4) im2col_t41 25.850 sec 8
> -- 5) im2col_t48 25.831 sec 8
> -- 6) im2col_t43 25.752 sec 8
> -- 7) im2col_t44 25.736 sec 8
> -- 8) im2col_t42 25.695 sec 8
> -- 9) im2col_t47 25.691 sec 8
> -- 10) im2col_t46 25.519 sec 8
> -- 11) rangeReIndex 13.392 sec 3012544
> -- 12) rshape 8.197 sec 3064704
> -- 13) rmvar 4.988 sec 20363504
> -- 14) createvar 4.954 sec 7688965
> -- 15) ncol 1.148 sec 3014529
> -- 16) - 0.961 sec 3112834
> -- 17) + 0.878 sec 3124617
> -- 18) rand 0.839 sec 52228
> -- 19) * 0.480 sec 1764229
> -- 20) cpvar 0.366 sec 1607812
> -- 21) ba+* 0.257 sec 64
> -- 22) pad_image_t42 0.187 sec 8
> -- 23) pad_image_t47 0.181 sec 8
> -- 24) pad_image_t44 0.168 sec 8
> -- 25) pad_image_t46 0.164 sec 8
> -- 26) pad_image_t43 0.156 sec 8
> -- 27) pad_image_t48 0.153 sec 8
> -- 28) pad_image_t45 0.152 sec 8
> -- 29) pad_image_t41 0.152 sec 8
> -- 30) nrow 0.036 sec 50307
> -- 31) assignvar 0.016 sec 52235
> -- 32) uak+ 0.015 sec 1
> -- 33) castvti 0.000 sec 130
> -- 34) print 0.000 sec 5
> -- 35) / 0.000 sec 130
> -- 36) sqrt 0.000 sec 1
> {code}
> ** *Full log file attached* (including a {{log=DEBUG}} modification to the parfor loop in {{conv::forward(...)}}).
> ** Note again that {{forward}}, {{im2col}}, and {{pad_image}} all involve left-indexing.
> * Other notes:
> ** Further experiments involved replacing the {{util::im2col(...)}} function with an external Java function using a basic, nested for-loop approach with no regard for optimization. Compared with the fastest parfor DML version, I experienced at least a *100x* speed improvement. When compared to the same naive for-loop approach in DML, the speedup was even greater.
> ** Even with this external version of {{im2col}}, and with padding disabled, the left-indexing within the parfor loop of {{conv::forward(...)}} still dominated the execution time, acting as the major bottleneck.
> ** For all described experiments, logging indicated that parfor update-in-place was *not* applied.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
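As a side note on point (2) of the comment above, the row-major access pattern can be illustrated outside of SystemML with a minimal, hypothetical Python sketch (pure Python, no SystemML APIs; the name im2col_rows and the single-channel simplification are assumptions for illustration only). Each extracted patch becomes one contiguous output row, so every output position is a single contiguous row write rather than scattered writes down a column:

```python
# Hypothetical, minimal im2col sketch (pure Python lists; illustration only).
# A single-channel image is stored as a list of Hin rows, each of length Win.
# The output has one row per output position (hout, wout) -- the row-major
# layout recommended in point (2): each patch is one contiguous row write.

def im2col_rows(img, Hin, Win, Hf, Wf, strideh, stridew):
    Hout = (Hin - Hf) // strideh + 1
    Wout = (Win - Wf) // stridew + 1
    img_cols = []
    for hout in range(Hout):
        hin = hout * strideh
        for wout in range(Wout):
            win = wout * stridew
            # Flatten the Hf x Wf patch into one row and append it:
            # a single contiguous update per output position.
            patch = [img[hin + i][win + j] for i in range(Hf) for j in range(Wf)]
            img_cols.append(patch)
    return img_cols  # shape: (Hout*Wout) x (Hf*Wf)

# Example: 3x3 image, 2x2 filter, stride 1 -> 4 patches of 4 values each.
img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
patches = im2col_rows(img, 3, 3, 2, 2, 1, 1)
```

In the DML version, the analogous row write is img_cols[(hout-1)*Wout + wout,] = ..., which a sparse CSR block can serve without shifting later rows, whereas a column write touches every row of the matrix.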
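Similarly, the thread-local statistics maintenance mentioned at the end of the comment (as a future alternative to contended multi-threaded updates) can be sketched in Python; all names here are hypothetical and SystemML's actual statistics live in Java. Each worker counts instruction executions in a private Counter, and the per-thread counts are merged once after all threads join, so no synchronization is needed on the hot path:

```python
import threading
from collections import Counter

# Hypothetical sketch of thread-local statistics maintenance:
# each worker updates a private Counter; the totals are merged once
# at the end instead of contending on a shared map per instruction.

def run_workers(num_threads, updates_per_thread):
    results = []  # one private Counter per worker, merged after join()

    def worker():
        local_stats = Counter()      # thread-local, no locking needed
        for _ in range(updates_per_thread):
            local_stats["leftIndex"] += 1
        results.append(local_stats)  # one append per thread, at the end

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    total = Counter()
    for stats in results:            # single-threaded merge at the end
        total.update(stats)
    return total

total = run_workers(num_threads=8, updates_per_thread=1000)
```

The design choice is the same one the comment hints at for SystemML: move per-update cost off the shared structure and pay a one-time merge cost per thread instead.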