[
https://issues.apache.org/jira/browse/SYSTEMML-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mike Dusenberry updated SYSTEMML-633:
-------------------------------------
Attachment: Im2colWrapper.java
> Improve Left-Indexing Performance with (Nested) Parfor Loops
> ------------------------------------------------------------
>
> Key: SYSTEMML-633
> URL: https://issues.apache.org/jira/browse/SYSTEMML-633
> Project: SystemML
> Issue Type: Improvement
> Components: ParFor
> Reporter: Mike Dusenberry
> Attachments: Im2colWrapper.java, log.txt
>
>
> In the experimental deep learning DML library I've been building
> ([https://github.com/dusenberrymw/systemml-nn|https://github.com/dusenberrymw/systemml-nn]),
> I've experienced severe bottlenecks due to *left-indexing* in parfor loops.
> Here, I will highlight a few particular instances with simplified examples,
> but the same issue is shared across many areas of the library, particularly
> in the convolution and max pooling layers, and is exacerbated in real
> use-cases.
> *Quick note* on setup for the experiments below: please grab a copy
> of the above repo (particularly the {{nn}} directory), and run each
> experiment with the {{nn}} package available at the base directory of the
> experiment.
> Scenario: *Convolution*
> * In the library above, the forward pass of the convolution function
> ([{{conv::forward(...)}} |
> https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L8]
> in {{nn/layers/conv.dml}}) essentially accepts a matrix {{X}} of images, a
> matrix of weights {{W}}, and several other parameters corresponding to image
> sizes, filter sizes, etc. It then loops through the images with a {{parfor}}
> loop, and for each image it pads the image with {{util::pad_image}}, extracts
> "patches" of the image into columns of a matrix in a sliding fashion across
> the image with {{util::im2col}}, performs a matrix multiplication between the
> matrix of patch columns and the weight matrix, and then saves the result into
> a matrix defined outside of the parfor loop using left-indexing.
> * Left-indexing has been identified as the bottleneck by a wide margin.
> * Left-indexing is used in the main {{conv::forward(...)}} function in the
> [last line in the parfor
> loop|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L61],
> in the
> [{{util::pad_image(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L196]
> function used by {{conv::forward(...)}}, as well as in the
> [{{util::im2col(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L96]
> function used by {{conv::forward(...)}}.
> * Test script (assuming the {{nn}} package is available):
> ** {{speed-633.dml}} {code}
> source("nn/layers/conv.dml") as conv
> source("nn/util.dml") as util
> # Generate data
> N = 64 # num examples
> C = 30 # num channels
> Hin = 28 # input height
> Win = 28 # input width
> F = 20 # num filters
> Hf = 3 # filter height
> Wf = 3 # filter width
> stride = 1
> pad = 1
> X = rand(rows=N, cols=C*Hin*Win)
> # Create layer
> [W, b] = conv::init(F, C, Hf, Wf)
> # Forward
> [out, Hout, Wout] = conv::forward(X, W, b, C, Hin, Win, Hf, Wf, stride, stride, pad, pad)
> print("Out: " + nrow(out) + "x" + ncol(out))
> print("Hout: " + Hout)
> print("Wout: " + Wout)
> print("")
> print(sum(out))
> {code}
> * Invocation:
> ** {{java -jar
> $SYSTEMML_HOME/target/systemml-0.10.0-incubating-SNAPSHOT-standalone.jar -f
> speed-633.dml -stats -explain -exec singlenode}}
> * Stats output (modified to output up to 100 instructions):
> ** {code}
> ...
> Total elapsed time: 26.834 sec.
> Total compilation time: 0.529 sec.
> Total execution time: 26.304 sec.
> Number of compiled MR Jobs: 0.
> Number of executed MR Jobs: 0.
> Cache hits (Mem, WB, FS, HDFS): 9196235/0/0/0.
> Cache writes (WB, FS, HDFS): 3070724/0/0.
> Cache times (ACQr/m, RLS, EXP): 1.474/1.120/26.998/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/0.
> HOP DAGs recompile time: 0.268 sec.
> Functions recompiled: 129.
> Functions recompile time: 0.841 sec.
> ParFor loops optimized: 1.
> ParFor optimize time: 0.032 sec.
> ParFor initialize time: 0.015 sec.
> ParFor result merge time: 0.028 sec.
> ParFor total update in-place: 0/0/1559360
> Total JIT compile time: 14.235 sec.
> Total JVM GC count: 94.
> Total JVM GC time: 0.366 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) leftIndex 41.670 sec 1559360
> -- 2) forward 26.212 sec 1
> -- 3) im2col_t45 25.919 sec 8
> -- 4) im2col_t41 25.850 sec 8
> -- 5) im2col_t48 25.831 sec 8
> -- 6) im2col_t43 25.752 sec 8
> -- 7) im2col_t44 25.736 sec 8
> -- 8) im2col_t42 25.695 sec 8
> -- 9) im2col_t47 25.691 sec 8
> -- 10) im2col_t46 25.519 sec 8
> -- 11) rangeReIndex 13.392 sec 3012544
> -- 12) rshape 8.197 sec 3064704
> -- 13) rmvar 4.988 sec 20363504
> -- 14) createvar 4.954 sec 7688965
> -- 15) ncol 1.148 sec 3014529
> -- 16) - 0.961 sec 3112834
> -- 17) + 0.878 sec 3124617
> -- 18) rand 0.839 sec 52228
> -- 19) * 0.480 sec 1764229
> -- 20) cpvar 0.366 sec 1607812
> -- 21) ba+* 0.257 sec 64
> -- 22) pad_image_t42 0.187 sec 8
> -- 23) pad_image_t47 0.181 sec 8
> -- 24) pad_image_t44 0.168 sec 8
> -- 25) pad_image_t46 0.164 sec 8
> -- 26) pad_image_t43 0.156 sec 8
> -- 27) pad_image_t48 0.153 sec 8
> -- 28) pad_image_t45 0.152 sec 8
> -- 29) pad_image_t41 0.152 sec 8
> -- 30) nrow 0.036 sec 50307
> -- 31) assignvar 0.016 sec 52235
> -- 32) uak+ 0.015 sec 1
> -- 33) castvti 0.000 sec 130
> -- 34) print 0.000 sec 5
> -- 35) / 0.000 sec 130
> -- 36) sqrt 0.000 sec 1
> {code}
> ** *Full log file attached* (including a {{log=DEBUG}} modification to the
> parfor loop in {{conv::forward(...)}}).
> ** Note again that {{forward}}, {{im2col}}, and {{pad_image}} all involve
> left-indexing.
> * Other notes:
> ** Further experiments involved replacing the {{util::im2col(...)}} function
> with an external Java function using a basic, nested for-loop approach with
> no regard for optimization. This was at least *100x* faster than the fastest
> parfor DML version, and the speedup over the same naive for-loop approach
> in DML was even greater.
> ** Even with this external version of {{im2col}}, and with padding disabled,
> the left-indexing within the parfor loop of {{conv::forward(...)}} still
> dominated the execution time, acting as the major bottleneck.
> ** For all described experiments, logging indicated that parfor update in
> place was *not* applied.
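
For reference, the "basic, nested for-loop approach" to im2col mentioned in the notes above can be sketched in Java roughly as follows. This is only an illustrative sketch and is not the attached Im2colWrapper.java: the class name Im2colSketch, the method signature, the flattened row-major C x Hin x Win image layout, and the (C*Hf*Wf) x (Hout*Wout) output layout are all assumptions made for this example. It also folds zero padding into the same loops instead of calling a separate padding function.

```java
// Hypothetical sketch of a naive im2col; names and layouts are assumptions,
// not taken from the attached Im2colWrapper.java.
final class Im2colSketch {

    /**
     * Extracts sliding patches of one image into the columns of a matrix.
     *
     * @param img one image, flattened row-major as C x Hin x Win (assumed layout)
     * @return a (C*Hf*Wf) x (Hout*Wout) matrix; column (h*Wout + w) holds the
     *         patch whose top-left corner sits at output position (h, w)
     */
    static double[][] im2col(double[] img, int C, int Hin, int Win,
                             int Hf, int Wf, int stride, int pad) {
        int Hout = (Hin + 2 * pad - Hf) / stride + 1;
        int Wout = (Win + 2 * pad - Wf) / stride + 1;
        // Allocated once; Java initializes every cell to 0.0.
        double[][] cols = new double[C * Hf * Wf][Hout * Wout];

        for (int c = 0; c < C; c++) {
            for (int fh = 0; fh < Hf; fh++) {
                for (int fw = 0; fw < Wf; fw++) {
                    int row = (c * Hf + fh) * Wf + fw;
                    for (int h = 0; h < Hout; h++) {
                        int y = h * stride + fh - pad;
                        if (y < 0 || y >= Hin) continue; // padded region stays 0.0
                        for (int w = 0; w < Wout; w++) {
                            int x = w * stride + fw - pad;
                            if (x < 0 || x >= Win) continue; // padded region stays 0.0
                            cols[row][h * Wout + w] = img[(c * Hin + y) * Win + x];
                        }
                    }
                }
            }
        }
        return cols;
    }

    public static void main(String[] args) {
        // Sizes taken from the speed-633.dml test script above.
        int C = 30, Hin = 28, Win = 28, Hf = 3, Wf = 3, stride = 1, pad = 1;
        double[] img = new double[C * Hin * Win];
        java.util.Random rng = new java.util.Random(42L);
        for (int i = 0; i < img.length; i++) {
            img[i] = rng.nextDouble();
        }
        double[][] cols = im2col(img, C, Hin, Win, Hf, Wf, stride, pad);
        System.out.println("Patch matrix: " + cols.length + "x" + cols[0].length);
    }
}
```

In this sketch the output array is allocated once and each cell is written directly, so the work is proportional to the number of output cells. With the layout assumed here, a weight matrix of shape F x (C*Hf*Wf) could be multiplied against the result to obtain an F x (Hout*Wout) output per image.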