[ https://issues.apache.org/jira/browse/SYSTEMML-633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251013#comment-15251013 ]
Mike Dusenberry commented on SYSTEMML-633:
------------------------------------------

For another experiment, I created a script that approximates the idea of the {{conv::forward(...)}} function, particularly trying to isolate the matrix multiplication and the left-indexing in the parfor loop, similar to the previous experiment. In this case, I used no functions, and update-in-place was successfully applied to the parfor loop, thus removing left-indexing as a bottleneck.

{code}
# Generate data
N = 64  # num examples
C = 30  # num channels
Hin = 28  # input height
Win = 28  # input width
F = 20  # num filters
Hf = 3  # filter height
Wf = 3  # filter width
stride = 1; strideh = stride; stridew = stride
pad = 0; padh = pad; padw = pad
max_iterations = 1
batch_size = 64

# Create layer
W = rand(rows=F, cols=C*Hf*Wf, pdf="normal") * sqrt(2.0/(C*Hf*Wf))
b = matrix(0, rows=F, cols=1)

Hout = as.integer((Hin + 2 * padh - Hf) / strideh + 1)
Wout = as.integer((Win + 2 * padw - Wf) / stridew + 1)
X = rand(rows=N, cols=C*Hf*Wf*Hout*Wout)

foo = 0
for (iter in 1:max_iterations) {
  beg = ((iter-1) * batch_size) %% N + 1
  end = min(N, beg + batch_size - 1)
  X_batch = X[beg:end,]

  # Create output volume
  out = matrix(0, rows=N, cols=F*Hout*Wout)

  # Forward
  parfor (n in 1:nrow(X_batch), log=DEBUG) {  # all examples
    # Extract local image patches into columns with im2col, of shape (C*Hf*Wf, Hout*Wout)
    Xn_padded_cols = matrix(X_batch[n,], rows=C*Hf*Wf, cols=Hout*Wout)
    # Convolve patches with filters
    outn = W %*% Xn_padded_cols + b  # shape (F, Hout*Wout)
    out[n,] = matrix(outn, rows=1, cols=F*Hout*Wout)  # reshape
  }

  foo = foo + min(0.1, sum(out))
}
print(foo)
{code}

{code}
16/04/20 17:25:02 INFO api.DMLScript: SystemML Statistics:
Total elapsed time: 0.993 sec.
Total compilation time: 0.278 sec.
Total execution time: 0.715 sec.
Number of compiled MR Jobs: 0.
Number of executed MR Jobs: 0.
Cache hits (Mem, WB, FS, HDFS): 460/0/0/0.
Cache writes (WB, FS, HDFS): 261/0/0.
Cache times (ACQr/m, RLS, EXP): 0.002/0.002/0.025/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0.000 sec.
ParFor loops optimized: 1.
ParFor optimize time: 0.017 sec.
ParFor initialize time: 0.014 sec.
ParFor result merge time: 0.021 sec.
ParFor total update in-place: 0/64/64
Total JIT compile time: 1.164 sec.
Total JVM GC count: 4.
Total JVM GC time: 0.023 sec.
Heavy hitter instructions (name, time, count):
-- 1) ba+* 0.392 sec 64
-- 2) rand 0.322 sec 4
-- 3) leftIndex 0.286 sec 64
-- 4) rangeReIndex 0.234 sec 65
-- 5) rshape 0.031 sec 128
-- 6) uak+ 0.013 sec 1
-- 7) rmvar 0.010 sec 538
-- 8) ncol 0.009 sec 129
-- 9) createvar 0.003 sec 326
-- 10) * 0.002 sec 5
{code}

Increasing {{max_iterations}} to {{100}} and the number of examples {{N}} to {{640}} continues this trend:

{code}
16/04/20 17:28:30 INFO api.DMLScript: SystemML Statistics:
Total elapsed time: 17.928 sec.
Total compilation time: 0.276 sec.
Total execution time: 17.652 sec.
Number of compiled MR Jobs: 0.
Number of executed MR Jobs: 0.
Cache hits (Mem, WB, FS, HDFS): 45901/0/0/0.
Cache writes (WB, FS, HDFS): 25802/1/0.
Cache times (ACQr/m, RLS, EXP): 0.035/0.022/2.894/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time: 0.019 sec.
ParFor loops optimized: 100.
ParFor optimize time: 0.210 sec.
ParFor initialize time: 0.120 sec.
ParFor result merge time: 1.090 sec.
ParFor total update in-place: 0/6400/6400
Total JIT compile time: 10.927 sec.
Total JVM GC count: 55.
Total JVM GC time: 1.356 sec.
Heavy hitter instructions (name, time, count):
-- 1) ba+* 26.353 sec 6400
-- 2) rangeReIndex 16.213 sec 6500
-- 3) leftIndex 10.236 sec 6400
-- 4) rand 4.576 sec 103
-- 5) uak+ 0.509 sec 100
-- 6) rshape 0.210 sec 12800
-- 7) createvar 0.083 sec 32204
-- 8) rmvar 0.080 sec 52711
-- 9) ncol 0.037 sec 12900
-- 10) cpvar 0.009 sec 6603
{code}

> Improve Left-Indexing Performance with (Nested) Parfor Loops
> ------------------------------------------------------------
>
>         Key: SYSTEMML-633
>         URL: https://issues.apache.org/jira/browse/SYSTEMML-633
>     Project: SystemML
>  Issue Type: Improvement
>  Components: ParFor
>    Reporter: Mike Dusenberry
> Attachments: Im2colWrapper.java, log.txt, systemml-nn.zip
>
> In the experimental deep learning DML library I've been building
> ([https://github.com/dusenberrymw/systemml-nn|https://github.com/dusenberrymw/systemml-nn]),
> I've experienced severe bottlenecks due to *left-indexing* in parfor loops.
> Here, I will highlight a few particular instances with simplified examples,
> but the same issue is shared across many areas of the library, particularly
> in the convolution and max pooling layers, and is exacerbated in real
> use cases.
>
> *Quick note* on setup for any of the experiments below: please grab a copy
> of the above repo (particularly the {{nn}} directory), and run any
> experiment with the {{nn}} package available at the base directory of the
> experiment.
>
> Scenario: *Convolution*
> * In the library above, the forward pass of the convolution function
> ([{{conv::forward(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L8]
> in {{nn/layers/conv.dml}}) essentially accepts a matrix {{X}} of images, a
> matrix of weights {{W}}, and several other parameters corresponding to image
> sizes, filter sizes, etc. It then loops through the images with a {{parfor}}
> loop, and for each image it pads the image with {{util::pad_image}}, extracts
> "patches" of the image into columns of a matrix in a sliding fashion across
> the image with {{util::im2col}}, performs a matrix multiplication between the
> matrix of patch columns and the weight matrix, and then saves the result into
> a matrix defined outside of the parfor loop using left-indexing.
> * Left-indexing has been identified as the bottleneck by a wide margin.
> * Left-indexing is used in the main {{conv::forward(...)}} function in the
> [last line of the parfor loop|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L61],
> in the
> [{{util::pad_image(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L196]
> function used by {{conv::forward(...)}}, as well as in the
> [{{util::im2col(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L96]
> function used by {{conv::forward(...)}}.
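As a rough illustration of the im2col scheme described above: each image is unrolled into a matrix of patch columns so the convolution becomes a single matrix multiply per image, whose result is then written back into the output matrix (the left-indexing step). The following NumPy sketch is illustrative only, not SystemML code, and its names ({{im2col}}, sizes, etc.) are hypothetical stand-ins for the DML functions:

```python
import numpy as np

def im2col(img, C, Hin, Win, Hf, Wf, stride):
    """Extract sliding (C*Hf*Wf)-sized patches of one image into columns.

    img: flattened image of shape (C*Hin*Win,).
    Returns a matrix of shape (C*Hf*Wf, Hout*Wout).
    Naive loop version, mirroring the idea of util::im2col (names hypothetical).
    """
    x = img.reshape(C, Hin, Win)
    Hout = (Hin - Hf) // stride + 1
    Wout = (Win - Wf) // stride + 1
    cols = np.empty((C * Hf * Wf, Hout * Wout))
    for i in range(Hout):
        for j in range(Wout):
            patch = x[:, i*stride:i*stride+Hf, j*stride:j*stride+Wf]
            cols[:, i * Wout + j] = patch.ravel()
    return cols

# Toy sizes (no padding, for brevity)
N, C, Hin, Win, F, Hf, Wf, stride = 2, 3, 5, 5, 4, 3, 3, 1
Hout = (Hin - Hf) // stride + 1
Wout = (Win - Wf) // stride + 1

rng = np.random.default_rng(0)
X = rng.standard_normal((N, C * Hin * Win))   # N flattened images
W = rng.standard_normal((F, C * Hf * Wf))     # filters as rows
b = np.zeros((F, 1))

out = np.zeros((N, F * Hout * Wout))
for n in range(N):                            # the parfor loop in the DML version
    cols = im2col(X[n], C, Hin, Win, Hf, Wf, stride)  # (C*Hf*Wf, Hout*Wout)
    outn = W @ cols + b                       # convolution as matmul: (F, Hout*Wout)
    out[n, :] = outn.ravel()                  # the left-indexing write-back
```

The per-example row write in the last line is the operation that, in the DML version, triggers the left-indexing (and ideally the parfor update-in-place) behavior under discussion.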
> * Test script (assuming the {{nn}} package is available):
> ** {{speed-633.dml}}
> {code}
> source("nn/layers/conv.dml") as conv
> source("nn/util.dml") as util
>
> # Generate data
> N = 64  # num examples
> C = 30  # num channels
> Hin = 28  # input height
> Win = 28  # input width
> F = 20  # num filters
> Hf = 3  # filter height
> Wf = 3  # filter width
> stride = 1
> pad = 1
> X = rand(rows=N, cols=C*Hin*Win)
>
> # Create layer
> [W, b] = conv::init(F, C, Hf, Wf)
>
> # Forward
> [out, Hout, Wout] = conv::forward(X, W, b, C, Hin, Win, Hf, Wf, stride,
>                                   stride, pad, pad)
>
> print("Out: " + nrow(out) + "x" + ncol(out))
> print("Hout: " + Hout)
> print("Wout: " + Wout)
> print("")
> print(sum(out))
> {code}
> * Invocation:
> ** {{java -jar $SYSTEMML_HOME/target/systemml-0.10.0-incubating-SNAPSHOT-standalone.jar -f speed-633.dml -stats -explain -exec singlenode}}
> * Stats output (modified to output up to 100 instructions):
> ** {code}
> ...
> Total elapsed time: 26.834 sec.
> Total compilation time: 0.529 sec.
> Total execution time: 26.304 sec.
> Number of compiled MR Jobs: 0.
> Number of executed MR Jobs: 0.
> Cache hits (Mem, WB, FS, HDFS): 9196235/0/0/0.
> Cache writes (WB, FS, HDFS): 3070724/0/0.
> Cache times (ACQr/m, RLS, EXP): 1.474/1.120/26.998/0.000 sec.
> HOP DAGs recompiled (PRED, SB): 0/0.
> HOP DAGs recompile time: 0.268 sec.
> Functions recompiled: 129.
> Functions recompile time: 0.841 sec.
> ParFor loops optimized: 1.
> ParFor optimize time: 0.032 sec.
> ParFor initialize time: 0.015 sec.
> ParFor result merge time: 0.028 sec.
> ParFor total update in-place: 0/0/1559360
> Total JIT compile time: 14.235 sec.
> Total JVM GC count: 94.
> Total JVM GC time: 0.366 sec.
> Heavy hitter instructions (name, time, count):
> -- 1) leftIndex 41.670 sec 1559360
> -- 2) forward 26.212 sec 1
> -- 3) im2col_t45 25.919 sec 8
> -- 4) im2col_t41 25.850 sec 8
> -- 5) im2col_t48 25.831 sec 8
> -- 6) im2col_t43 25.752 sec 8
> -- 7) im2col_t44 25.736 sec 8
> -- 8) im2col_t42 25.695 sec 8
> -- 9) im2col_t47 25.691 sec 8
> -- 10) im2col_t46 25.519 sec 8
> -- 11) rangeReIndex 13.392 sec 3012544
> -- 12) rshape 8.197 sec 3064704
> -- 13) rmvar 4.988 sec 20363504
> -- 14) createvar 4.954 sec 7688965
> -- 15) ncol 1.148 sec 3014529
> -- 16) - 0.961 sec 3112834
> -- 17) + 0.878 sec 3124617
> -- 18) rand 0.839 sec 52228
> -- 19) * 0.480 sec 1764229
> -- 20) cpvar 0.366 sec 1607812
> -- 21) ba+* 0.257 sec 64
> -- 22) pad_image_t42 0.187 sec 8
> -- 23) pad_image_t47 0.181 sec 8
> -- 24) pad_image_t44 0.168 sec 8
> -- 25) pad_image_t46 0.164 sec 8
> -- 26) pad_image_t43 0.156 sec 8
> -- 27) pad_image_t48 0.153 sec 8
> -- 28) pad_image_t45 0.152 sec 8
> -- 29) pad_image_t41 0.152 sec 8
> -- 30) nrow 0.036 sec 50307
> -- 31) assignvar 0.016 sec 52235
> -- 32) uak+ 0.015 sec 1
> -- 33) castvti 0.000 sec 130
> -- 34) print 0.000 sec 5
> -- 35) / 0.000 sec 130
> -- 36) sqrt 0.000 sec 1
> {code}
> ** *Full log file attached* (including a {{log=DEBUG}} modification to the
> parfor loop in {{conv::forward(...)}}).
> ** Note again that {{forward}}, {{im2col}}, and {{pad_image}} all involve
> left-indexing.
> * Other notes:
> ** Further experiments involved replacing the {{util::im2col(...)}} function
> with an external Java function using a basic, nested for-loop approach with
> no regard for optimization. Compared with the fastest parfor DML version, I
> experienced at least a *100x* speed improvement. When compared to the same
> naive for-loop approach in DML, the speedup was even greater.
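One restructuring that sidesteps per-example left-indexing entirely is to fold the per-image matrix multiplies into a single batched tensor contraction, so no row-by-row write-back is needed. The sketch below demonstrates the algebraic equivalence in NumPy; it is illustrative only, not SystemML code, and says nothing about whether the SystemML optimizer could apply such a rewrite automatically:

```python
import numpy as np

# Toy sizes: K = C*Hf*Wf (patch length), P = Hout*Wout (patches per image)
N, K, P, F = 4, 27, 16, 5
rng = np.random.default_rng(1)
X = rng.standard_normal((N, K * P))   # each row already holds im2col'ed patches
W = rng.standard_normal((F, K))
b = np.zeros((F, 1))

# Loop formulation: one matmul plus one left-indexed row write per example
out_loop = np.zeros((N, F * P))
for n in range(N):
    out_loop[n, :] = (W @ X[n].reshape(K, P) + b).ravel()

# Batched formulation: a single tensor contraction, no per-row writes
out_batch = (np.einsum('fk,nkp->nfp', W, X.reshape(N, K, P)) + b).reshape(N, F * P)

assert np.allclose(out_loop, out_batch)
```

Both produce identical results; the batched form simply trades N small matmuls and N indexed writes for one contraction over the whole batch.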
> ** Even with this external version of {{im2col}}, and with padding disabled,
> the left-indexing within the parfor loop of {{conv::forward(...)}} still
> dominated the execution time, acting as the major bottleneck.
> ** For all described experiments, logging indicated that parfor update
> in-place was *not* applied.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)