[ 
https://issues.apache.org/jira/browse/SYSTEMML-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Dusenberry updated SYSTEMML-633:
-------------------------------------
    Description: 
In the experimental deep learning DML library I've been building 
([https://github.com/dusenberrymw/systemml-nn]), I've experienced severe 
bottlenecks due to *left-indexing* in parfor loops.  Here, I will highlight a 
few particular instances with simplified examples, but the same issue is shared 
across many areas of the library, particularly in the convolution and max 
pooling layers, and is exacerbated in real use cases.

*Quick note* on setup for the experiments below: please grab a copy of the 
above repo (particularly the {{nn}} directory), and run each experiment with 
the {{nn}} package available in the experiment's base directory.


Scenario: *Convolution*

* In the library above, the forward pass of the convolution function 
([{{conv::forward(...)}} | 
https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L8]
 in {{nn/layers/conv.dml}}) essentially accepts a matrix {{X}} of images, a 
matrix of weights {{W}}, and several other parameters corresponding to image 
sizes, filter sizes, etc.  It then loops through the images with a {{parfor}} 
loop, and for each image it pads the image with {{util::pad_image}}, extracts 
"patches" of the image into columns of a matrix in a sliding fashion across the 
image with {{util::im2col}}, performs a matrix multiplication between the 
matrix of patch columns and the weight matrix, and then saves the result into a 
matrix defined outside of the parfor loop using left-indexing.
* Left-indexing has been identified as the bottleneck by a wide margin.
* Left-indexing is used in the main {{conv::forward(...)}} function in the 
[last line in the parfor 
loop|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L61],
 in the 
[{{util::pad_image(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L196]
 function used by {{conv::forward(...)}}, as well as in the 
[{{util::im2col(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L96]
 function used by {{conv::forward(...)}}.
* Test script (assuming the {{nn}} package is available):
** {{speed-633.dml}} {code}
source("nn/layers/conv.dml") as conv
source("nn/util.dml") as util

# Generate data
N = 64  # num examples
C = 30  # num channels
Hin = 28  # input height
Win = 28  # input width
F = 20  # num filters
Hf = 3  # filter height
Wf = 3  # filter width
stride = 1
pad = 1
X = rand(rows=N, cols=C*Hin*Win)

# Create layer
[W, b] = conv::init(F, C, Hf, Wf)

# Forward
[out, Hout, Wout] = conv::forward(X, W, b, C, Hin, Win, Hf, Wf, stride, stride, 
pad, pad)

print("Out: " + nrow(out) + "x" + ncol(out))
print("Hout: " + Hout)
print("Wout: " + Wout)
print("")
print(sum(out))
{code}
* Invocation:
** {{java -jar 
$SYSTEMML_HOME/target/systemml-0.10.0-incubating-SNAPSHOT-standalone.jar -f 
speed-633.dml -stats -explain -exec singlenode}}
* Stats output (modified to output up to 100 instructions):
** {code}
...
Total elapsed time:   26.834 sec.
Total compilation time:   0.529 sec.
Total execution time:   26.304 sec.
Number of compiled MR Jobs: 0.
Number of executed MR Jobs: 0.
Cache hits (Mem, WB, FS, HDFS): 9196235/0/0/0.
Cache writes (WB, FS, HDFS):  3070724/0/0.
Cache times (ACQr/m, RLS, EXP): 1.474/1.120/26.998/0.000 sec.
HOP DAGs recompiled (PRED, SB): 0/0.
HOP DAGs recompile time:  0.268 sec.
Functions recompiled:   129.
Functions recompile time: 0.841 sec.
ParFor loops optimized:   1.
ParFor optimize time:   0.032 sec.
ParFor initialize time:   0.015 sec.
ParFor result merge time: 0.028 sec.
ParFor total update in-place: 0/0/1559360
Total JIT compile time:   14.235 sec.
Total JVM GC count:   94.
Total JVM GC time:    0.366 sec.
Heavy hitter instructions (name, time, count):
-- 1)   leftIndex   41.670 sec  1559360
-- 2)   forward   26.212 sec  1
-- 3)   im2col_t45  25.919 sec  8
-- 4)   im2col_t41  25.850 sec  8
-- 5)   im2col_t48  25.831 sec  8
-- 6)   im2col_t43  25.752 sec  8
-- 7)   im2col_t44  25.736 sec  8
-- 8)   im2col_t42  25.695 sec  8
-- 9)   im2col_t47  25.691 sec  8
-- 10)  im2col_t46  25.519 sec  8
-- 11)  rangeReIndex  13.392 sec  3012544
-- 12)  rshape  8.197 sec   3064704
-- 13)  rmvar   4.988 sec   20363504
-- 14)  createvar   4.954 sec   7688965
-- 15)  ncol  1.148 sec   3014529
-- 16)  -   0.961 sec   3112834
-- 17)  +   0.878 sec   3124617
-- 18)  rand  0.839 sec   52228
-- 19)  *   0.480 sec   1764229
-- 20)  cpvar   0.366 sec   1607812
-- 21)  ba+*  0.257 sec   64
-- 22)  pad_image_t42   0.187 sec   8
-- 23)  pad_image_t47   0.181 sec   8
-- 24)  pad_image_t44   0.168 sec   8
-- 25)  pad_image_t46   0.164 sec   8
-- 26)  pad_image_t43   0.156 sec   8
-- 27)  pad_image_t48   0.153 sec   8
-- 28)  pad_image_t45   0.152 sec   8
-- 29)  pad_image_t41   0.152 sec   8
-- 30)  nrow  0.036 sec   50307
-- 31)  assignvar   0.016 sec   52235
-- 32)  uak+  0.015 sec   1
-- 33)  castvti   0.000 sec   130
-- 34)  print   0.000 sec   5
-- 35)  /   0.000 sec   130
-- 36)  sqrt  0.000 sec   1
{code}
** *Full log file attached* (including a {{log=DEBUG}} modification to the 
parfor loop in {{conv::forward(...)}}).
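For readers without a SystemML setup, the forward pass described above (pad, then {{im2col}}, then a matrix multiply, then a left-indexed write into the shared output) can be sketched as a NumPy analogue. This is an illustrative sketch only, not the library's actual implementation; the function names, shapes, and parameter order mirror the description above but are otherwise assumptions:

```python
import numpy as np

def im2col(img, C, Hin, Win, Hf, Wf, stride):
    """Extract sliding (C*Hf*Wf)-sized patches of one (already padded) image
    into columns of a (C*Hf*Wf, Hout*Wout) matrix."""
    Hout = (Hin - Hf) // stride + 1
    Wout = (Win - Wf) // stride + 1
    cols = np.empty((C * Hf * Wf, Hout * Wout))
    for i in range(Hout):
        for j in range(Wout):
            patch = img[:, i*stride:i*stride+Hf, j*stride:j*stride+Wf]
            cols[:, i * Wout + j] = patch.ravel()
    return cols

def conv_forward(X, W, b, C, Hin, Win, Hf, Wf, stride, pad):
    """Convolution forward pass via pad -> im2col -> matmul, one image at a time.
    X: (N, C*Hin*Win) images, W: (F, C*Hf*Wf) filters, b: (F,) biases."""
    N = X.shape[0]
    F = W.shape[0]
    Hout = (Hin + 2*pad - Hf) // stride + 1
    Wout = (Win + 2*pad - Wf) // stride + 1
    out = np.empty((N, F * Hout * Wout))
    for n in range(N):  # the parfor loop in the DML version
        img = X[n].reshape(C, Hin, Win)
        img = np.pad(img, ((0, 0), (pad, pad), (pad, pad)))  # cf. util::pad_image
        cols = im2col(img, C, Hin + 2*pad, Win + 2*pad, Hf, Wf, stride)
        # (F, C*Hf*Wf) @ (C*Hf*Wf, Hout*Wout) + bias, flattened into one row;
        # the assignment below corresponds to the left-indexed write in DML
        out[n] = (W.reshape(F, -1) @ cols + b.reshape(F, 1)).ravel()
    return out, Hout, Wout
```

In NumPy the per-row write is a cheap in-place update; in DML, each such left-indexed update inside the parfor loop is where the reported cost accumulates.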
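The heavy-hitter list above shows {{leftIndex}} dominating at 1,559,360 invocations. Conceptually, the cost pattern resembles repeatedly materializing a fresh copy of the target matrix on every update instead of writing in place. A toy NumPy contrast of the two strategies (illustrative only; SystemML's actual buffer-pool and update-in-place machinery differs):

```python
import numpy as np

def build_by_copy(rows):
    """Grow the result by allocating a fresh, larger matrix on every append
    (O(n^2) total copying) -- analogous to a left-indexed update that
    materializes a new copy of the target each iteration."""
    out = np.empty((0, rows[0].size))
    for r in rows:
        out = np.vstack([out, r])  # copies everything written so far
    return out

def build_in_place(rows):
    """Preallocate once, then write each row in place (O(n) total work)."""
    out = np.empty((len(rows), rows[0].size))
    for i, r in enumerate(rows):
        out[i, :] = r  # in-place update; rest of the matrix is untouched
    return out
```

Both produce the same matrix; only the copying behavior differs, which is why an update-in-place optimization for parfor result variables (cf. the "ParFor total update in-place: 0/0/1559360" line above) is the natural fix.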


> Improve Left-Indexing Performance with (Nested) Parfor Loops
> ------------------------------------------------------------
>
>                 Key: SYSTEMML-633
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-633
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
>         Attachments: log.txt
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)