[ https://issues.apache.org/jira/browse/SYSTEMML-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mike Dusenberry updated SYSTEMML-633:
-------------------------------------
Description:

In the experimental deep learning DML library I've been building ([https://github.com/dusenberrymw/systemml-nn]), I've experienced severe bottlenecks due to *left-indexing* in parfor loops. Here, I will highlight a few particular instances with simplified examples, but the same issue is shared across many areas of the library, particularly in the convolution and max-pooling layers, and is exacerbated in real use cases.

*Quick note* on setup for the experiments below: please grab a copy of the above repo (particularly the {{nn}} directory), and run any experiments with the {{nn}} package available at the base directory of the experiment.

Scenario: *Convolution*
* In the library above, the forward pass of the convolution function ([{{conv::forward(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L8] in {{nn/layers/conv.dml}}) accepts a matrix {{X}} of images, a matrix of weights {{W}}, and several other parameters corresponding to image sizes, filter sizes, etc. It then loops through the images with a {{parfor}} loop; for each image, it pads the image with {{util::pad_image}}, extracts "patches" of the image into columns of a matrix in a sliding fashion across the image with {{util::im2col}}, performs a matrix multiplication between the matrix of patch columns and the weight matrix, and then saves the result into a matrix defined outside the parfor loop using left-indexing.
* Left-indexing has been identified as the bottleneck by a wide margin.
* Left-indexing is used in the main {{conv::forward(...)}} function in the [last line of the parfor loop|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L61], in the [{{util::pad_image(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L196] function used by {{conv::forward(...)}}, and in the [{{util::im2col(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L96] function used by {{conv::forward(...)}}.
* Test script (assuming the {{nn}} package is available):
** {{speed-633.dml}}
{code}
source("nn/layers/conv.dml") as conv
source("nn/util.dml") as util

# Generate data
N = 64     # num examples
C = 30     # num channels
Hin = 28   # input height
Win = 28   # input width
F = 20     # num filters
Hf = 3     # filter height
Wf = 3     # filter width
stride = 1
pad = 1
X = rand(rows=N, cols=C*Hin*Win)

# Create layer
[W, b] = conv::init(F, C, Hf, Wf)

# Forward
[out, Hout, Wout] = conv::forward(X, W, b, C, Hin, Win, Hf, Wf, stride, stride, pad, pad)

print("Out: " + nrow(out) + "x" + ncol(out))
print("Hout: " + Hout)
print("Wout: " + Wout)
print("")
print(sum(out))
{code}
* Invocation:
** {{java -jar $SYSTEMML_HOME/target/systemml-0.10.0-incubating-SNAPSHOT-standalone.jar -f speed-633.dml -stats -explain -exec singlenode}}

was:
In the experimental deep learning DML library I've been building ([https://github.com/dusenberrymw/systemml-nn]), I've experienced severe bottlenecks due to *left-indexing* in parfor loops. Here, I will highlight a few particular instances with simplified examples, but the same issue is shared across many areas of the library, particularly in the convolution and max-pooling layers, and is exacerbated in real use cases.

*Quick note* on setup for any of the below experiments.
Please grab a copy of the above repo (particularly the {{nn}} directory), and run any experiments with the {{nn}} package available at the base directory of the experiment.

Scenario: *Convolution*
* In the library above, the forward pass of the convolution function ([{{conv::forward(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L8] in {{nn/layers/conv.dml}}) accepts a matrix {{X}} of images, a matrix of weights {{W}}, and several other parameters corresponding to image sizes, filter sizes, etc. It then loops through the images with a {{parfor}} loop; for each image, it pads the image with {{util::pad_image}}, extracts "patches" of the image into columns of a matrix in a sliding fashion across the image with {{util::im2col}}, performs a matrix multiplication between the matrix of patch columns and the weight matrix, and then saves the result into a matrix defined outside the parfor loop using left-indexing.
* Left-indexing has been identified as the bottleneck by a wide margin.
* Left-indexing is used in the main {{conv::forward(...)}} function in the [last line of the parfor loop|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/layers/conv.dml#L61], in the [{{util::pad_image(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L196] function used by {{conv::forward(...)}}, and in the [{{util::im2col(...)}}|https://github.com/dusenberrymw/systemml-nn/blob/f6d3e077ae3c303eb8426b31329d3734e3483d5f/nn/util.dml#L96] function used by {{conv::forward(...)}}.
* Test script (assuming the {{nn}} package is available):
** {code}
source("nn/layers/conv.dml") as conv
source("nn/util.dml") as util

# Generate data
N = 64     # num examples
C = 30     # num channels
Hin = 28   # input height
Win = 28   # input width
F = 20     # num filters
Hf = 3     # filter height
Wf = 3     # filter width
stride = 1
pad = 1
X = rand(rows=N, cols=C*Hin*Win)

# Create layer
[W, b] = conv::init(F, C, Hf, Wf)

# Forward
[out, Hout, Wout] = conv::forward(X, W, b, C, Hin, Win, Hf, Wf, stride, stride, pad, pad)

print("Out: " + nrow(out) + "x" + ncol(out))
print("Hout: " + Hout)
print("Wout: " + Wout)
print("")
print(sum(out))
{code}

> Improve Left-Indexing Performance with (Nested) Parfor Loops
> ------------------------------------------------------------
>
>          Key: SYSTEMML-633
>          URL: https://issues.apache.org/jira/browse/SYSTEMML-633
>      Project: SystemML
>   Issue Type: Improvement
>     Reporter: Mike Dusenberry
>
> In the experimental deep learning DML library I've been building
> ([https://github.com/dusenberrymw/systemml-nn]), I've experienced severe
> bottlenecks due to *left-indexing* in parfor loops. Here, I will highlight
> a few particular instances with simplified examples, but the same issue is
> shared across many areas of the library, particularly in the convolution
> and max-pooling layers, and is exacerbated in real use cases.
>
> *Quick note* on setup for any of the below experiments. Please grab a copy
> of the above repo (particularly the {{nn}} directory), and run any
> experiments with the {{nn}} package available at the base directory of the
> experiment.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
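For readers outside of DML, the per-image pattern described in the issue (pad the image, extract sliding patches into columns with im2col, multiply by the weight matrix, and write the result into a preallocated output via left-indexing) can be sketched in NumPy. This is an illustration only: the names {{pad_image}}, {{im2col}}, and {{conv_forward}} mirror the DML functions but are hypothetical Python stand-ins, not the SystemML implementation, and a single stride/pad is assumed for both dimensions.

```python
import numpy as np

def pad_image(img, pad):
    # img: (C, Hin, Win) -> zero-padded (C, Hin + 2*pad, Win + 2*pad)
    return np.pad(img, ((0, 0), (pad, pad), (pad, pad)))

def im2col(img, Hf, Wf, stride):
    # img: (C, H, W) -> (C*Hf*Wf, Hout*Wout); each column is one flattened patch
    C, H, W = img.shape
    Hout = (H - Hf) // stride + 1
    Wout = (W - Wf) // stride + 1
    cols = np.empty((C * Hf * Wf, Hout * Wout))
    for i in range(Hout):
        for j in range(Wout):
            patch = img[:, i*stride:i*stride+Hf, j*stride:j*stride+Wf]
            cols[:, i * Wout + j] = patch.ravel()
    return cols

def conv_forward(X, W, b, C, Hin, Win, Hf, Wf, stride, pad):
    # X: (N, C*Hin*Win) images as rows; W: (F, C*Hf*Wf); b: (F, 1)
    N = X.shape[0]
    F = W.shape[0]
    Hout = (Hin + 2 * pad - Hf) // stride + 1
    Wout = (Win + 2 * pad - Wf) // stride + 1
    out = np.empty((N, F * Hout * Wout))  # preallocated, like the matrix outside the parfor loop
    for n in range(N):  # the parfor loop in the DML version
        img = X[n].reshape(C, Hin, Win)
        cols = im2col(pad_image(img, pad), Hf, Wf, stride)
        # (F, C*Hf*Wf) @ (C*Hf*Wf, Hout*Wout) + bias, then written as one
        # output row -- the analog of the left-indexing write in the issue
        out[n] = (W @ cols + b).ravel()
    return out, Hout, Wout
```

In NumPy the row write {{out[n] = ...}} is an in-place copy into the preallocated buffer; the issue is that the analogous left-indexing update inside a DML parfor loop is far more expensive, which is what this ticket asks to improve.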