DickJC123 commented on a change in pull request #14200: Bulked op segments to allow Variable nodes
URL: https://github.com/apache/incubator-mxnet/pull/14200#discussion_r259560832
########## File path: docs/faq/env_var.md ##########
@@ -115,7 +115,13 @@ $env:MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
   - If set to `1`, during training MXNet executes the computation graph as several subgraphs in bulk mode.
 * MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN
   - Values: Int ```(default=15)```
-  - The maximum number of nodes in the subgraph executed in bulk during training(not inference). Setting this to a larger number may reduce the degree of parallelism for multi-GPU training.
+  - The maximum number of nodes in the subgraph executed in bulk during training (not inference). Setting this to a larger number may reduce the degree of parallelism for multi-GPU training.

Review comment: Not really. The thought was that forward() might be bulked to a very large extent, while backward() might want smaller segments if that allows gradient reduction to start sooner in a multi-GPU scenario. Also, I prefer consistency: the forward_bulk_size and backward_bulk_size members of CachedOpConfig were added 8 months ago, seemingly as an invitation to expand the functionality to support the bulk size separately for forward() and backward(). To add some data to the discussion, I've created a unittest that I'll be adding to this PR that tests for the perf increase of bulking. It uses a symbolic model comprising 1000 small GPU kernels.
An adapted variant of that test produced the following stats:
```
fwd & bwd bulk seg size = 0,  runtime = 32.9 msecs
fwd & bwd bulk seg size = 1,  runtime = 33.7 msecs
fwd & bwd bulk seg size = 2,  runtime = 24.8 msecs
fwd & bwd bulk seg size = 3,  runtime = 22.1 msecs
fwd & bwd bulk seg size = 4,  runtime = 20.7 msecs
fwd & bwd bulk seg size = 5,  runtime = 19.8 msecs
fwd & bwd bulk seg size = 6,  runtime = 19.6 msecs
fwd & bwd bulk seg size = 7,  runtime = 19.1 msecs
fwd & bwd bulk seg size = 8,  runtime = 18.8 msecs
fwd & bwd bulk seg size = 9,  runtime = 18.8 msecs
fwd & bwd bulk seg size = 10, runtime = 18.5 msecs
fwd & bwd bulk seg size = 11, runtime = 18.6 msecs
fwd & bwd bulk seg size = 12, runtime = 18.1 msecs
fwd & bwd bulk seg size = 13, runtime = 18.2 msecs
fwd & bwd bulk seg size = 14, runtime = 17.9 msecs
fwd & bwd bulk seg size = 15, runtime = 17.8 msecs
fwd & bwd bulk seg size = 16, runtime = 17.6 msecs
fwd & bwd bulk seg size = 17, runtime = 17.8 msecs
fwd & bwd bulk seg size = 18, runtime = 18.0 msecs
fwd & bwd bulk seg size = 19, runtime = 17.6 msecs
fwd & bwd bulk seg size = 20, runtime = 18.1 msecs
fwd & bwd bulk seg size = 21, runtime = 18.1 msecs
fwd & bwd bulk seg size = 22, runtime = 17.7 msecs
fwd & bwd bulk seg size = 23, runtime = 17.5 msecs
fwd & bwd bulk seg size = 24, runtime = 16.2 msecs
```
It shows that there's a huge benefit for this launch-bound model in going from bulk_size=1 to bulk_size=2, with diminishing returns around bulk_size=15. On the backward() pass of RN50, roughly 2 conv weight gradients are generated for every 7 nodes. That might suggest a small bulk size for backward, although it really depends on whether the parameter-server function is off the cycle-time critical path. Bottom line: if we don't have these knobs, it's hard to know whether we've tuned the platform correctly for a given model.
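The sweep above can be reproduced in outline with a script like the one below. `time_one_run` is a hypothetical stand-in for timing the forward/backward passes of the 1000-kernel symbolic model (the actual unittest in this PR drives MXNet's executor); only the `MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN` environment variable comes from the source, and note that MXNet reads it when the engine starts, so a real sweep would launch a fresh process per setting.

```python
import os


def time_one_run(bulk_size):
    """Hypothetical stand-in: in a real sweep this would launch a fresh
    MXNet process, run fwd/bwd over the 1000-kernel symbolic model, and
    return the measured runtime in msecs."""
    return 0.0  # placeholder


def sweep_bulk_sizes(max_size=24):
    """Time one run for every bulk segment size from 0 to max_size."""
    results = {}
    for bulk_size in range(max_size + 1):
        # The graph executor caps each bulked segment at this node count.
        os.environ["MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN"] = str(bulk_size)
        results[bulk_size] = time_one_run(bulk_size)
    return results


if __name__ == "__main__":
    for size, msecs in sweep_bulk_sizes().items():
        print("fwd & bwd bulk seg size = %d, runtime = %.1f msecs" % (size, msecs))
```

This mirrors how the stats table was generated: one timed run per candidate segment size, printed in the same format.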
I like to think of the environment variables as split into two categories: everyday ones, and those that one can safely ignore, which are there for the experts to play with.
