DickJC123 commented on a change in pull request #14200: Bulked op segments to allow Variable nodes
URL: https://github.com/apache/incubator-mxnet/pull/14200#discussion_r259560832
########## File path: docs/faq/env_var.md ##########
@@ -115,7 +115,13 @@ $env:MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
   - If set to `1`, during training MXNet executes the computation graph as several subgraphs in bulk mode.
 * MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN
   - Values: Int ```(default=15)```
-  - The maximum number of nodes in the subgraph executed in bulk during training(not inference). Setting this to a larger number may reduce the degree of parallelism for multi-GPU training.
+  - The maximum number of nodes in the subgraph executed in bulk during training (not inference). Setting this to a larger number may reduce the degree of parallelism for multi-GPU training.

Review comment: Not really. The thought was that forward() might be bulked to a very large extent, while backward() might want smaller segments if that allows gradient reduction to start sooner in a multi-GPU scenario. Also, I prefer consistency: the forward_bulk_size and backward_bulk_size members of CachedOpConfig were added 8 months ago, seemingly as an invitation to expand the functionality to support the bulk size separately for forward() and backward(). To add some data to the discussion, I've created a unittest that I'll be adding to this PR that tests for the perf increase of bulking. It uses a symbolic model comprising 1000 small GPU kernels.
An adapted variant of that test produced the following stats:
```
fwd & bwd bulk seg size = 0,  runtime = 32.9 msecs
fwd & bwd bulk seg size = 1,  runtime = 33.7 msecs
fwd & bwd bulk seg size = 2,  runtime = 24.8 msecs
fwd & bwd bulk seg size = 3,  runtime = 22.1 msecs
fwd & bwd bulk seg size = 4,  runtime = 20.7 msecs
fwd & bwd bulk seg size = 5,  runtime = 19.8 msecs
fwd & bwd bulk seg size = 6,  runtime = 19.6 msecs
fwd & bwd bulk seg size = 7,  runtime = 19.1 msecs
fwd & bwd bulk seg size = 8,  runtime = 18.8 msecs
fwd & bwd bulk seg size = 9,  runtime = 18.8 msecs
fwd & bwd bulk seg size = 10, runtime = 18.5 msecs
fwd & bwd bulk seg size = 11, runtime = 18.6 msecs
fwd & bwd bulk seg size = 12, runtime = 18.1 msecs
fwd & bwd bulk seg size = 13, runtime = 18.2 msecs
fwd & bwd bulk seg size = 14, runtime = 17.9 msecs
fwd & bwd bulk seg size = 15, runtime = 17.8 msecs
fwd & bwd bulk seg size = 16, runtime = 17.6 msecs
fwd & bwd bulk seg size = 17, runtime = 17.8 msecs
fwd & bwd bulk seg size = 18, runtime = 18.0 msecs
fwd & bwd bulk seg size = 19, runtime = 17.6 msecs
fwd & bwd bulk seg size = 20, runtime = 18.1 msecs
fwd & bwd bulk seg size = 21, runtime = 18.1 msecs
fwd & bwd bulk seg size = 22, runtime = 17.7 msecs
fwd & bwd bulk seg size = 23, runtime = 17.5 msecs
fwd & bwd bulk seg size = 24, runtime = 16.2 msecs
```
It shows that there's a huge benefit for this launch-bound model in going from bulk_size=1 to bulk_size=2, with diminishing returns around bulk_size=15. On the backward() pass of RN50, roughly 2 conv weight gradients are generated for every 7 nodes. That might suggest a small bulk size for backward, although it really depends on whether the parameter-server function is off the cycle-time critical path. Bottom line: if we don't have these knobs, it's hard to know whether we've tuned the platform correctly for a given model.
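The sweep above can be reproduced in outline with a script like the one below. `time_one_run` is a hypothetical stand-in for timing the forward/backward passes of the 1000-kernel symbolic model (the actual unittest in this PR drives MXNet's executor); only the `MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN` environment variable comes from the source, and note that MXNet reads it when the engine starts, so a real sweep would launch a fresh process per setting.

```python
import os


def time_one_run(bulk_size):
    """Hypothetical stand-in: in a real sweep this would launch a fresh
    MXNet process, run fwd/bwd over the 1000-kernel symbolic model, and
    return the measured runtime in msecs."""
    return 0.0  # placeholder


def sweep_bulk_sizes(max_size=24):
    """Time one run for every bulk segment size from 0 to max_size."""
    results = {}
    for bulk_size in range(max_size + 1):
        # The graph executor caps each bulked segment at this node count.
        os.environ["MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN"] = str(bulk_size)
        results[bulk_size] = time_one_run(bulk_size)
    return results


if __name__ == "__main__":
    for size, msecs in sweep_bulk_sizes().items():
        print("fwd & bwd bulk seg size = %d, runtime = %.1f msecs" % (size, msecs))
```

This mirrors how the stats table was generated: one timed run per candidate segment size, printed in the same format.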
I like to think of the environment variables as split into two categories: everyday ones, and those that one can safely ignore, which are there for the experts to play with.
