QueensGambit edited a comment on issue #15640: Performance regression for MXNet 1.5.0
URL: https://github.com/apache/incubator-mxnet/issues/15640#issuecomment-514323333
 
 
   Hello @roywei. Thanks for your answer.
   It's a convolutional network I designed myself. It is in principle similar to the AlphaZero network, but especially performant on CPUs (about 3x faster) and GPUs (about 1.4x faster) while maintaining similar model performance.
   
   https://github.com/QueensGambit/CrazyAra/blob/master/DeepCrazyhouse/src/domain/neural_net/architectures/rise_v2_symbol.py#L149
   
   It applies the concepts of multiple recent papers in one network:
   
   * Depthwise separable convolutions with inverted residual blocks
   MobileNetV2: Inverted Residuals and Linear Bottlenecks - Sandler et al.
   https://arxiv.org/pdf/1801.04381.pdf
   
   * Increasing filter channels per layer
   Deep Pyramidal Residual Networks - Han et al.
   https://arxiv.org/abs/1610.02915
   
   * Enhancement of activation channels (Winning entry of ImageNet 2017)
   Squeeze-and-Excitation Networks - Hu et al.
   https://arxiv.org/abs/1709.01507
   
   Our paper, which describes the network architecture in more detail, will be published as well.
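   The combination of these ideas can be sketched in Gluon. The block below is a simplified illustration with my own choice of expansion factor and SE ratio, not the actual RISE block linked above:
   
   ```python
   import mxnet as mx
   from mxnet.gluon import nn
   
   class InvertedResidualSE(nn.HybridBlock):
       """Simplified sketch: inverted residual block with a depthwise
       separable convolution and a squeeze-and-excitation gate.
       Illustrative only -- not the exact RISE architecture."""
       def __init__(self, channels, expansion=2, se_ratio=4, **kwargs):
           super(InvertedResidualSE, self).__init__(**kwargs)
           mid = channels * expansion
           with self.name_scope():
               self.body = nn.HybridSequential()
               self.body.add(
                   nn.Conv2D(mid, 1, use_bias=False),        # 1x1 expansion (MobileNetV2)
                   nn.BatchNorm(), nn.Activation('relu'),
                   nn.Conv2D(mid, 3, padding=1, groups=mid,  # 3x3 depthwise convolution
                             use_bias=False),
                   nn.BatchNorm(), nn.Activation('relu'),
                   nn.Conv2D(channels, 1, use_bias=False),   # 1x1 linear projection
                   nn.BatchNorm())
               self.pool = nn.GlobalAvgPool2D()              # SE: squeeze
               self.fc1 = nn.Dense(channels // se_ratio, activation='relu')
               self.fc2 = nn.Dense(channels, activation='sigmoid')
   
       def hybrid_forward(self, F, x):
           out = self.body(x)
           w = self.fc2(self.fc1(self.pool(out)))            # SE: excitation weights, (N, C)
           w = w.expand_dims(axis=2).expand_dims(axis=3)     # -> (N, C, 1, 1)
           out = F.broadcast_mul(out, w)                     # channel-wise rescaling
           return F.Activation(out + x, act_type='relu')     # residual connection
   ```
   
   Note that the channel-wise rescaling is exactly where the `broadcast_mul` operators in the profiles below come from.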
   
   ---
   
   The output `cp 71 depth 30 nodes 697` is related to the UCI protocol (UCI - Universal Chess Interface):
   * `cp` - Centipawn metric for how good the engine evaluates the given board position from the view of the current player. `cp 100` means a +1 pawn advantage.
   * `nodes` - Number of evaluated board positions after running the search for X milliseconds. Higher values are better.
   * `depth` - Maximum depth a rollout of the MCTS search has reached during the search.
   * `go movetime 5000` - Defines for how many milliseconds the engine should search the current position (by default the starting position).
   * `pv` - Stands for principal variation: the best line of play suggested by the engine for both players.
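   For convenience, such `info` lines can be parsed with a small helper (hypothetical code, not part of CrazyAra):
   
   ```python
   # Minimal parser for the UCI fields discussed above (hypothetical helper).
   # `pv` consumes the rest of the line, since it always comes last.
   def parse_uci_info(line):
       tokens = line.split()
       info = {}
       i = 0
       while i < len(tokens):
           tok = tokens[i]
           if tok == 'score':                  # 'score cp 289' -> info['cp'] = 289
               info[tokens[i + 1]] = int(tokens[i + 2])
               i += 3
           elif tok in ('depth', 'nodes', 'time', 'nps'):
               info[tok] = int(tokens[i + 1])
               i += 2
           elif tok == 'pv':                   # principal variation: rest of the line
               info['pv'] = tokens[i + 1:]
               break
           else:
               i += 1
       return info
   
   line = 'info score cp 289 depth 20 nodes 873 time 9751 nps 89 pv e2e4 e7e5'
   parsed = parse_uci_info(line)
   # parsed['cp'] == 289, parsed['nodes'] == 873, parsed['pv'][0] == 'e2e4'
   ```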
   ---
   I changed the profiling code by replacing the executor handle with a Gluon network object.
   This version is slower, but provides more detail on where the runtime is spent:
   ```python
   import mxnet as mx
   from mxnet import profiler
   
   ctx = mx.cpu()
   sample = mx.ndarray.zeros((1, 34, 8, 8), ctx=ctx)
   it = int(1e4)
   
   model_arch_path = 'model-1.19246-0.603-symbol.json'
   model_params_path = 'model-1.19246-0.603-0223.params'
   symbol = mx.sym.load(model_arch_path)
   inputs = mx.sym.var('data', dtype='float32')
   value_out = symbol.get_internals()['value_tanh0_output']
   policy_out = symbol.get_internals()['flatten0_output']
   sym = mx.symbol.Group([value_out, policy_out])
   net = mx.gluon.SymbolBlock(sym, inputs)
   net.collect_params().load(model_params_path, ctx)
   
   def run_inference():
       out = net(sample)
   
   profiler.set_config(profile_all=True,
                       aggregate_stats=True,
                       filename='profile_output.json')
   
   # warm-up
   for i in range(100):
       run_inference()
   
   # Ask the profiler to start recording
   profiler.set_state('run')
   
   for i in range(it):
       run_inference()
   
   # Make sure all operations have completed
   mx.nd.waitall()
   # Ask the profiler to stop recording
   profiler.set_state('stop')
   # Dump all results to log file before download
   profiler.dump()
   ```
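   The dumped `profile_output.json` can also be inspected outside of `profiler.dumps()`: it opens in `chrome://tracing`, and per-operator totals can be computed by summing matched begin/end events. A hypothetical helper, assuming the standard Chrome trace-event `"ph": "B"`/`"E"` pairs:
   
   ```python
   import json
   from collections import defaultdict
   
   def aggregate_trace(path):
       """Sum per-name durations from a Chrome trace-event dump.
       Hypothetical helper; assumes paired 'B' (begin) / 'E' (end) events."""
       with open(path) as f:
           events = json.load(f)['traceEvents']
       totals = defaultdict(float)
       begins = defaultdict(list)            # stack of begin timestamps per name
       for ev in events:
           name = ev.get('name')
           if ev.get('ph') == 'B':
               begins[name].append(ev['ts'])
           elif ev.get('ph') == 'E' and begins[name]:
               totals[name] += ev['ts'] - begins[name].pop()
       return dict(totals)                   # name -> total time in microseconds
   ```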
   
   ### MXNet 1.4.1
   ```python
   print(profiler.dumps())
   ```
   
   ```
   Profile Statistics.
        Note that counter items are counter values and not time units.
   Device Storage
   =================
   Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
   ----                          -----------        ---------    -------------    -------------    -------------
   Memory: cpu/0                     1170861          13.5680           0.0000         905.4720         452.7360
   
   MXNET_C_API
   =================
   Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
   ----                          -----------        ---------    -------------    -------------    -------------
   MXInvokeCachedOpEx                  10000           6.5960           0.0000           0.0140           0.0007
   MXNDArrayFree                       20000          13.3320           0.0000           0.0140           0.0007
   MXInvokeCachedOp                    10000        8961.5879           0.6330          10.6770           0.8962
   MXNDArrayWaitAll                        1      128982.7891      128982.7891      128982.7891      128982.7891
   MXNDArrayGetContext                 10000           6.3600           0.0000           0.0140           0.0006
   MXNet C API Concurrency            100002           0.0000           0.0000           0.0010           0.0005
   MXNet C API Calls                   50001          50.0010           0.0010          50.0010          25.0000
   
   operator
   =================
   Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
   ----                          -----------        ---------    -------------    -------------    -------------
   Reorder                             42326          78.8710           0.0000           0.0980           0.0019
   DeleteVariable                    1312004        5104.0220           0.0000           5.2050           0.0039
   Activation                         820000       13786.3789           0.0060           2.6830           0.0168
   Convolution                        860000       87381.4688           0.0150          11.2850           0.1016
   Pooling                            100000        1611.6990           0.0100           0.7420           0.0161
   Reshape                            100000         296.2550           0.0010           0.0560           0.0030
   BatchNorm                          840000       21546.2461           0.0100           2.1610           0.0257
   broadcast_mul                      100000       13171.7432           0.1030           0.9910           0.1317
   broadcast_add                      260000       88638.4297           0.2000          13.5370           0.3409
   Flatten                            140000       10471.5117           0.0070           0.7100           0.0748
   FullyConnected                     240000       18911.3770           0.0160           0.7730           0.0788
   ```
   
   ### MXNet 1.5.0
   ```python
   print(profiler.dumps())
   ```
   ```
   Profile Statistics:
        Note the difference in units for different entries.
   Device Storage
   =================
   Name                          Total Count    Min Use  (kB)    Max Use  (kB)    Avg Use  (kB)
   ----                          -----------    -------------    -------------    -------------
   Memory: cpu/0                     1171335           0.0000         891.9040         445.9520
   
   MXNET_C_API
   =================
   Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
   ----                          -----------        ---------    -------------    -------------    -------------
   MXNDArrayWaitAll                        1      129400.2969      129400.2969      129400.2969      129400.2969
   MXNet C API Calls                   50001          50.0010           0.0010          50.0010          25.0000
   MXInvokeCachedOp                    10000       13172.8516           0.7930          15.1500           1.3173
   MXNDArrayGetContext                 10000           7.4100           0.0000           0.0180           0.0007
   MXInvokeCachedOpEx                  10000           7.2500           0.0000           0.0180           0.0007
   MXNDArrayFree                       20000          14.4550           0.0000           0.0690           0.0007
   MXNet C API Concurrency            100002           0.0000           0.0000           0.0010           0.0000
   
   operator
   =================
   Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
   ----                          -----------        ---------    -------------    -------------    -------------
   broadcast_add                      260000       91174.6172           0.2120           8.5150           0.3507
   broadcast_mul                      100000       13579.5371           0.1100           1.4710           0.1358
   Convolution                        860000       88691.5859           0.0160           9.0500           0.1031
   FullyConnected                     240000       20774.8105           0.0200           5.6790           0.0866
   Flatten                            140000       10516.3486           0.0060           0.7840           0.0751
   BatchNorm                          840000       23081.5391           0.0120          10.7910           0.0275
   Pooling                            100000        1740.1880           0.0110           0.4180           0.0174
   Activation                         820000       14118.7568           0.0070           5.6640           0.0172
   Reshape                            100000         537.9680           0.0020           0.0780           0.0054
   DeleteVariable                    1312516        5137.8071           0.0000           0.5320           0.0039
   Reorder                            286980         508.2870           0.0000           1.4380           0.0018
   ```
   The `Convolution` operation took longer for this model, although it should have been faster according to some of your benchmarks. Did you change the MKL version which MXNet is linked against?
   This might also be due to the usage of depthwise separable convolutions.
   The `Activation`, `BatchNorm` and `broadcast_add` operations also took more time.
   In general, `broadcast_add` takes much longer than one might expect - here even more total time than `Convolution`.
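   That `broadcast_add` shows up so prominently is plausible if it is memory-bandwidth-bound: it performs a single addition per element while reading and writing every element of large feature maps. A rough, MXNet-independent illustration with NumPy (the shapes are illustrative, not taken from the model):
   
   ```python
   import time
   import numpy as np
   
   # Broadcast-add over feature-map-like tensors: one read per input element,
   # one write per output element, one addition in between -- the cost is
   # dominated by memory traffic, not FLOPs.
   x = np.random.rand(256, 256, 8, 8).astype(np.float32)   # (N, C, H, W)
   bias = np.random.rand(1, 256, 1, 1).astype(np.float32)  # broadcast over N, H, W
   
   start = time.perf_counter()
   y = x + bias            # NumPy's equivalent of broadcast_add
   elapsed = time.perf_counter() - start
   
   # Bytes moved: read all of x, write all of y (bias is negligible).
   traffic_mb = 2 * x.nbytes / 1e6
   print('%.1f MB of memory traffic in %.4f s' % (traffic_mb, elapsed))
   ```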
   
   ---
   
   Surprisingly, the regression does not affect all models.
   For instance, the number of evaluated positions per second (nps) slightly increased for a model which is closer to the traditional AlphaZero architecture, `CrazyAraFish_0.5.0_RiseV1.zip` from release CrazyAra 0.5.0:
   
   ### MXNet 1.4.1
   ```
   go movetime 10000
   ```
   ```
   info score cp 289 depth 20 nodes 873 time 9751 nps 89 pv e2e4 e7e5 b1c3 b8c6 g1f3 f8c5 f1c4 g8f6 d2d3 e8g8 e1g1 d7d6 c1g5 h7h6 g5h4 c8g4 c3d5 f6d5 h4d8 a8d8 c4a6
   bestmove e2e4
   ```
   
   ### MXNet 1.5.0
   ```
   go movetime 10000
   ```
   ```
   info score cp 290 depth 20 nodes 889 time 9755 nps 91 pv e2e4 e7e5 b1c3 b8c6 g1f3 f8c5 f1c4 g8f6 d2d3 e8g8 e1g1 d7d6 c1g5 h7h6 g5h4 c8g4 c3d5 f6d5 h4d8 a8d8 c4a6
   ```
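   As a sanity check, the reported `nps` matches `nodes / time` in both runs, so the small speed-up is real rather than a reporting artifact (plain Python; the numbers are taken from the two `info` lines above):
   
   ```python
   # nps = nodes / (time in seconds), truncated to an integer
   # (assumption about the engine's rounding; values from the info lines above).
   def nps(nodes, time_ms):
       return int(nodes / (time_ms / 1000.0))
   
   assert nps(873, 9751) == 89   # MXNet 1.4.1 run
   assert nps(889, 9755) == 91   # MXNet 1.5.0 run
   ```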
   