QueensGambit edited a comment on issue #15640: Performance regression for MXNet 1.5.0
URL: https://github.com/apache/incubator-mxnet/issues/15640#issuecomment-514323333

Hello @roywei, thanks for your answer.

It's an adapted convolutional network I designed myself. It is in principle similar to the AlphaZero network, but particularly performant on CPUs (about 3x faster) and GPUs (about 1.4x faster) while maintaining similar performance.
https://github.com/QueensGambit/CrazyAra/blob/master/DeepCrazyhouse/src/domain/neural_net/architectures/rise_v2_symbol.py#L149

It applies the concepts of several recent papers in one network:

* Separable depthwise convolutions with inverted residual blocks
  MobileNetV2: Inverted Residuals and Linear Bottlenecks - Sandler et al.
  https://arxiv.org/pdf/1801.04381.pdf
* Increasing filter channels per layer
  Deep Pyramidal Residual Networks - Han et al.
  https://arxiv.org/abs/1610.02915
* Enhancement of activation channels (winning entry of ImageNet 2017)
  Squeeze-and-Excitation Networks - Hu et al.
  https://arxiv.org/abs/1709.01507

Our paper, which describes the network architecture in more detail, will be published as well.

---

The output `cp 71 depth 30 nodes 697` is related to the UCI protocol (UCI - Universal Chess Interface):

* `cp` - Centi-pawn metric: how favourably the engine evaluates the given board position from the view of the current player. `cp 100` means a +1 pawn advantage.
* `nodes` - Number of evaluated board positions after running the search for X milliseconds. Higher values are better.
* `depth` - Maximum depth a rollout of the MCTS search reached during the search.
* `go movetime 5000` - Defines for how long, in milliseconds, the engine should search the current position (by default the starting position).

---

I changed the profiling code by replacing the executor handle with a gluon network object.
This version is slower, but provides more details on where the runtime is spent:

```python
import mxnet as mx
from mxnet import profiler

sample = mx.ndarray.zeros((1, 34, 8, 8))
it = int(1e4)
ctx = mx.cpu()

model_arch_path = 'model-1.19246-0.603-symbol.json'
model_params_path = 'model-1.19246-0.603-0223.params'

# Load the symbol and group the two output heads together
symbol = mx.sym.load(model_arch_path)
inputs = mx.sym.var('data', dtype='float32')
value_out = symbol.get_internals()['value_tanh0_output']
policy_out = symbol.get_internals()['flatten0_output']
sym = mx.symbol.Group([value_out, policy_out])
net = mx.gluon.SymbolBlock(sym, inputs)
net.collect_params().load(model_params_path, ctx)

def run_inference():
    out = net(sample)

profiler.set_config(profile_all=True,
                    aggregate_stats=True,
                    filename='profile_output.json')

# warm-up
for i in range(100):
    run_inference()

# Ask the profiler to start recording
profiler.set_state('run')

for i in range(it):
    run_inference()

# Make sure all operations have completed
mx.nd.waitall()

# Ask the profiler to stop recording
profiler.set_state('stop')
# Dump all results to log file before download
profiler.dump()
```

### MXNet 1.4.1

```python
print(profiler.dumps())
```

```
Profile Statistics. Note that counter items are counter values and not time units.
Device Storage
=================
Name                       Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                       -----------        ---------    -------------    -------------    -------------
Memory: cpu/0                  1170861          13.5680           0.0000         905.4720         452.7360

MXNET_C_API
=================
Name                       Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                       -----------        ---------    -------------    -------------    -------------
MXInvokeCachedOpEx               10000           6.5960           0.0000           0.0140           0.0007
MXNDArrayFree                    20000          13.3320           0.0000           0.0140           0.0007
MXInvokeCachedOp                 10000        8961.5879           0.6330          10.6770           0.8962
MXNDArrayWaitAll                     1      128982.7891      128982.7891      128982.7891      128982.7891
MXNDArrayGetContext              10000           6.3600           0.0000           0.0140           0.0006
MXNet C API Concurrency         100002           0.0000           0.0000           0.0010           0.0005
MXNet C API Calls                50001          50.0010           0.0010          50.0010          25.0000

operator
=================
Name                       Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                       -----------        ---------    -------------    -------------    -------------
Reorder                          42326          78.8710           0.0000           0.0980           0.0019
DeleteVariable                 1312004        5104.0220           0.0000           5.2050           0.0039
Activation                      820000       13786.3789           0.0060           2.6830           0.0168
Convolution                     860000       87381.4688           0.0150          11.2850           0.1016
Pooling                         100000        1611.6990           0.0100           0.7420           0.0161
Reshape                         100000         296.2550           0.0010           0.0560           0.0030
BatchNorm                       840000       21546.2461           0.0100           2.1610           0.0257
broadcast_mul                   100000       13171.7432           0.1030           0.9910           0.1317
broadcast_add                   260000       88638.4297           0.2000          13.5370           0.3409
Flatten                         140000       10471.5117           0.0070           0.7100           0.0748
FullyConnected                  240000       18911.3770           0.0160           0.7730           0.0788
```

### MXNet 1.5.0

```python
print(profiler.dumps())
```

```
Profile Statistics: Note the difference in units for different entries.
Device Storage
=================
Name                       Total Count     Min Use (kB)     Max Use (kB)     Avg Use (kB)
----                       -----------     ------------     ------------     ------------
Memory: cpu/0                  1171335           0.0000         891.9040         445.9520

MXNET_C_API
=================
Name                       Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                       -----------        ---------    -------------    -------------    -------------
MXNDArrayWaitAll                     1      129400.2969      129400.2969      129400.2969      129400.2969
MXNet C API Calls                50001          50.0010           0.0010          50.0010          25.0000
MXInvokeCachedOp                 10000       13172.8516           0.7930          15.1500           1.3173
MXNDArrayGetContext              10000           7.4100           0.0000           0.0180           0.0007
MXInvokeCachedOpEx               10000           7.2500           0.0000           0.0180           0.0007
MXNDArrayFree                    20000          14.4550           0.0000           0.0690           0.0007
MXNet C API Concurrency         100002           0.0000           0.0000           0.0010           0.0000

operator
=================
Name                       Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                       -----------        ---------    -------------    -------------    -------------
broadcast_add                   260000       91174.6172           0.2120           8.5150           0.3507
broadcast_mul                   100000       13579.5371           0.1100           1.4710           0.1358
Convolution                     860000       88691.5859           0.0160           9.0500           0.1031
FullyConnected                  240000       20774.8105           0.0200           5.6790           0.0866
Flatten                         140000       10516.3486           0.0060           0.7840           0.0751
BatchNorm                       840000       23081.5391           0.0120          10.7910           0.0275
Pooling                         100000        1740.1880           0.0110           0.4180           0.0174
Activation                      820000       14118.7568           0.0070           5.6640           0.0172
Reshape                         100000         537.9680           0.0020           0.0780           0.0054
DeleteVariable                 1312516        5137.8071           0.0000           0.5320           0.0039
Reorder                         286980         508.2870           0.0000           1.4380           0.0018
```

The `Convolution` operation took longer for this model, although it should have been faster according to some of your benchmarks. Did you change the MKL version that MXNet is linked to? This might also be due to the usage of separable depthwise convolutions. The `Activation`, `BatchNorm` and `broadcast_add` operations also took more time. In general, `broadcast_add` takes much longer than one might expect; here it takes more time than `Convolution`.

---

Surprisingly, the regression does not affect all models.
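To make the operator-level comparison concrete, here is a quick sketch (numbers copied from the two profiler dumps above) that computes the per-call average and the relative change per operator:

```python
# Call counts and total operator time in ms, copied from the
# MXNet 1.4.1 and 1.5.0 profiler dumps above.
totals = {
    # op name          calls    1.4.1 ms     1.5.0 ms
    'Convolution':   (860000,  87381.4688,  88691.5859),
    'broadcast_add': (260000,  88638.4297,  91174.6172),
    'BatchNorm':     (840000,  21546.2461,  23081.5391),
    'Activation':    (820000,  13786.3789,  14118.7568),
}

for name, (calls, old_ms, new_ms) in totals.items():
    change = (new_ms - old_ms) / old_ms * 100
    print(f'{name:15s} {old_ms / calls * 1000:8.2f} us -> '
          f'{new_ms / calls * 1000:8.2f} us per call ({change:+.1f}%)')
```

Every tracked operator is a few percent slower per call in 1.5.0, so the regression is spread across operators rather than isolated to one kernel.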
For instance, the evaluated positions per second (nps) slightly increased for a model that is closer to the traditional AlphaZero architecture (`CrazyAraFish_0.5.0_RiseV1.zip` from release CrazyAra 0.5.0):

### MXNet 1.4.1

```
go movetime 10000
```

```
info score cp 289 depth 20 nodes 873 time 9751 nps 89 pv e2e4 e7e5 b1c3 b8c6 g1f3 f8c5 f1c4 g8f6 d2d3 e8g8 e1g1 d7d6 c1g5 h7h6 g5h4 c8g4 c3d5 f6d5 h4d8 a8d8 c4a6
bestmove e2e4
```

### MXNet 1.5.0

```
go movetime 10000
```

```
info score cp 290 depth 20 nodes 889 time 9755 nps 91 pv e2e4 e7e5 b1c3 b8c6 g1f3 f8c5 f1c4 g8f6 d2d3 e8g8 e1g1 d7d6 c1g5 h7h6 g5h4 c8g4 c3d5 f6d5 h4d8 a8d8 c4a6
```
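For reference, `info` lines like the ones above can be parsed mechanically. This is a hypothetical helper (not part of CrazyAra or MXNet) that extracts the UCI fields discussed earlier:

```python
def parse_uci_info(line):
    """Parse a UCI `info` line into a dict; the `pv` key (principal
    variation) swallows the rest of the line as a move list."""
    tokens = line.split()
    assert tokens[0] == 'info'
    info, i = {}, 1
    while i < len(tokens):
        key = tokens[i]
        if key == 'score':       # 'score cp <value>'
            info['cp'] = int(tokens[i + 2])
            i += 3
        elif key == 'pv':        # everything after 'pv' is the move list
            info['pv'] = tokens[i + 1:]
            break
        else:                    # depth, nodes, time, nps, ...
            info[key] = int(tokens[i + 1])
            i += 2
    return info

line = ('info score cp 290 depth 20 nodes 889 time 9755 nps 91 '
        'pv e2e4 e7e5 b1c3 b8c6')
print(parse_uci_info(line))
```

With this, the nps values from the two runs can be diffed programmatically instead of by eye.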
