ElaineBao commented on issue #15640: Performance regression for MXNet 1.5.0
URL: 
https://github.com/apache/incubator-mxnet/issues/15640#issuecomment-514631220
 
 
   Hi, all, I've reproduced the case and confirmed initially that there is a 
regression in `Activation` operation for specific input shape. Here is the 
detailed report:
   
   (1) I run the code provided by @QueensGambit with `MKLDNN` backend (change 
`it`(iterations) in above script from 1e4 to 1e3, for efficiency), and got the 
profile statistics:
   
   ### mxnet 1.4.1
   
   ```
   Profile Statistics.
           Note that counter items are counter values and not time units.
   Device Storage
   =================
   Name                          Total Count        Time (ms)    Min Time (ms)  
  Max Time (ms)    Avg Time (ms)
   ----                          -----------        ---------    -------------  
  -------------    -------------
   Memory: cpu/0                      126343          13.5680           0.0000  
       905.4720         452.7360
   
   MXNET_C_API
   =================
   Name                          Total Count        Time (ms)    Min Time (ms)  
  Max Time (ms)    Avg Time (ms)
   ----                          -----------        ---------    -------------  
  -------------    -------------
   MXInvokeCachedOpEx                   1000           0.6340           0.0000  
         0.0020           0.0006
   MXNDArrayFree                        2000           1.2910           0.0000  
         0.0030           0.0006
   MXInvokeCachedOp                     1000        1432.7380           1.0730  
         3.3030           1.4327
   MXNDArrayWaitAll                        1       21019.3457       21019.3457  
     21019.3457       21019.3457
   MXNDArrayGetContext                  1000           0.7310           0.0000  
         0.0100           0.0007
   MXNet C API Concurrency             10002           0.0000           0.0000  
         0.0010           0.0005
   MXNet C API Calls                    5001           5.0010           0.0010  
         5.0010           2.5000
   
   operator
   =================
   Name                          Total Count        Time (ms)    Min Time (ms)  
  Max Time (ms)    Avg Time (ms)
   ----                          -----------        ---------    -------------  
  -------------    -------------
   FullyConnected                      24000        4591.6479           0.0310  
         0.5470           0.1913
   Flatten                             14000        1809.8690           0.0090  
         0.2310           0.1293
   broadcast_add                       26000       15939.2910           0.4090  
         1.7990           0.6130
   BatchNorm                           84000        3556.9929           0.0190  
         0.9100           0.0423
   broadcast_mul                       10000        2011.7510           0.1820  
         0.8450           0.2012
   Reshape                             10000          38.4070           0.0020  
         0.0260           0.0038
   Pooling                             10000         344.1310           0.0180  
         0.6350           0.0344
   Convolution                         86000        7773.5220           0.0260  
         1.1880           0.0904
   Reorder                              4716          11.4720           0.0010  
         0.0150           0.0024
   Activation                          82000        2387.3359           0.0140  
         0.6090           0.0291
   DeleteVariable                     141422         598.9450           0.0000  
         0.5270           0.0042
   ```
   ### mxnet 1.5.0
   ```
   Profile Statistics:
           Note the difference in units for different entries.
   Device Storage
   =================
   Name                          Total Count    Min Use  (kB)    Max Use  (kB)  
  Avg Use  (kB)
   ----                          -----------    -------------    -------------  
  -------------
   Memory: cpu/0                      127053           0.0000         891.9040  
       445.9520
   
   MXNET_C_API
   =================
   Name                          Total Count        Time (ms)    Min Time (ms)  
  Max Time (ms)    Avg Time (ms)
   ----                          -----------        ---------    -------------  
  -------------    -------------
   MXNDArrayWaitAll                        1       23634.1523       23634.1523  
     23634.1523       23634.1523
   MXNet C API Calls                    5001           5.0010           0.0010  
         5.0010           2.5000
   MXInvokeCachedOp                     1000        1632.0590           1.0620  
         2.6950           1.6321
   MXNDArrayGetContext                  1000           0.7340           0.0000  
         0.0120           0.0007
   MXNDArrayFree                        2000           1.4440           0.0000  
         0.0100           0.0007
   MXInvokeCachedOpEx                   1000           0.7130           0.0000  
         0.0020           0.0007
   MXNet C API Concurrency             10002           0.0000           0.0000  
         0.0010           0.0000
   
   operator
   =================
   Name                          Total Count        Time (ms)    Min Time (ms)  
  Max Time (ms)    Avg Time (ms)
   ----                          -----------        ---------    -------------  
  -------------    -------------
   broadcast_add                       26000       17322.6543           0.4180  
         1.9220           0.6663
   broadcast_mul                       10000        2184.6421           0.1850  
         0.7510           0.2185
   FullyConnected                      24000        4494.6572           0.0310  
         0.8050           0.1873
   Flatten                             14000        1890.4550           0.0090  
         0.2690           0.1350
   Convolution                         86000        7898.2749           0.0260  
         1.3600           0.0918
   Activation                          82000        4982.0112           0.0170  
         0.7210           0.0608
   BatchNorm                           84000        3896.8450           0.0210  
         0.7810           0.0464
   Pooling                             10000         315.3780           0.0170  
         0.6760           0.0315
   Reshape                             10000          68.2620           0.0040  
         0.0230           0.0068
   DeleteVariable                     142202         659.8880           0.0000  
         0.0680           0.0046
   Reorder                             32420          53.1210           0.0000  
         0.0180           0.0016
   ```
   
   Noticed from the table that `Activation` operation has a performance gap 
between 1.4.1 (avg time: 0.0291 ms) and 1.5.0 (avg time: 0.0608 ms). The 
performance of other operations are considered as normal fluctuation, 
concerning that the profiler code changed between 1.4.1 and 1.5.0.
   
   (2) to further investigate the performance regression of `Activation` 
operation, I enabled MKLDNN verbose by `export MKLDNN_VERBOSE=1`, which helps 
to collect basic statistics like execution time and input/output shapes of each 
MKLDNN operation. Parsing the verbose result and computing the average time of 
operation with different input shapes:
   
   ### mxnet 1.4.1
   ```
   op: relu,
   shape:all, total_count:38500, total_time:386.78921302000026, 
avg_time:0.010046473065454552
   shape:mb1ic128, total_count:5500, total_time:43.13965790000107, 
avg_time:0.007843574163636558
   shape:mb1ic256ih8iw8, total_count:4400, total_time:41.59277998999979, 
avg_time:0.00945290454318177
   shape:mb1ic128ih8iw8, total_count:2200, total_time:19.7434104399999, 
avg_time:0.008974277472727226
   ......
   
   op: sigmoid, 
   shape:all, total_count:5500, total_time:56.511490069998786, 
avg_time:0.010274816376363415
   shape:mb1ic256, total_count:5500, total_time:56.511490069998786, 
avg_time:0.010274816376363415
   ```
   ### mxnet 1.5.0
   ```
   op: relu,
   shape: all, total_count:38500, total_time:1070.7652004499996, 
avg_time:0.02781208312857142
   shape:mb1ic128, total_count:5500, total_time:690.7533760000006, 
avg_time:0.125591522909091
   shape:mb1ic256ih8iw8, total_count:4400, total_time:38.590089899999874, 
avg_time:0.008770474977272699
   shape:mb1ic128ih8iw8, total_count:2200, total_time:18.308107539999845, 
avg_time:0.008321867063636293
   ......
   
   op:sigmoid, 
   shape: all, total_count:5500, total_time:680.288773400003, 
avg_time:0.12368886789090965
   shape:mb1ic256, total_count:5500, total_time:680.288773400003, 
avg_time:0.12368886789090965
   ```
   Noticed from the table that for `relu`, shape of `mb1ic128` (stands for 
batch size=1, input channel=128) has a performance regression (avg_time from 
0.0078 to 0.1256), other shapes are almost the same between 1.4.1 and 1.5.0. 
And for `sigmoid`, shape of `mb1ic256` has a performance regression (avg_time 
from 0.0103 to 0.1237).
   
   On the other hand, as @QueensGambit mentioned that there's not a regression 
for all models, I think the model in CrazyAraFish_0.5.0_RiseV1.zip may have 
fewer these operations, so that the performance can be explained as above.
   
   In conclusion, MKLDNN `relu` with shape `mb1ic128` and `sigmoid` with shape 
`mb1ic256` have a performance regression from 1.4.1 to 1.5.0. We'll look into 
it.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to