XiaotaoChen opened a new pull request #13150: support mkl log when dtype is fp32 or fp64
URL: https://github.com/apache/incubator-mxnet/pull/13150
 
 
   
   
   ## Description ##
   
   1. Support an MKL Log implementation that uses vectorization.
   2. When the data type is fp32 or fp64 and req is `kWriteTo`, `mx.nd.log` calls MKL Log; otherwise it calls the default implementation.
   3. **Measured in the C++ code, MKL Log is about 1.3x to 10.3x faster than the default implementation, depending on data size and dtype.** Measured from the Python front end, the speedup is about 1.0x to 4.4x. Detailed data is below.
   4. **Speedup in the [Sockeye](https://github.com/awslabs/sockeye) model: according to the profile, `nd.log` is about 2.77x faster than the default implementation. The total time of the log operator dropped from 11.5 s to 4.1 s.** The profile output is below.
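
The dispatch rule in item 2 can be sketched as follows. This is only an illustration in Python, not the PR's C++ code: the function name `select_log_kernel` and the string labels are hypothetical, and `kAddTo` is used merely as an example of a request type other than `kWriteTo`.

```python
# Hypothetical sketch of the dispatch rule described in item 2.
# The real change lives in MXNet's C++ operator code; names here are illustrative.
MKL_DTYPES = {"float32", "float64"}


def select_log_kernel(dtype: str, req: str) -> str:
    """Return which log implementation the rule in item 2 would pick."""
    if dtype in MKL_DTYPES and req == "kWriteTo":
        return "mkl"
    return "default"


if __name__ == "__main__":
    for dtype, req in [("float32", "kWriteTo"),
                       ("float64", "kAddTo"),
                       ("float16", "kWriteTo")]:
        print(dtype, req, "->", select_log_kernel(dtype, req))
```

Under this rule, only the combination of a supported dtype and a plain write request takes the vectorized path; every other case falls back to the default implementation.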
   
   @pengzhao-intel
   
   ## Detailed data ##
   
   **Hardware: skx-8180, single socket (28 cores)**
   
   #### Statistics in the C++ code
   
   | shape          | size      | float32 default-cal-time (us) | float32 MKLLog-cal-time (us) | float32 cal-speedup | double default-cal-time (us) | double MKLLog-cal-time (us) | double cal-speedup |
   | -------------- | --------- | ----------------------------- | ---------------------------- | ------------------- | ---------------------------- | --------------------------- | ------------------ |
   | (1000, 1)      | 1000      | 6.02333                       | 0.7816                       | 7.706409928         | 7.00213                      | 1.58913                     | 4.406266322        |
   | (1000, 10)     | 10000     | 9.4944                        | 4.0589                       | 2.339155929         | 18.8873                      | 10.0668                     | 1.876197004        |
   | (1000, 100)    | 100000    | 47.3687                       | 35.2625                      | 1.343316554         | 137.633                      | 30.9184                     | 4.451491668        |
   | (1000, 1000)   | 1000000   | 390.641                       | 64.6105                      | 6.04609158          | 1309                         | 127.027                     | 10.30489581        |
   | (1000, 10000)  | 10000000  | 4516.87                       | 1601.08                      | 2.821139481         | 13872.1                      | 3071.23                     | 4.51678969         |
   | (1000, 100000) | 100000000 | 41938.7                       | 15814.1                      | 2.65198146          | 135445                       | 30538.2                     | 4.435264685        |
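
The speedup columns are the default time divided by the MKL Log time. A small sketch that recomputes them for the float32 measurements above; the helper name `speedup` and the dictionary layout are illustrative only.

```python
def speedup(default_us: float, mkl_us: float) -> float:
    """Speedup of MKL Log over the default implementation (times in microseconds)."""
    return default_us / mkl_us


# float32 timings copied from the table above: size -> (default, MKL Log)
FLOAT32_TIMES_US = {
    1000: (6.02333, 0.7816),
    10000: (9.4944, 4.0589),
    100000: (47.3687, 35.2625),
    1000000: (390.641, 64.6105),
    10000000: (4516.87, 1601.08),
    100000000: (41938.7, 15814.1),
}

if __name__ == "__main__":
    for size, (default_us, mkl_us) in FLOAT32_TIMES_US.items():
        print(f"size={size:>9}: {speedup(default_us, mkl_us):.3f}x")
```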
   
   #### Statistics from the Python front end
   
   Here is the [test script](https://github.com/XiaotaoChen/incubator-mxnet/blob/cxt-test/tests/cxt-test/mkl-log/test_mkl_log.py).
 
   
   | shape          | size      | float32 default-callback-time (us) | float32 MKLLog-callback-time (us) | float32 callback-speedup | double default-callback-time (us) | double MKLLog-callback-time (us) | double callback-speedup |
   | -------------- | --------- | ---------------------------------- | --------------------------------- | ------------------------ | --------------------------------- | -------------------------------- | ----------------------- |
   | (1000, 1)      | 1000      | 40.650368                          | 31.105677                         | 1.306847236              | 40.729841                         | 31.542778                        | 1.291257257             |
   | (1000, 10)     | 10000     | 43.606758                          | 34.761429                         | 1.254458152              | 53.819021                         | 41.135152                        | 1.308346229             |
   | (1000, 100)    | 100000    | 119.781494                         | 76.246262                         | 1.570981854              | 203.490257                        | 77.493985                        | 2.625884538             |
   | (1000, 1000)   | 1000000   | 1453.781128                        | 1190.233231                       | 1.221425423              | 2601.218224                       | 1967.120171                      | 1.322348407             |
   | (1000, 10000)  | 10000000  | 8755.636215                        | 8626.850446                       | 1.014928481              | 14128.685                         | 8942.484856                      | 1.579950677             |
   | (1000, 100000) | 100000000 | 42178.8613                         | 16119.05098                       | 2.616708723              | 135685.5551                       | 30782.89032                      | 4.407823753             |
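
The linked test script is not reproduced here. As a rough sketch of how per-call times like these could be collected, the helper below averages wall-clock time over repeated calls. The name `time_call_us` and the pure-Python stand-in workload are assumptions for illustration; with MXNet, the timed callable would presumably call `mx.nd.log(x)` and then `wait_to_read()` on the result, because NDArray operations execute asynchronously.

```python
import math
import time


def time_call_us(fn, repeat=100, warmup=10):
    """Return the average wall-clock time of fn() in microseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / repeat * 1e6


# Stand-in workload so the sketch runs without MXNet installed.
data = [float(i + 1) for i in range(1000)]
avg_us = time_call_us(lambda: [math.log(v) for v in data])

if __name__ == "__main__":
    print(f"average: {avg_us:.2f} us per call")
```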
   
   
   
   ### Sockeye profile on skx-8180 single socket
   
   **official master**
   
   ```shell
   Time of each OP:
   FullyConnected         47266.636 ms  0.77306329528     ms/call       61142  calls    27.81 %
   CopyCPU2CPU            23471.975 ms  0.640069128194    ms/call       36671  calls    13.81 %
   SliceChannel           18618.516 ms  0.611847387447    ms/call       30430  calls    10.96 %
   Reshape                17875.549 ms  2.03895848067     ms/call       8767   calls    10.52 %
   where                  12352.738 ms  2.36099732416     ms/call       5232   calls    7.27 %
   log                    11476.426 ms  6.58051949541     ms/call       1744   calls    6.75 %
   take                   6019.863  ms  0.153309810014    ms/call       39266  calls    3.54 %
   softmax                4899.333  ms  1.4046252867      ms/call       3488   calls    2.88 %
   Activation             4117.32   ms  0.0284441558262   ms/call       144751 calls    2.42 %
   _mul_scalar            4105.616  ms  2.35413761468     ms/call       1744   calls    2.42 %
   elemwise_add           3983.712  ms  0.0659664182812   ms/call       60390  calls    2.34 %
   Concat                 3906.116  ms  0.22158588609     ms/call       17628  calls    2.30 %
   broadcast_add          2765.45   ms  1.5440815187      ms/call       1791   calls    1.63 %
   batch_dot              2552.298  ms  0.731736811927    ms/call       3488   calls    1.50 %
   DeleteVariable         1857.73   ms  0.0294401128332   ms/call       63102  calls    1.09 %
   LayerNorm              1476.755  ms  0.591648637821    ms/call       2496   calls    0.87 %
   repeat                 1177.079  ms  1.19258257345     ms/call       987    calls    0.69 %
   elemwise_mul           1133.412  ms  0.0132791114548   ms/call       85353  calls    0.67 %
   _slice_assign          364.709   ms  0.121246343085    ms/call       3008   calls    0.21 %
   SetupExec              70.361    ms  0.000782510537496 ms/call       89917  calls    0.04 %
   Dropout                57.91     ms  0.0163818953324   ms/call       3535   calls    0.03 %
   Embedding              52.36     ms  0.0292350642099   ms/call       1791   calls    0.03 %
   _full                  52.006    ms  0.0282948857454   ms/call       1838   calls    0.03 %
   sum                    41.332    ms  0.0230776102736   ms/call       1791   calls    0.02 %
   stack                  37.153    ms  0.263496453901    ms/call       141    calls    0.02 %
   SequenceMask           34.81     ms  0.0199598623853   ms/call       1744   calls    0.02 %
   expand_dims            22.422    ms  0.00267949330784  ms/call       8368   calls    0.01 %
   _equal_scalar          21.718    ms  0.00311324541284  ms/call       6976   calls    0.01 %
   WaitForVar             20.932    ms  0.00427969740339  ms/call       4891   calls    0.01 %
   broadcast_logical_or   17.47     ms  0.00500860091743  ms/call       3488   calls    0.01 %
   broadcast_logical_and  11.335    ms  0.0064994266055   ms/call       1744   calls    0.01 %
   _zeros                 10.968    ms  0.0101461609621   ms/call       1081   calls    0.01 %
   SwapAxis               10.414    ms  0.110787234043    ms/call       94     calls    0.01 %
   _greater_equal         9.397     ms  0.00538818807339  ms/call       1744   calls    0.01 %
   broadcast_logical_xor  9.289     ms  0.00532626146789  ms/call       1744   calls    0.01 %
   elemwise_div           8.07      ms  0.00462729357798  ms/call       1744   calls    0.00 %
   logical_not            7.188     ms  0.00412155963303  ms/call       1744   calls    0.00 %
   SequenceReverse        6.896     ms  0.0733617021277   ms/call       94     calls    0.00 %
   _div_scalar            6.207     ms  0.00355905963303  ms/call       1744   calls    0.00 %
   tile                   3.75      ms  0.0797872340426   ms/call       47     calls    0.00 %
   SequenceLast           1.559     ms  0.033170212766    ms/call       47     calls    0.00 %
   argsort                0.786     ms  0.0167234042553   ms/call       47     calls    0.00 %
   broadcast_to           0.763     ms  0.0162340425532   ms/call       47     calls    0.00 %
   broadcast_not_equal    0.513     ms  0.010914893617    ms/call       47     calls    0.00 %
   _slice_assign_scalar   0.423     ms  0.009             ms/call       47     calls    0.00 %
   _ones                  0.346     ms  0.00736170212766  ms/call       47     calls    0.00 %
   _unravel_index         0.333     ms  0.00708510638298  ms/call       47     calls    0.00 %
   Cast                   0.254     ms  0.00540425531915  ms/call       47     calls    0.00 %
   zeros_like             0.229     ms  0.00243617021277  ms/call       94     calls    0.00 %
   
   Total OP Time: 169938.42700000 ms
   ```
   
   **mkl log**
   
   ```shell
   Time of each OP:
   FullyConnected         48077.779 ms  0.786329838736    ms/call       61142  calls    28.44 %
   CopyCPU2CPU            23618.186 ms  0.644056229718    ms/call       36671  calls    13.97 %
   SliceChannel           19328.791 ms  0.635188662504    ms/call       30430  calls    11.43 %
   Reshape                17722.157 ms  2.02146195962     ms/call       8767   calls    10.48 %
   where                  12821.864 ms  2.45066207951     ms/call       5232   calls    7.58 %
   take                   9492.064  ms  0.24173748281     ms/call       39266  calls    5.61 %
   softmax                5048.474  ms  1.44738360092     ms/call       3488   calls    2.99 %
   Activation             4407.896  ms  0.0304515754641   ms/call       144751 calls    2.61 %
   _mul_scalar            4269.909  ms  2.44834231651     ms/call       1744   calls    2.53 %
   log                    4144.673  ms  2.37653268349     ms/call       1744   calls    2.45 %
   elemwise_add           4128.241  ms  0.0683596787548   ms/call       60390  calls    2.44 %
   Concat                 4081.049  ms  0.231509473565    ms/call       17628  calls    2.41 %
   broadcast_add          2975.819  ms  1.66154048018     ms/call       1791   calls    1.76 %
   batch_dot              2289.026  ms  0.656257454128    ms/call       3488   calls    1.35 %
   DeleteVariable         1865.181  ms  0.0295567863085   ms/call       63105  calls    1.10 %
   LayerNorm              1521.639  ms  0.609631009615    ms/call       2496   calls    0.90 %
   elemwise_mul           1155.257  ms  0.013535048563    ms/call       85353  calls    0.68 %
   repeat                 1141.611  ms  1.15664741641     ms/call       987    calls    0.68 %
   _slice_assign          410.251   ms  0.136386635638    ms/call       3008   calls    0.24 %
   Embedding              103.998   ms  0.058067001675    ms/call       1791   calls    0.06 %
   SetupExec              73.053    ms  0.000812449258761 ms/call       89917  calls    0.04 %
   Dropout                61.236    ms  0.0173227722772   ms/call       3535   calls    0.04 %
   _full                  51.833    ms  0.0282007616975   ms/call       1838   calls    0.03 %
   sum                    39.877    ms  0.0222652149637   ms/call       1791   calls    0.02 %
   stack                  36.832    ms  0.261219858156    ms/call       141    calls    0.02 %
   SequenceMask           36.207    ms  0.0207608944954   ms/call       1744   calls    0.02 %
   expand_dims            22.735    ms  0.00271689770554  ms/call       8368   calls    0.01 %
   _equal_scalar          22.159    ms  0.00317646215596  ms/call       6976   calls    0.01 %
   WaitForVar             19.248    ms  0.00393378295524  ms/call       4893   calls    0.01 %
   broadcast_logical_or   18.373    ms  0.00526748853211  ms/call       3488   calls    0.01 %
   broadcast_logical_and  11.88     ms  0.0068119266055   ms/call       1744   calls    0.01 %
   SwapAxis               10.695    ms  0.113776595745    ms/call       94     calls    0.01 %
   _zeros                 10.567    ms  0.00977520814061  ms/call       1081   calls    0.01 %
   _greater_equal         9.364     ms  0.00536926605505  ms/call       1744   calls    0.01 %
   broadcast_logical_xor  9.056     ms  0.00519266055046  ms/call       1744   calls    0.01 %
   elemwise_div           7.75      ms  0.00444380733945  ms/call       1744   calls    0.00 %
   logical_not            7.266     ms  0.00416628440367  ms/call       1744   calls    0.00 %
   SequenceReverse        7.196     ms  0.0765531914894   ms/call       94     calls    0.00 %
   _div_scalar            6.596     ms  0.00378211009174  ms/call       1744   calls    0.00 %
   tile                   3.689     ms  0.0784893617021   ms/call       47     calls    0.00 %
   SequenceLast           1.646     ms  0.0350212765957   ms/call       47     calls    0.00 %
   _slice_assign_scalar   1.079     ms  0.0229574468085   ms/call       47     calls    0.00 %
   broadcast_to           1.013     ms  0.0215531914894   ms/call       47     calls    0.00 %
   argsort                0.761     ms  0.0161914893617   ms/call       47     calls    0.00 %
   broadcast_not_equal    0.547     ms  0.0116382978723   ms/call       47     calls    0.00 %
   _ones                  0.395     ms  0.00840425531915  ms/call       47     calls    0.00 %
   _unravel_index         0.327     ms  0.00695744680851  ms/call       47     calls    0.00 %
   Cast                   0.262     ms  0.00557446808511  ms/call       47     calls    0.00 %
   zeros_like             0.227     ms  0.00241489361702  ms/call       94     calls    0.00 %
   
   Total OP Time: 169075.73400000 ms
   ```
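
To compare the `log` rows of the two profiles, a profile row can be split on whitespace and the total times divided. The sketch below assumes each operator row sits on a single line in the column order shown above; the helper names `parse_op_line` and `op_speedup` are illustrative and not part of any profiler.

```python
def parse_op_line(line: str) -> dict:
    """Parse one row of the form: name total 'ms' per_call 'ms/call' calls 'calls' pct '%'."""
    fields = line.split()
    return {
        "op": fields[0],
        "total_ms": float(fields[1]),
        "ms_per_call": float(fields[3]),
        "calls": int(fields[5]),
        "percent": float(fields[7]),
    }


def op_speedup(before_line: str, after_line: str) -> float:
    """Ratio of total time before and after for the same operator."""
    before = parse_op_line(before_line)
    after = parse_op_line(after_line)
    if before["op"] != after["op"]:
        raise ValueError("rows describe different operators")
    return before["total_ms"] / after["total_ms"]


# The `log` rows copied from the two profiles above.
MASTER_LOG = "log  11476.426 ms  6.58051949541 ms/call  1744 calls  6.75 %"
MKL_LOG = "log  4144.673 ms  2.37653268349 ms/call  1744 calls  2.45 %"

if __name__ == "__main__":
    saved_ms = parse_op_line(MASTER_LOG)["total_ms"] - parse_op_line(MKL_LOG)["total_ms"]
    print(f"log speedup: {op_speedup(MASTER_LOG, MKL_LOG):.2f}x, saved {saved_ms:.1f} ms")
```

For these two rows the ratio is about 2.77, with the call count unchanged at 1744.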
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
