XiaotaoChen opened a new pull request #13150: support mkl log when dtype is fp32 or fp64
URL: https://github.com/apache/incubator-mxnet/pull/13150

## Description ##
1. Support MKL Log, which is implemented with vectorization.
2. When the data type is fp32 or fp64 and `req` is `kWriteTo`, `mx.nd.log` calls MKL Log. Otherwise, it calls the default implementation.
3. **Measured in the C++ code, MKL Log is roughly 1.3 to 10.3 times faster, depending on data size and dtype.** Measured from the Python layer, it is roughly 1.0 to 4.4 times faster. Detailed data is below.
4. **Speedup in the [Sockeye](https://github.com/awslabs/sockeye) model: according to the profile, `nd.log` is 2.67 times faster than the default implementation, and the total time of the log operator dropped from 11.5s to 4.3s.** The profile output is below.

@pengzhao-intel

## Detailed data
**hardware: skx-8180 single socket (28 cores)**

#### statistics in c++ code

| shape | size | float32 default-cal-time (us) | float32 MKLLog-cal-time (us) | float32 cal-speedup | double default-cal-time (us) | double MKLLog-cal-time (us) | double cal-speedup |
| -------------- | --------- | ---------------- | --------------- | ----------- | ---------------- | --------------- | ----------- |
| (1000, 1) | 1000 | 6.02333 | 0.7816 | 7.706409928 | 7.00213 | 1.58913 | 4.406266322 |
| (1000, 10) | 10000 | 9.4944 | 4.0589 | 2.339155929 | 18.8873 | 10.0668 | 1.876197004 |
| (1000, 100) | 100000 | 47.3687 | 35.2625 | 1.343316554 | 137.633 | 30.9184 | 4.451491668 |
| (1000, 1000) | 1000000 | 390.641 | 64.6105 | 6.04609158 | 1309 | 127.027 | 10.30489581 |
| (1000, 10000) | 10000000 | 4516.87 | 1601.08 | 2.821139481 | 13872.1 | 3071.23 | 4.51678969 |
| (1000, 100000) | 100000000 | 41938.7 | 15814.1 | 2.65198146 | 135445 | 30538.2 | 4.435264685 |

#### statistics in python front end
Here is the [test script](https://github.com/XiaotaoChen/incubator-mxnet/blob/cxt-test/tests/cxt-test/mkl-log/test_mkl_log.py).

| shape | size | float32 default-callback-time (us) | float32 MKLLog-callback-time (us) | float32 callback-speedup | double default-callback-time (us) | double MKLLog-callback-time (us) | double callback-speedup |
| -------------- | --------- | --------------------- | -------------------- | ---------------- | --------------------- | -------------------- | ---------------- |
| (1000, 1) | 1000 | 40.650368 | 31.105677 | 1.306847236 | 40.729841 | 31.542778 | 1.291257257 |
| (1000, 10) | 10000 | 43.606758 | 34.761429 | 1.254458152 | 53.819021 | 41.135152 | 1.308346229 |
| (1000, 100) | 100000 | 119.781494 | 76.246262 | 1.570981854 | 203.490257 | 77.493985 | 2.625884538 |
| (1000, 1000) | 1000000 | 1453.781128 | 1190.233231 | 1.221425423 | 2601.218224 | 1967.120171 | 1.322348407 |
| (1000, 10000) | 10000000 | 8755.636215 | 8626.850446 | 1.014928481 | 14128.685 | 8942.484856 | 1.579950677 |
| (1000, 100000) | 100000000 | 42178.8613 | 16119.05098 | 2.616708723 | 135685.5551 | 30782.89032 | 4.407823753 |

### sockeye profile on skx-8180 single socket
**official master**
```shell
Time of each OP:
FullyConnected 47266.636 ms 0.77306329528 ms/call 61142 calls 27.81 %
CopyCPU2CPU 23471.975 ms 0.640069128194 ms/call 36671 calls 13.81 %
SliceChannel 18618.516 ms 0.611847387447 ms/call 30430 calls 10.96 %
Reshape 17875.549 ms 2.03895848067 ms/call 8767 calls 10.52 %
where 12352.738 ms 2.36099732416 ms/call 5232 calls 7.27 %
log 11476.426 ms 6.58051949541 ms/call 1744 calls 6.75 %
take 6019.863 ms 0.153309810014 ms/call 39266 calls 3.54 %
softmax 4899.333 ms 1.4046252867 ms/call 3488 calls 2.88 %
Activation 4117.32 ms 0.0284441558262 ms/call 144751 calls 2.42 %
_mul_scalar 4105.616 ms 2.35413761468 ms/call 1744 calls 2.42 %
elemwise_add 3983.712 ms 0.0659664182812 ms/call 60390 calls 2.34 %
Concat 3906.116 ms 0.22158588609 ms/call 17628 calls 2.30 %
broadcast_add 2765.45 ms 1.5440815187 ms/call 1791 calls 1.63 %
batch_dot 2552.298 ms 0.731736811927 ms/call 3488 calls 1.50 %
DeleteVariable 1857.73 ms 0.0294401128332 ms/call 63102 calls 1.09 %
LayerNorm 1476.755 ms 0.591648637821 ms/call 2496 calls 0.87 %
repeat 1177.079 ms 1.19258257345 ms/call 987 calls 0.69 %
elemwise_mul 1133.412 ms 0.0132791114548 ms/call 85353 calls 0.67 %
_slice_assign 364.709 ms 0.121246343085 ms/call 3008 calls 0.21 %
SetupExec 70.361 ms 0.000782510537496 ms/call 89917 calls 0.04 %
Dropout 57.91 ms 0.0163818953324 ms/call 3535 calls 0.03 %
Embedding 52.36 ms 0.0292350642099 ms/call 1791 calls 0.03 %
_full 52.006 ms 0.0282948857454 ms/call 1838 calls 0.03 %
sum 41.332 ms 0.0230776102736 ms/call 1791 calls 0.02 %
stack 37.153 ms 0.263496453901 ms/call 141 calls 0.02 %
SequenceMask 34.81 ms 0.0199598623853 ms/call 1744 calls 0.02 %
expand_dims 22.422 ms 0.00267949330784 ms/call 8368 calls 0.01 %
_equal_scalar 21.718 ms 0.00311324541284 ms/call 6976 calls 0.01 %
WaitForVar 20.932 ms 0.00427969740339 ms/call 4891 calls 0.01 %
broadcast_logical_or 17.47 ms 0.00500860091743 ms/call 3488 calls 0.01 %
broadcast_logical_and 11.335 ms 0.0064994266055 ms/call 1744 calls 0.01 %
_zeros 10.968 ms 0.0101461609621 ms/call 1081 calls 0.01 %
SwapAxis 10.414 ms 0.110787234043 ms/call 94 calls 0.01 %
_greater_equal 9.397 ms 0.00538818807339 ms/call 1744 calls 0.01 %
broadcast_logical_xor 9.289 ms 0.00532626146789 ms/call 1744 calls 0.01 %
elemwise_div 8.07 ms 0.00462729357798 ms/call 1744 calls 0.00 %
logical_not 7.188 ms 0.00412155963303 ms/call 1744 calls 0.00 %
SequenceReverse 6.896 ms 0.0733617021277 ms/call 94 calls 0.00 %
_div_scalar 6.207 ms 0.00355905963303 ms/call 1744 calls 0.00 %
tile 3.75 ms 0.0797872340426 ms/call 47 calls 0.00 %
SequenceLast 1.559 ms 0.033170212766 ms/call 47 calls 0.00 %
argsort 0.786 ms 0.0167234042553 ms/call 47 calls 0.00 %
broadcast_to 0.763 ms 0.0162340425532 ms/call 47 calls 0.00 %
broadcast_not_equal 0.513 ms 0.010914893617 ms/call 47 calls 0.00 %
_slice_assign_scalar 0.423 ms 0.009 ms/call 47 calls 0.00 %
_ones 0.346 ms 0.00736170212766 ms/call 47 calls 0.00 %
_unravel_index 0.333 ms 0.00708510638298 ms/call 47 calls 0.00 %
Cast 0.254 ms 0.00540425531915 ms/call 47 calls 0.00 %
zeros_like 0.229 ms 0.00243617021277 ms/call 94 calls 0.00 %
Total OP Time: 169938.42700000 ms
```
**mkl log**
```shell
Time of each OP:
FullyConnected 48077.779 ms 0.786329838736 ms/call 61142 calls 28.44 %
CopyCPU2CPU 23618.186 ms 0.644056229718 ms/call 36671 calls 13.97 %
SliceChannel 19328.791 ms 0.635188662504 ms/call 30430 calls 11.43 %
Reshape 17722.157 ms 2.02146195962 ms/call 8767 calls 10.48 %
where 12821.864 ms 2.45066207951 ms/call 5232 calls 7.58 %
take 9492.064 ms 0.24173748281 ms/call 39266 calls 5.61 %
softmax 5048.474 ms 1.44738360092 ms/call 3488 calls 2.99 %
Activation 4407.896 ms 0.0304515754641 ms/call 144751 calls 2.61 %
_mul_scalar 4269.909 ms 2.44834231651 ms/call 1744 calls 2.53 %
log 4144.673 ms 2.37653268349 ms/call 1744 calls 2.45 %
elemwise_add 4128.241 ms 0.0683596787548 ms/call 60390 calls 2.44 %
Concat 4081.049 ms 0.231509473565 ms/call 17628 calls 2.41 %
broadcast_add 2975.819 ms 1.66154048018 ms/call 1791 calls 1.76 %
batch_dot 2289.026 ms 0.656257454128 ms/call 3488 calls 1.35 %
DeleteVariable 1865.181 ms 0.0295567863085 ms/call 63105 calls 1.10 %
LayerNorm 1521.639 ms 0.609631009615 ms/call 2496 calls 0.90 %
elemwise_mul 1155.257 ms 0.013535048563 ms/call 85353 calls 0.68 %
repeat 1141.611 ms 1.15664741641 ms/call 987 calls 0.68 %
_slice_assign 410.251 ms 0.136386635638 ms/call 3008 calls 0.24 %
Embedding 103.998 ms 0.058067001675 ms/call 1791 calls 0.06 %
SetupExec 73.053 ms 0.000812449258761 ms/call 89917 calls 0.04 %
Dropout 61.236 ms 0.0173227722772 ms/call 3535 calls 0.04 %
_full 51.833 ms 0.0282007616975 ms/call 1838 calls 0.03 %
sum 39.877 ms 0.0222652149637 ms/call 1791 calls 0.02 %
stack 36.832 ms 0.261219858156 ms/call 141 calls 0.02 %
SequenceMask 36.207 ms 0.0207608944954 ms/call 1744 calls 0.02 %
expand_dims 22.735 ms 0.00271689770554 ms/call 8368 calls 0.01 %
_equal_scalar 22.159 ms 0.00317646215596 ms/call 6976 calls 0.01 %
WaitForVar 19.248 ms 0.00393378295524 ms/call 4893 calls 0.01 %
broadcast_logical_or 18.373 ms 0.00526748853211 ms/call 3488 calls 0.01 %
broadcast_logical_and 11.88 ms 0.0068119266055 ms/call 1744 calls 0.01 %
SwapAxis 10.695 ms 0.113776595745 ms/call 94 calls 0.01 %
_zeros 10.567 ms 0.00977520814061 ms/call 1081 calls 0.01 %
_greater_equal 9.364 ms 0.00536926605505 ms/call 1744 calls 0.01 %
broadcast_logical_xor 9.056 ms 0.00519266055046 ms/call 1744 calls 0.01 %
elemwise_div 7.75 ms 0.00444380733945 ms/call 1744 calls 0.00 %
logical_not 7.266 ms 0.00416628440367 ms/call 1744 calls 0.00 %
SequenceReverse 7.196 ms 0.0765531914894 ms/call 94 calls 0.00 %
_div_scalar 6.596 ms 0.00378211009174 ms/call 1744 calls 0.00 %
tile 3.689 ms 0.0784893617021 ms/call 47 calls 0.00 %
SequenceLast 1.646 ms 0.0350212765957 ms/call 47 calls 0.00 %
_slice_assign_scalar 1.079 ms 0.0229574468085 ms/call 47 calls 0.00 %
broadcast_to 1.013 ms 0.0215531914894 ms/call 47 calls 0.00 %
argsort 0.761 ms 0.0161914893617 ms/call 47 calls 0.00 %
broadcast_not_equal 0.547 ms 0.0116382978723 ms/call 47 calls 0.00 %
_ones 0.395 ms 0.00840425531915 ms/call 47 calls 0.00 %
_unravel_index 0.327 ms 0.00695744680851 ms/call 47 calls 0.00 %
Cast 0.262 ms 0.00557446808511 ms/call 47 calls 0.00 %
zeros_like 0.227 ms 0.00241489361702 ms/call 94 calls 0.00 %
Total OP Time: 169075.73400000 ms
```
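
As a quick cross-check, the `log` speedup can be recomputed from the `log` rows and the `Total OP Time` lines of the two profile listings above. This is only a minimal sketch: the numbers are copied by hand from the listings, and the dictionary layout and helper names (`per_call_ms`, `speedup`) are illustrative, not part of MXNet or Sockeye.

```python
# Values copied from the `log` rows of the two profile listings above.
baseline_log = {"total_ms": 11476.426, "calls": 1744}  # official master
mkl_log = {"total_ms": 4144.673, "calls": 1744}        # with MKL Log

# Values copied from the "Total OP Time" lines of the same listings.
baseline_total_ms = 169938.427
mkl_total_ms = 169075.734


def per_call_ms(row):
    """Average time per operator call, in milliseconds."""
    return row["total_ms"] / row["calls"]


def speedup(before, after):
    """Ratio of per-call times: values above 1 mean `after` is faster."""
    return per_call_ms(before) / per_call_ms(after)


log_speedup = speedup(baseline_log, mkl_log)
log_saved_ms = baseline_log["total_ms"] - mkl_log["total_ms"]
total_saved_ms = baseline_total_ms - mkl_total_ms

print(f"log speedup: {log_speedup:.2f}x")
print(f"time saved in log: {log_saved_ms:.1f} ms")
print(f"change in total OP time: {total_saved_ms:.1f} ms")
```

With these two listings the ratio comes out at about 2.77x. The 2.67x quoted in the description corresponds to the rounded figures 11.5s and 4.3s.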
