kpuatamazon edited a comment on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-612874407

@TaoLv Regarding your comment

> how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of DNNL primitive kernel is good, compared with the output of the script.

I ran with both MKL and DNNL verbose options: `OMP_NUM_THREADS=1 MKL_VERBOSE=1 DNNL_VERBOSE=1`

[verbose.txt](https://github.com/apache/incubator-mxnet/files/4469425/verbose.txt)

To summarize the verbose output in microseconds:

| Shape | MKL raw (µs) | DNNL raw (µs) | DNNL raw slowdown | MKL Python overhead (µs) | DNNL Python overhead (µs) |
| --- | --- | --- | --- | --- | --- |
| 5, 512, 512 | 10.29 | 31.61 | 3.07x | 66.21 | 63.69 |
| 5, 512, 1536 | 17.27 | 90.26 | 5.22x | 68.23 | 78.44 |
| 5, 512, 2048 | 16.16 | 81.48 | 5.04x | 75.84 | 84.92 |
| 5, 2048, 512 | 16.84 | 97.20 | 5.77x | 77.46 | 91.70 |
| 4, 512, 512 | 8.74 | 15.16 | 1.73x | 69.26 | 79.84 |

"Raw" is the time reported by the verbose output; note that I converted DNNL's milliseconds to microseconds. "Python overhead" is the time reported by the Python script above (converted to microseconds) minus the raw time. Note that MKL used `dot` whilst DNNL used `FullyConnected`. In both cases, I excluded the first call for each shape to allow for JIT compilation and warm-up.

So we really have four issues:

1. Raw speed as reported by MKL's verbose output is much faster than raw speed reported by DNNL's verbose output. This is definitely a speed problem in DNNL.
2. In my opinion, when I compile with `-DBLAS` set to something, it should use that BLAS. You could add an `auto` option that picks according to the priorities on the website.
3. It's odd that `dot` calls MKL while `FullyConnected` calls DNNL, given that both are GEMMs. I understand @pengzhao-intel is working on this, though unless issues 1 and 2 above are fixed, it will only make performance worse.
4. MXNet/Python overhead is larger for `FullyConnected` than for `dot`. This is small compared with the DNNL-vs-MKL difference, though we can also ask why there is so much overhead relative to the cost of the multiply itself.

Intel should fix DNNL, but that will take time. MXNet should make `-DBLAS` actually choose the BLAS routine, both to provide a short-term bypass and because it better matches user expectations of what a compilation flag does.
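For anyone wanting to reproduce the comparison, below is a minimal sketch of how such a benchmark could be run. It is not the exact script referenced above; the shapes are interpreted as (rows, inner, cols) GEMM dimensions, the `time_op` helper and the repeat counts are my own choices, and the per-call times include MXNet's dispatch overhead (the "Python overhead" column), while the raw kernel times come from the MKL/DNNL verbose logs printed to stderr.

```python
import os
# Verbose output must be enabled before MKL / DNNL are loaded,
# so set the environment before importing mxnet.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_VERBOSE"] = "1"
os.environ["DNNL_VERBOSE"] = "1"

import time
import mxnet as mx

def time_op(fn, warmup=1, repeat=100):
    """Mean wall-clock time per call in microseconds, excluding the
    first `warmup` calls to allow for JIT compilation and warm-up."""
    for _ in range(warmup):
        fn()
        mx.nd.waitall()
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    mx.nd.waitall()
    return (time.perf_counter() - start) / repeat * 1e6

# Shapes from the table above, read as (rows, inner, cols) -- an assumption.
shapes = [(5, 512, 512), (5, 512, 1536), (5, 512, 2048),
          (5, 2048, 512), (4, 512, 512)]

for m, k, n in shapes:
    a = mx.nd.random.uniform(shape=(m, k))
    b = mx.nd.random.uniform(shape=(k, n))
    w = mx.nd.random.uniform(shape=(n, k))  # FullyConnected weight is (num_hidden, k)
    mx.nd.waitall()

    # In an MKL build, dot goes through the BLAS GEMM path,
    # while FullyConnected goes through a DNNL inner-product primitive.
    t_dot = time_op(lambda: mx.nd.dot(a, b))
    t_fc = time_op(lambda: mx.nd.FullyConnected(a, w, no_bias=True, num_hidden=n))
    print(f"({m}, {k}, {n}): dot {t_dot:.2f} us, FullyConnected {t_fc:.2f} us")
```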
