kpuatamazon edited a comment on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-612874407
 
 
   @TaoLv Regarding your comment
   > how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of DNNL primitive kernel is good, compared with the output of the script.
   
   I ran with both MKL and DNNL verbose options enabled: `OMP_NUM_THREADS=1 MKL_VERBOSE=1 DNNL_VERBOSE=1`
   
   
[verbose.txt](https://github.com/apache/incubator-mxnet/files/4469425/verbose.txt)
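   The Python script itself isn't reproduced in this comment, so as a hedge: a minimal sketch of this style of benchmark (shapes taken from the table below; everything else is my assumption, not the actual script) might look like:
   
   ```python
   # Minimal sketch -- NOT the exact script used. Shapes come from the table
   # below; all other details are assumed. Run under the same environment:
   #   OMP_NUM_THREADS=1 MKL_VERBOSE=1 DNNL_VERBOSE=1 python bench.py
   import time
   import mxnet as mx
   
   shapes = [(5, 512, 512), (5, 512, 1536), (5, 512, 2048),
             (5, 2048, 512), (4, 512, 512)]
   
   def bench(fn, reps=10):
       times = []
       for _ in range(reps):
           start = time.perf_counter()
           fn().wait_to_read()          # block until the async engine finishes
           times.append(time.perf_counter() - start)
       return 1e6 * sum(times[1:]) / (reps - 1)  # drop first call (burn-in), in us
   
   for m, k, n in shapes:
       x = mx.nd.random.uniform(shape=(m, k))
       w = mx.nd.random.uniform(shape=(n, k))    # FullyConnected weight layout
       wt = w.T                                  # transpose outside the timed loop
       dot_us = bench(lambda: mx.nd.dot(x, wt))  # dispatches to the BLAS (MKL here)
       fc_us = bench(lambda: mx.nd.FullyConnected(x, w, no_bias=True,
                                                  num_hidden=n))  # goes to DNNL
       print((m, k, n), "dot: %.2f us  FullyConnected: %.2f us" % (dot_us, fc_us))
   ```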
   
   To summarize the verbose output:
   | Shape | MKL raw (µs) | DNNL raw (µs) | DNNL raw slowdown | MKL Python overhead (µs) | DNNL Python overhead (µs) |
   | --- | --- | --- | --- | --- | --- |
   | 5, 512, 512 | 10.29 | 31.61 | 3.07x | 66.21 | 63.69 |
   | 5, 512, 1536 | 17.27 | 90.26 | 5.22x | 68.23 | 78.44 |
   | 5, 512, 2048 | 16.16 | 81.48 | 5.04x | 75.84 | 84.92 |
   | 5, 2048, 512 | 16.84 | 97.20 | 5.77x | 77.46 | 91.70 |
   | 4, 512, 512 | 8.74 | 15.16 | 1.73x | 69.26 | 79.84 |
   
   
   Raw is the time reported in the verbose output; note that I converted DNNL's milliseconds to microseconds.
   Python overhead is the time reported by the above Python script (converted to microseconds) minus the raw time. Note that MKL used `dot` whilst DNNL used `FullyConnected`.
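   As a sanity check on the arithmetic (the measured total here is implied by raw + overhead in the first table row, not a separate data point):
   
   ```python
   # Worked example for shape (5, 512, 512), MKL columns:
   # overhead = end-to-end Python time - raw kernel time from MKL_VERBOSE.
   measured_us = 76.50          # implied total = raw + overhead
   raw_us = 10.29               # from the MKL_VERBOSE line
   print(measured_us - raw_us)  # 66.21, the "MKL Python overhead" entry
   ```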
   
   In both cases, I excluded the first call for each shape to allow for 
JIT/burn-in.  
   
   So we really have four issues:
   1. The raw kernel time reported by MKL's verbose output is much lower than the raw time reported by DNNL's verbose output. This is definitely a speed problem in DNNL.
   2. In my opinion, when I compile with `-DBLAS` set to a specific library, it should use that BLAS. You can add an `auto` option that picks according to the priorities on the website.
   3. It's weird that `dot` calls MKL but `FullyConnected` calls DNNL, given that they are both GEMMs (see the sketch after this list). I understand @pengzhao-intel is working on this, though unless issues 1 and 2 above are fixed, it will only make performance worse.
   4. MXNet / Python overhead is larger for `FullyConnected` than for `dot`. This is small compared to the DNNL-vs-MKL difference, though. We can also question why there is so much overhead at all relative to the cost of the multiply itself.
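   The premise of point 3, that `FullyConnected` without a bias is the same GEMM as `dot`, is easy to check (a sketch, not a claim about how MXNet dispatches internally):
   
   ```python
   # With no_bias=True, FullyConnected(x, w) computes x . w^T, i.e. one GEMM.
   import mxnet as mx
   x = mx.nd.random.uniform(shape=(5, 512))
   w = mx.nd.random.uniform(shape=(512, 512))  # (num_hidden, input_dim)
   fc = mx.nd.FullyConnected(x, w, no_bias=True, num_hidden=512)
   dt = mx.nd.dot(x, w.T)
   print(mx.nd.abs(fc - dt).max())             # ~0, up to float rounding
   ```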
   
   Intel should fix DNNL, but that will take time.
   
   MXNet should make `-DBLAS` actually choose the BLAS routine, both to provide a short-term bypass and because doing so better matches user expectations of what a compilation flag does.
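   For concreteness, the build configuration in question is roughly the following (a sketch; the exact CMake option spellings are my assumption and may differ by MXNet version):
   
   ```bash
   # BLAS is pinned to MKL at configure time, yet FullyConnected still
   # dispatches to DNNL regardless of the BLAS chosen here.
   cmake -DBLAS=mkl -DUSE_MKLDNN=ON ..
   ```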
