[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

2020-05-11 Thread GitBox


kpuatamazon commented on issue #17980:
URL: 
https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-626563301


   @ChaiBapchya to be clear, here's how I am building the second option:
   ```bash
   export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
   unset LD_PRELOAD  # Technically this should be what exists in your environment by default
   rm -rf build
   mkdir build
   cd build
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   ninja -j 30
   ```
   Note that `cmake` does not appear to pick up changes to `CXXFLAGS` in an existing build directory, so it needs a fresh build directory (deleting the cache might work).  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

2020-05-04 Thread GitBox


kpuatamazon commented on issue #17980:
URL: 
https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-623737885


   DNNL has only a hidden, unsupported option to link against MKL; it will not link to MKL by default.  







[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

2020-04-13 Thread GitBox
kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected 
calls DNNL while dot and batch_dot call MKL
URL: 
https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-613009431
 
 
   In case somebody finds this issue and wants their optimized build, here is a 
different workaround that removes the need for `LD_PRELOAD`.  Just do this 
before running cmake the first time:
   ```bash
   export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
   ```
   Then `cmake` can be run normally:
   ```bash
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   ```
   and the compiled MXNet can be run normally without any special environment 
variables.  
   
   To be clear, the above kludge is an undocumented abomination.  The problem with `cmake -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include` is that it links against MKL twice.  DNNL is hard-coded to link dynamically against `libmkl_rt.so`:
   https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L64
   Then MXNet also links statically against MKL, and DNNL still wants the shared library at `dlopen` time.  So we should just remove the shared-library link.  How?  Bypass DNNL's build file and set the flags ourselves: the `CXXFLAGS` export above does by hand what the rest of the build logic does here:
   https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L65-L67


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

2020-04-13 Thread GitBox
kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected 
calls DNNL while dot and batch_dot call MKL
URL: 
https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-612874407
 
 
   @TaoLv Regarding your comment
   > how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of DNNL primitive kernel is good, compared with the output of the script. 
   
   I ran with both MKL and DNNL verbose options: `OMP_NUM_THREADS=1 MKL_VERBOSE=1 DNNL_VERBOSE=1`
   
   
[verbose.txt](https://github.com/apache/incubator-mxnet/files/4469425/verbose.txt)
   
   To summarize the verbose output in microseconds:
   | Shape | MKL raw | DNNL raw | DNNL raw slowdown | MKL Python overhead | DNNL Python overhead |
   | --- | --- | --- | --- | --- | --- |
   | 5, 512, 512 | 10.29 | 31.61 | 3.07x | 66.21 | 63.69 | 
   | 5, 512, 1536 | 17.27 | 90.26 | 5.22x | 68.23 | 78.44 | 
   | 5, 512, 2048 | 16.16 | 81.48 | 5.04x | 75.84 | 84.92 | 
   | 5, 2048, 512 | 16.84 | 97.20 | 5.77x | 77.46 | 91.70 | 
   | 4, 512, 512 | 8.74 | 15.16 | 1.73x | 69.26 | 79.84 | 
   
   
   Raw is the time reported by the verbose output; note I converted DNNL's milliseconds to microseconds.  
   Python overhead is the time reported by the above Python script (converted to microseconds) minus the raw time.  Note that MKL used `dot` whilst DNNL used `FullyConnected`.  
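   The slowdown column is just the ratio of the two raw kernel times.  A minimal sketch that recomputes it (the shape/time pairs are copied from the table above; truncating the ratio to two decimals rather than rounding is my assumption about how the column was produced):
   ```python
   # Raw GEMM kernel times in microseconds, copied from the table above:
   # shape -> (MKL raw, DNNL raw).
   raw_us = {
       (5, 512, 512):  (10.29, 31.61),
       (5, 512, 1536): (17.27, 90.26),
       (5, 512, 2048): (16.16, 81.48),
       (5, 2048, 512): (16.84, 97.20),
       (4, 512, 512):  (8.74, 15.16),
   }

   def slowdown(mkl_us, dnnl_us):
       # Truncate the ratio to two decimal places (assumed, not stated above).
       return int(dnnl_us / mkl_us * 100) / 100

   for shape, (mkl_us, dnnl_us) in raw_us.items():
       print("{}: {:.2f}x DNNL slowdown".format(shape, slowdown(mkl_us, dnnl_us)))
   ```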
   
   So we really have four issues:
   1. Raw speed as reported by MKL's verbose output is much faster than raw speed as reported by DNNL's verbose output.  This is definitely a speed problem with DNNL.  
   2. In my opinion, when I compile with `-DBLAS` set to something, it should use that BLAS.  You can add an `auto` option that picks according to the priorities on the website.  
   3. It's weird that `dot` calls MKL but `FullyConnected` calls DNNL given that they are both GEMMs.  I understand @pengzhao-intel is working on this.  Though unless 1 and 2 above are fixed, it will only make performance worse.  
   4. MXNet / Python overhead is larger for `FullyConnected` than for `dot`.  This is small compared to the DNNL vs MKL difference though.  We can also question why there is so much overhead relative to the cost of the multiply.  
   
   Intel should fix DNNL but that will take time.  
   
   MXNet should make `-DBLAS` actually choose the BLAS routine both to support 
a short-term bypass and because it better matches user expectations of what a 
compilation flag does.  




[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

2020-04-13 Thread GitBox
kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected 
calls DNNL while dot and batch_dot call MKL
URL: 
https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-612836744
 
 
   Single-threaded benchmarks (`OMP_NUM_THREADS=1`) to confirm that it's not a difference in the OMP library:
   | Shape | dot (MKL) | FullyConnected (MKL workaround) | FullyConnected (DNNL) | DNNL slowdown |
   | --- | --- | --- | --- | --- |
   | 5, 512, 512 | 0.914 | 0.0001016 | 0.0002448 | 2.40x |
   | 5, 512, 1536 | 0.0002939 | 0.0003051 | 0.0006576 | 2.15x |
   | 5, 512, 2048 | 0.0003828 | 0.0003919 | 0.0008582 | 2.18x |
   | 5, 2048, 512 | 0.0003681 | 0.0003785 | 0.0014638 | 3.86x |
   | 4, 512, 512 | 0.917 | 0.0001038 | 0.0002364 | 2.27x |
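   
   The slowdown column here is again the ratio of the two `FullyConnected` timings.  For example, for shape (5, 2048, 512) (values copied from the table; truncation to two decimals is an assumption on my part):
   ```python
   # FullyConnected times in seconds for shape (5, 2048, 512), copied from
   # the table above: MKL workaround vs. DNNL.
   mkl_fc_s = 0.0003785
   dnnl_fc_s = 0.0014638

   # Ratio truncated to two decimal places (assumed rounding mode).
   ratio = int(dnnl_fc_s / mkl_fc_s * 100) / 100
   print("{:.2f}x".format(ratio))  # 3.86x
   ```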
   




[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

2020-04-06 Thread GitBox
kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected 
calls DNNL while dot and batch_dot call MKL
URL: 
https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609906743
 
 
   > I don't think it's possible at this moment. 
   
   Can we please have a build option to force MKL usage for all GEMM calls?  This is a pretty big performance regression.  
   
   > I think the real question is how to improve the performance of FullyConnected when DNNL is used.
   
   Hmmm, when DNNL delegates to MKL, that happens internally to DNNL, right?  If the MXNet wrapper is the problem, then why is delegating to MKL much faster?
   
   Applying my workaround:
   ```bash
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
   ninja -j 30
   export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
   export OMP_NUM_THREADS=4
   ./mult_bench.py  # NB: I edited the script to label things more clearly.
   ```
   Output:
   ```
   Shape (5, 512, 512)
   0.628 seconds for fullyconnected (MKL workaround)
   0.488 seconds for dot (MKL)
   Shape (5, 512, 1536)
   0.816 seconds for fullyconnected (MKL workaround)
   0.661 seconds for dot (MKL)
   Shape (5, 512, 2048)
   0.0001109 seconds for fullyconnected (MKL workaround)
   0.957 seconds for dot (MKL)
   Shape (5, 2048, 512)
   0.0001078 seconds for fullyconnected (MKL workaround)
   0.954 seconds for dot (MKL)
   Shape (4, 512, 512)
   0.501 seconds for fullyconnected (MKL workaround)
   0.434 seconds for dot (MKL)
   ```
   There is something less efficient about `FullyConnected` than `dot`, but it's nowhere near enough to explain DNNL's slowdown.  
   
   FYI I usually wear this hat (as opposed to @kpu) on Mondays so expect slower 
responses on other days.  




[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

2020-04-06 Thread GitBox
kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected 
calls DNNL while dot and batch_dot call MKL
URL: 
https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609875547
 
 
   This still uses MKLDNN for `FullyConnected`:
   ```bash
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON ..
   ```
   I was able to get MKL to run by disabling MKLDNN entirely:
   ```bash
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON -DUSE_MKLDNN=OFF ..
   ```
   But then I lose other kernels like DNNL softmax, and Sockeye actually crashes randomly:
   ```
   mxnet.base.MXNetError: MXNetError: Out of range value for value, value='inf', in operator _full(name="", dtype="float32", value="inf", ctx="cpu(0)", shape="(5, 1)")
   ```
   (Bizarrely, this happens after it has translated many sentences with that same code.)
   
   How do I achieve the fastest combination of DNNL softmax and MKL matrix 
multiply for `FullyConnected` using only documented options?  




[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

2020-04-06 Thread GitBox
kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected 
calls DNNL while dot and batch_dot call MKL
URL: 
https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609859032
 
 
   I also tried the latest DNNL:
   ```bash
   cd 3rdparty
   rm -rf mkldnn
   git clone https://github.com/oneapi-src/oneDNN mkldnn
   cd ../build && ninja -j 32
   ```
   and DNNL is still slower:
   ```
   Shape (5, 512, 512)
   0.875 seconds for fullyconnected (DNNL)
   0.507 seconds for dot (MKL)
   Shape (5, 512, 1536)
   0.0002005 seconds for fullyconnected (DNNL)
   0.729 seconds for dot (MKL)
   Shape (5, 512, 2048)
   0.0002516 seconds for fullyconnected (DNNL)
   0.974 seconds for dot (MKL)
   Shape (5, 2048, 512)
   0.0003564 seconds for fullyconnected (DNNL)
   0.981 seconds for dot (MKL)
   Shape (4, 512, 512)
   0.928 seconds for fullyconnected (DNNL)
   0.497 seconds for dot (MKL)
   ```




[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

2020-04-06 Thread GitBox
kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected 
calls DNNL while dot and batch_dot call MKL
URL: 
https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609843360
 
 
   Here are some benchmarks on 2fff11d4233814aa4ad07858779338090ec2132d (current tip of master) with the same Skylake c5.9xlarge.  DNNL is substantially slower than MKL. 
   
   (Which DNNL is in master?)
   
   ```python
   #!/usr/bin/env python3
   import mxnet as mx
   import time

   def time_procedure(shape, count, proc):
       rows, inner, cols = shape
       a = mx.nd.random_uniform(shape=(rows, inner), low=-1.0, high=1.0)
       b = mx.nd.random_uniform(shape=(cols, inner), low=-1.0, high=1.0)
       # Burn in
       proc(a, b, cols)
       mx.nd.waitall()
       begin = time.time()
       for i in range(0, count):
           proc(a, b, cols)
           mx.nd.waitall()
       return (time.time() - begin) / count

   shapes = [(5, 512, 512), (5, 512, 1536), (5, 512, 2048), (5, 2048, 512), (4, 512, 512)]
   count = 1000

   procedures = {
       "fullyconnected (DNNL)": (lambda a, b, cols: mx.nd.FullyConnected(a, b, no_bias=True, num_hidden=cols)),
       "dot (MKL)": (lambda a, b, cols: mx.nd.dot(a, b, transpose_b=True))
   }
   for s in shapes:
       print("Shape " + str(s))
       stats = {}
       for name, l in procedures.items():
           stats[name] = time_procedure(s, count, l)
           print("{:.7f} seconds for {}".format(stats[name], name))
   ```
   Run as `OMP_NUM_THREADS=4 ./mult_bench.py`:
   ```
   Shape (5, 512, 512)
   0.961 seconds for fullyconnected (DNNL)
   0.509 seconds for dot (MKL)
   Shape (5, 512, 1536)
   0.0002011 seconds for fullyconnected (DNNL)
   0.735 seconds for dot (MKL)
   Shape (5, 512, 2048)
   0.0002521 seconds for fullyconnected (DNNL)
   0.0001027 seconds for dot (MKL)
   Shape (5, 2048, 512)
   0.0003569 seconds for fullyconnected (DNNL)
   0.0001018 seconds for dot (MKL)
   Shape (4, 512, 512)
   0.946 seconds for fullyconnected (DNNL)
   0.496 seconds for dot (MKL)
   ```
   
   I don't really mind what the default BLAS implementation is.  But choosing MKL should not require undocumented compile options.  

