kpuatamazon opened a new issue #17980: When compiled with MKL, fully_connected 
calls DNNL while dot and batch_dot call MKL
URL: https://github.com/apache/incubator-mxnet/issues/17980
 
 
   # Problem
   I'm not sure how much we care about MKL support, but to the extent it still 
appears in the build system, operator support should be consistent.  
   
   When compiled with MKL present (MKL is found in `/opt/intel`), MXNet calls 
MKL for `dot` and `batch_dot` but DNNL for `fully_connected`.  These are all 
GEMM operators; why is the backend inconsistent?  The inconsistency makes 
Sockeye decoding 22% slower (see Performance impact below) unless the 
workaround below is applied to force use of MKL.  
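   
   To make the equivalence concrete, here is a minimal sketch (hypothetical 
shapes, not Sockeye's): `FullyConnected` with `no_bias=True` computes 
`dot(data, weight.T)`, i.e. the same single GEMM that `dot` dispatches to, so 
there is no mathematical reason for the operators to take different backends.
   ```python
   import mxnet as mx

   # FullyConnected(data, weight, no_bias=True) computes dot(data, weight.T),
   # i.e. one SGEMM -- the same primitive dot and batch_dot dispatch to.
   data = mx.nd.random.uniform(shape=(4, 8))
   weight = mx.nd.random.uniform(shape=(2, 8))  # num_hidden=2

   fc = mx.nd.FullyConnected(data, weight, num_hidden=2, no_bias=True)
   ref = mx.nd.dot(data, weight, transpose_b=True)

   print((fc - ref).abs().max())  # ~0, equal up to floating-point rounding
   ```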
   
   This inconsistency did not matter much in MXNet 1.5.0 because MKLDNN would 
delegate to MKL.  However, aa1074dc1704d3732ab205c43d48083ef8c69680 upgraded to 
MKLDNN 1.0, which hid MKLDNN's ability to delegate to MKL: 
https://github.com/oneapi-src/oneDNN/commit/304915096d1def19999b963a60569ec46a882c16. 
(MKLDNN has since been renamed DNNL.)
   
   Since MKLDNN only hid support for delegating to MKL rather than removing 
it, it's possible to restore delegation (see Workaround below).  
   
   # Testing 
   Tested with MXNet cfb474ba743d5ea85161bf19875488f4cb409d3c.  Compiled with 
mostly-default cmake settings:
   ```bash
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   ```
   
   Then I run:
   ```
   export MKL_VERBOSE=1
   export MKLDNN_VERBOSE=1
   python3
   Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
   [GCC 8.3.0] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import mxnet as mx
   Numpy + Intel(R) MKL: THREADING LAYER: (null)
   Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
   Numpy + Intel(R) MKL: preloading libiomp5.so runtime
   MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 3.00GHz lp64 intel_thread
   MKL_VERBOSE SDOT(2,0x1a0fdc0,1,0x1a0fdc0,1) 1.47ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24
   >>> a = mx.nd.ones(shape=(2,2))
   >>> mx.nd.FullyConnected(a,a,num_hidden=2,no_bias=True)
   dnnl_verbose,info,DNNL v1.1.2 (commit cb2cc7ac17ff4e2ef50805c7048d33256d82be4d)
   dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb2ic2oc2,74.9971
   
   [[2. 2.]
    [2. 2.]]
   <NDArray 2x2 @cpu(0)>
   >>> a = mx.nd.ones(shape=(2,2,2))
   >>> mx.nd.batch_dot(a,a)
   MKL_VERBOSE SGEMM_BATCH(N,N,0x7fc3238b809c,0x7fc3238b80a0,0x7fc3238b80a4,0x7fc3238b80b4,0x7fc228010b90,0x7fc3238b80a8,0x7fc22800f770,0x7fc3238b80ac,0x7fc3238b80b8,0x7fc2280190e0,0x7fc3238b80b0,0x7fc3238b7fc8,0x7 363.79us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24
   
   [[[2. 2.]
     [2. 2.]]
   
    [[2. 2.]
     [2. 2.]]]
   >>> mx.nd.dot(a,a)
   MKL_VERBOSE SGEMM(N,N,4,4,2,0x7fc3238b8198,0x7fc2280043c0,4,0x7fc2280043c0,2,0x7fc3238b81a0,0x7fc228004580,4) 8.52us CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:24
   
   [[[[2. 2.]
      [2. 2.]]
   
     [[2. 2.]
      [2. 2.]]]
   
   
    [[[2. 2.]
      [2. 2.]]
   
     [[2. 2.]
      [2. 2.]]]]
   <NDArray 2x2x2x2 @cpu(0)>
   ```
   You can see DNNL is called for `FullyConnected` while MKL is called for 
`dot` and `batch_dot`.  
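   
   For a non-interactive reproduction, here is roughly the same check as a 
script (a sketch assuming the same MKL-enabled build; the verbose variables 
must be set before the libraries load):
   ```python
   import os

   # The verbose flags must be in the environment before MKL/DNNL are loaded.
   os.environ["MKL_VERBOSE"] = "1"
   os.environ["MKLDNN_VERBOSE"] = "1"

   import mxnet as mx

   a = mx.nd.ones(shape=(2, 2))
   # Logs dnnl_verbose,exec,cpu,inner_product,... (the DNNL path).
   mx.nd.FullyConnected(a, a, num_hidden=2, no_bias=True).wait_to_read()

   b = mx.nd.ones(shape=(2, 2, 2))
   mx.nd.batch_dot(b, b).wait_to_read()  # logs MKL_VERBOSE SGEMM_BATCH (MKL)
   mx.nd.dot(b, b).wait_to_read()        # logs MKL_VERBOSE SGEMM (MKL)
   ```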
   
   # Performance impact
   I timed Sockeye decoding.  Commit 
https://github.com/apache/incubator-mxnet/commit/aa1074dc1704d3732ab205c43d48083ef8c69680
 made decoding 22% slower (416.878s up from 342.037s for 
b5d07e30321da47d604b99048c1b57c03ec819b0) even with MKL installed in 
`/opt/intel/`.  
   
   | Commit | Compilation | Time (s) |
   | --- | --- | --- |
   | b5d07e3 (before MKLDNN 1.0 change) | Default | 342.037 |
   | aa1074d (MKLDNN 1.0 change) | Default | 416.878 |
   | aa1074d (MKLDNN 1.0 change) | Workaround | 343.706 |
   | cfb474b (recent) | Default | 385.587 |
   | cfb474b (recent) | Workaround | 312.509 |
   
   (Default compilation is `cmake -GNinja -DUSE_CUDA=OFF 
-DCMAKE_BUILD_TYPE=Release ..`; workaround compilation is below.)
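   
   The numbers above are end-to-end Sockeye decoding.  For a per-operator 
view, here is a rough micro-benchmark sketch (hypothetical shapes, not the 
Sockeye workload) comparing the DNNL `FullyConnected` path against the MKL 
`dot` path on the same matrices:
   ```python
   import time
   import mxnet as mx

   data = mx.nd.random.uniform(shape=(64, 1024))
   weight = mx.nd.random.uniform(shape=(1024, 1024))

   def bench(fn, iters=1000):
       fn().wait_to_read()  # warm-up
       start = time.time()
       for _ in range(iters):
           fn()
       mx.nd.waitall()      # drain the async engine before stopping the clock
       return (time.time() - start) / iters

   print("FullyConnected (DNNL):", bench(lambda: mx.nd.FullyConnected(
       data, weight, num_hidden=1024, no_bias=True)))
   print("dot (MKL):            ", bench(lambda: mx.nd.dot(
       data, weight, transpose_b=True)))
   ```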
   
   # Workaround
   Because DNNL only hid, rather than removed, support for delegating to MKL, 
delegation can be turned back on at build time:
   ```bash
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
   ```
   This compiles, but triggers a link error at runtime: `OSError: 
/home/ubuntu/mxnet/build/3rdparty/mkldnn/src/libmkldnn.so.1: undefined symbol: 
cblas_gemm_s8u8s32_pack`.  I kludged around it with `export 
LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so` and was then able to use 
MXNet at runtime.  There's probably a cleaner way of fixing the linkage.  
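   
   To confirm delegation is back on, the verbose check from above should now 
show MKL `SGEMM` lines for `FullyConnected` as well (a sketch assuming the 
workaround build and the `LD_PRELOAD` above are in effect when Python starts):
   ```python
   import os
   os.environ["MKL_VERBOSE"] = "1"

   import mxnet as mx

   a = mx.nd.ones(shape=(2, 2))
   # With _DNNL_USE_MKL=FULL, this should additionally emit MKL_VERBOSE SGEMM
   # lines, indicating DNNL delegated the GEMM to MKL.
   mx.nd.FullyConnected(a, a, num_hidden=2, no_bias=True).wait_to_read()
   ```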
   
   # Recommended fix
   When compiled with MKL, MXNet should call MKL directly from `FullyConnected` 
like it already does for `dot` and `batch_dot`.  
