kpuatamazon opened a new issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL
URL: https://github.com/apache/incubator-mxnet/issues/17980

# Problem

Not sure how much we care about MKL support, but to the extent it still appears in the build system, operator support should be consistent. When compiled with MKL present (MKL is found in `/opt/intel`), MXNet calls MKL for `dot` and `batch_dot` but DNNL for `fully_connected`. These are all GEMM operators; why is the dispatch inconsistent? This makes Sockeye decoding 22% slower (see below) unless a workaround (also below) is used to force the use of MKL.

This inconsistency did not matter much in MXNet 1.5.0 because MKLDNN would delegate to MKL. However, aa1074dc1704d3732ab205c43d48083ef8c69680 upgraded to MKLDNN 1.0, which hid MKLDNN's ability to delegate to MKL: https://github.com/oneapi-src/oneDNN/commit/304915096d1def19999b963a60569ec46a882c16 . (MKLDNN has since been renamed DNNL.) Since MKLDNN only hid the delegation support rather than removing it, it is possible to restore delegation (see the workaround below).

# Testing

Tested with MXNet cfb474ba743d5ea85161bf19875488f4cb409d3c, compiled with mostly-default cmake settings:

```bash
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
```

Then when I run:

```
export MKL_VERBOSE=1
export MKLDNN_VERBOSE=1
python3
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
MKL_VERBOSE Intel(R) MKL 2019.0 Product build 20180829 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 3.00GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x1a0fdc0,1,0x1a0fdc0,1) 1.47ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:24
>>> a = mx.nd.ones(shape=(2,2))
>>> mx.nd.FullyConnected(a,a,num_hidden=2,no_bias=True)
dnnl_verbose,info,DNNL v1.1.2 (commit cb2cc7ac17ff4e2ef50805c7048d33256d82be4d)
dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb2ic2oc2,74.9971

[[2. 2.]
 [2. 2.]]
<NDArray 2x2 @cpu(0)>
>>> a = mx.nd.ones(shape=(2,2,2))
>>> mx.nd.batch_dot(a,a)
MKL_VERBOSE SGEMM_BATCH(N,N,0x7fc3238b809c,0x7fc3238b80a0,0x7fc3238b80a4,0x7fc3238b80b4,0x7fc228010b90,0x7fc3238b80a8,0x7fc22800f770,0x7fc3238b80ac,0x7fc3238b80b8,0x7fc2280190e0,0x7fc3238b80b0,0x7fc3238b7fc8,0x7 363.79us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:24

[[[2. 2.]
  [2. 2.]]

 [[2. 2.]
  [2. 2.]]]
>>> mx.nd.dot(a,a)
MKL_VERBOSE SGEMM(N,N,4,4,2,0x7fc3238b8198,0x7fc2280043c0,4,0x7fc2280043c0,2,0x7fc3238b81a0,0x7fc228004580,4) 8.52us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:24

[[[[2. 2.]
   [2. 2.]]

  [[2. 2.]
   [2. 2.]]]

 [[[2. 2.]
   [2. 2.]]

  [[2. 2.]
   [2. 2.]]]]
<NDArray 2x2x2x2 @cpu(0)>
```

You can see DNNL is called for `FullyConnected` while MKL is called for `dot` and `batch_dot`.

# Performance impact

I timed Sockeye decoding. Commit https://github.com/apache/incubator-mxnet/commit/aa1074dc1704d3732ab205c43d48083ef8c69680 made decoding 22% slower (416.878s, up from 342.037s for b5d07e30321da47d604b99048c1b57c03ec819b0) even with MKL installed in `/opt/intel/`.

| Commit | Compilation | Time (s) |
| --- | --- | --- |
| b5d07e3 (before MKLDNN 1.0 change) | Default | 342.037 |
| aa1074d (MKLDNN 1.0 change) | Default | 416.878 |
| aa1074d (MKLDNN 1.0 change) | Workaround | 343.706 |
| cfb474ba (recent) | Default | 385.587 |
| cfb474ba (recent) | Workaround | 312.509 |

(Default compilation is `cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..`; the workaround compilation is below.)

# Workaround

Since DNNL only hid its support for delegating to MKL, it is still possible to turn delegation back on:

```bash
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
```

This compiles but then triggers a link error at runtime: `OSError: /home/ubuntu/mxnet/build/3rdparty/mkldnn/src/libmkldnn.so.1: undefined symbol: cblas_gemm_s8u8s32_pack`

So I kludged it with `export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so` and was then able to use MXNet at runtime. There's probably a cleaner way of fixing the linkage.

# Recommended fix

When compiled with MKL, MXNet should call MKL directly from `FullyConnected`, as it already does for `dot` and `batch_dot`.
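As a quick sanity check that the dispatch difference is visible at the operator level, here is a minimal timing sketch comparing `FullyConnected` (DNNL inner-product path) against an equivalent `dot` (MKL SGEMM path) on the same problem size. The 1024-cubed GEMM shape, warmup, and iteration counts are arbitrary illustrative assumptions; this is not how the Sockeye numbers above were measured.

```python
# Hypothetical micro-benchmark: time FullyConnected (DNNL path) vs an equivalent dot (MKL path).
# Shapes and iteration counts are arbitrary illustrative choices.
import time
import mxnet as mx

m, k, n = 1024, 1024, 1024                       # assumed GEMM problem size
data = mx.nd.random.uniform(shape=(m, k))
weight = mx.nd.random.uniform(shape=(n, k))      # FullyConnected expects weight of shape (num_hidden, k)

def bench(fn, warmup=5, iters=50):
    for _ in range(warmup):
        fn().wait_to_read()                      # force execution outside the timed loop
    start = time.time()
    for _ in range(iters):
        fn().wait_to_read()
    return (time.time() - start) / iters

fc = lambda: mx.nd.FullyConnected(data, weight, num_hidden=n, no_bias=True)   # data @ weight.T
dt = lambda: mx.nd.dot(data, weight, transpose_b=True)                        # same GEMM via dot

print("FullyConnected: %.3f ms" % (bench(fc) * 1e3))
print("dot           : %.3f ms" % (bench(dt) * 1e3))
```

Running this with `MKL_VERBOSE=1 MKLDNN_VERBOSE=1` confirms which library each call hits, matching the session in the Testing section above.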
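For completeness, here is the workaround gathered into one place as a sketch; it assumes MKL is installed under `/opt/intel` and that MXNet is built in a `build` directory, so adjust paths to your setup.

```bash
# Assumed consolidation of the workaround steps described above.
cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release \
      -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
ninja

# Kludge around the missing cblas_gemm_s8u8s32_pack symbol by preloading MKL.
export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so

# Check whether FullyConnected now reports an MKL GEMM call (expected if delegation is active).
MKL_VERBOSE=1 MKLDNN_VERBOSE=1 python3 -c \
  "import mxnet as mx; a = mx.nd.ones((2, 2)); print(mx.nd.FullyConnected(a, a, num_hidden=2, no_bias=True))"
```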
