[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL
kpuatamazon commented on issue #17980: URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-626563301

@ChaiBapchya to be clear, here's how I am building the second option:
```bash
export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
unset LD_PRELOAD  # Technically this should be what exists in your environment by default
rm -rf build
mkdir build
cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
ninja -j 30
```
Note that `cmake` does not appear to pick up changes to `CXXFLAGS` on reconfigure, so it needs a fresh build directory (deleting the cache might work).
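If a full `rm -rf build` is too heavy, the "deleting the cache might work" route would look like this (untested, as noted above):
```bash
# Untested alternative to wiping the whole build directory: delete only
# CMake's cached configuration so it re-reads CXXFLAGS on reconfigure.
rm -f build/CMakeCache.txt
cd build
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
ninja -j 30
```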
[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL
kpuatamazon commented on issue #17980: URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-623737885

DNNL has only a hidden, unsupported option to link to MKL. It will not link to MKL by default.
[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL
kpuatamazon commented on issue #17980: URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-613009431

In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for `LD_PRELOAD`. Just do this before running cmake the first time:
```bash
export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
```
Then `cmake` can be run normally:
```bash
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
```
and the compiled MXNet can be run normally without any special environment variables.

To be clear, the above kludge is an undocumented abomination.

The problem with `cmake -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include` is that it links against MKL twice. DNNL is hard-coded to link dynamically against `libmkl_rt.so`:
https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L64
Then MXNet also links statically against MKL, and it still wants the shared library via `dlopen`.

So we should just remove the shared-library link. How? Bypass DNNL's `MKL.cmake` entirely and implement by hand, in `CXXFLAGS`, what the rest of that build file does:
https://github.com/oneapi-src/oneDNN/blob/1b05a28eb9666efef83b281e4cc1936db5e6cf6c/cmake/MKL.cmake#L65-L67
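To confirm which library actually serves a given operator after building this way, the verbose switches used elsewhere in this thread are handy (`mult_bench.py` is the benchmark script from the comments below):
```bash
# MKL_VERBOSE prints a line per MKL call; DNNL_VERBOSE prints a line per
# DNNL primitive execution, so the two backends are easy to tell apart.
OMP_NUM_THREADS=1 MKL_VERBOSE=1 DNNL_VERBOSE=1 ./mult_bench.py
```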
[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL
kpuatamazon commented on issue #17980: URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-612874407

@TaoLv Regarding your comment

> how to improve the performance of FullyConnected when DNNL is used. In fact, I see the performance of DNNL primitive kernel is good, compared with the output of the script.

I ran with both MKL and DNNL verbose options: `OMP_NUM_THREADS=1 MKL_VERBOSE=1 DNNL_VERBOSE=1` [verbose.txt](https://github.com/apache/incubator-mxnet/files/4469425/verbose.txt)

To summarize the verbose output in microseconds:

| Shape | MKL raw (µs) | DNNL raw (µs) | DNNL raw slowdown | MKL Python overhead (µs) | DNNL Python overhead (µs) |
| --- | --- | --- | --- | --- | --- |
| 5, 512, 512 | 10.29 | 31.61 | 3.07x | 66.21 | 63.69 |
| 5, 512, 1536 | 17.27 | 90.26 | 5.22x | 68.23 | 78.44 |
| 5, 512, 2048 | 16.16 | 81.48 | 5.04x | 75.84 | 84.92 |
| 5, 2048, 512 | 16.84 | 97.20 | 5.77x | 77.46 | 91.70 |
| 4, 512, 512 | 8.74 | 15.16 | 1.73x | 69.26 | 79.84 |

Raw is the time reported by the verbose output; note I converted DNNL's milliseconds to microseconds. Python overhead is the time reported by the Python benchmark script (converted to microseconds) minus raw time. Note that MKL used `dot` whilst DNNL used `FullyConnected`.

So we really have four issues:

1. Raw speed as reported by MKL's verbose output is much faster than raw speed reported by DNNL's verbose output. This is definitely a speed problem with DNNL.
2. In my opinion, when I compile with `-DBLAS` set to something, it should use that BLAS. You can add an `auto` option that picks according to the priorities on the website.
3. It's weird that `dot` calls MKL but `FullyConnected` calls DNNL given that they are both GEMMs. I understand @pengzhao-intel is working on this. Though unless 1 and 2 above are fixed, it will only make performance worse.
4. MXNet / Python overhead is larger for `FullyConnected` than `dot`, though this is small compared to the DNNL vs MKL difference. We can also question why there is so much overhead relative to the cost of the multiply.

Intel should fix DNNL, but that will take time. MXNet should make `-DBLAS` actually choose the BLAS routine, both to support a short-term bypass and because it better matches user expectations of what a compilation flag does.
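To make the overhead bookkeeping concrete, here is the arithmetic for the first row as I read it (a sketch; the 76.50 µs wall-clock total and the 0.03161 ms DNNL figure are back-derived from the table, not quoted from verbose.txt):
```python
# Overhead = wall time per call measured by the Python benchmark,
# minus the kernel time the library reports in its verbose log.
mkl_raw_us = 10.29                   # sgemm time from MKL_VERBOSE=1
python_total_us = 76.50              # back-derived: raw + overhead, shape (5, 512, 512)
print(python_total_us - mkl_raw_us)  # 66.21 -> "MKL Python overhead" column

dnnl_raw_ms = 0.03161                # DNNL_VERBOSE reports milliseconds...
print(dnnl_raw_ms * 1000)            # 31.61 -> "...converted to microseconds"
```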
[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL
kpuatamazon commented on issue #17980: URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-612836744

Single-threaded benchmarks (`OMP_NUM_THREADS=1`) to confirm that it's not a difference in OMP library (times in seconds):

| Shape | dot (MKL) | FullyConnected (MKL workaround) | FullyConnected (DNNL) | DNNL slowdown |
| --- | --- | --- | --- | --- |
| 5, 512, 512 | 0.0000914 | 0.0001016 | 0.0002448 | 2.40x |
| 5, 512, 1536 | 0.0002939 | 0.0003051 | 0.0006576 | 2.15x |
| 5, 512, 2048 | 0.0003828 | 0.0003919 | 0.0008582 | 2.18x |
| 5, 2048, 512 | 0.0003681 | 0.0003785 | 0.0014638 | 3.86x |
| 4, 512, 512 | 0.0000917 | 0.0001038 | 0.0002364 | 2.27x |
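The slowdown column matches FullyConnected (DNNL) divided by FullyConnected (MKL workaround), truncated to two decimals; this reading of the table is my inference, but it checks out against every row:
```python
# DNNL slowdown = DNNL time / MKL-workaround time, truncated to 2 decimals.
import math
dnnl = [0.0002448, 0.0006576, 0.0008582, 0.0014638, 0.0002364]
workaround = [0.0001016, 0.0003051, 0.0003919, 0.0003785, 0.0001038]
for d, w in zip(dnnl, workaround):
    print("{:.2f}x".format(math.floor(d / w * 100) / 100))
# 2.40x, 2.15x, 2.18x, 3.86x, 2.27x -- matching the table
```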
[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL
kpuatamazon commented on issue #17980: URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609906743

> I don't think it's possible at this moment.

Can we please have a build option to force MKL usage for all GEMM calls? This is a pretty big performance regression.

> I think the real question is how to improve the performance of FullyConnected when DNNL is used.

Hmmm, when DNNL delegates to MKL, that happens internally to DNNL, right? If the MXNet wrapper is the problem, then why is delegating to MKL much faster?

Applying my workaround:
```
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
ninja -j 30
export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
export OMP_NUM_THREADS=4
./mult_bench.py #NB I edited the script to label things more clearly.
```
Output:
```
Shape (5, 512, 512)
0.0000628 seconds for fullyconnected (MKL workaround)
0.0000488 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0000816 seconds for fullyconnected (MKL workaround)
0.0000661 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0001109 seconds for fullyconnected (MKL workaround)
0.0000957 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0001078 seconds for fullyconnected (MKL workaround)
0.0000954 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000501 seconds for fullyconnected (MKL workaround)
0.0000434 seconds for dot (MKL)
```
There is something less efficient about `FullyConnected` than `dot`, but it's nowhere near enough to explain DNNL's slowdown.

FYI I usually wear this hat (as opposed to @kpu) on Mondays, so expect slower responses on other days.
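To see the double linking directly, one can inspect the binary's dynamic dependencies (my suggestion, not from the thread; the path to `libmxnet.so` depends on your build tree):
```bash
# With -D_DNNL_USE_MKL=FULL, libmkl_rt.so should show up as a dynamic
# dependency; with the CXXFLAGS workaround it should not.
ldd build/libmxnet.so | grep -i mkl
```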
[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL
kpuatamazon commented on issue #17980: URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609875547

This still uses MKLDNN for `FullyConnected`:
```
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON ..
```
I was able to get MKL to run by disabling MKLDNN entirely:
```
cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DBLAS=MKL -DUSE_MKL_IF_AVAILABLE=ON -DUSE_MKLDNN=OFF ..
```
But then I lose other kernels like the DNNL softmax, and Sockeye actually crashes randomly:
```
mxnet.base.MXNetError: MXNetError: Out of range value for value, value='inf', in operator _full(name="", dtype="float32", value="inf", ctx="cpu(0)", shape="(5, 1)")
```
(Bizarrely, this happens after it has translated many sentences with that same code.)

How do I achieve the fastest combination of DNNL softmax and MKL matrix multiply for `FullyConnected` using only documented options?
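A guess at a minimal reproducer for that error, based purely on the operator arguments in the message (hypothetical, not from the thread):
```python
# Hypothetical repro: _full is what backs mx.nd.full, and the crash above
# shows it being asked for a float32 tensor of shape (5, 1) filled with inf.
import mxnet as mx
a = mx.nd.full((5, 1), float('inf'), dtype='float32')
print(a)
```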
[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL
kpuatamazon commented on issue #17980: URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609859032

I also tried the latest DNNL:
```bash
cd 3rdparty
rm -rf mkldnn
git clone https://github.com/oneapi-src/oneDNN mkldnn
cd ../build && ninja -j 32
```
and DNNL is still slower:
```
Shape (5, 512, 512)
0.0000875 seconds for fullyconnected (DNNL)
0.0000507 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0002005 seconds for fullyconnected (DNNL)
0.0000729 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002516 seconds for fullyconnected (DNNL)
0.0000974 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003564 seconds for fullyconnected (DNNL)
0.0000981 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000928 seconds for fullyconnected (DNNL)
0.0000497 seconds for dot (MKL)
```
[GitHub] [incubator-mxnet] kpuatamazon commented on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL
kpuatamazon commented on issue #17980: URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-609843360

Here are some benchmarks on 2fff11d4233814aa4ad07858779338090ec2132d (the current tip of master) with the same Skylake c5.9xlarge. DNNL is substantially slower than MKL. (Which DNNL is in master?)
```python
#!/usr/bin/env python3
import mxnet as mx
import time

def time_procedure(shape, count, proc):
    rows, inner, cols = shape
    a = mx.nd.random_uniform(shape=(rows, inner), low=-1.0, high=1.0)
    b = mx.nd.random_uniform(shape=(cols, inner), low=-1.0, high=1.0)
    # Burn in
    proc(a, b, cols)
    mx.nd.waitall()
    begin = time.time()
    for i in range(0, count):
        proc(a, b, cols)
        mx.nd.waitall()
    return (time.time() - begin) / count

shapes = [(5, 512, 512), (5, 512, 1536), (5, 512, 2048), (5, 2048, 512), (4, 512, 512)]
count = 1000

procedures = {
    "fullyconnected (DNNL)": (lambda a, b, cols: mx.nd.FullyConnected(a, b, no_bias=True, num_hidden=cols)),
    "dot (MKL)": (lambda a, b, cols: mx.nd.dot(a, b, transpose_b=True))
}
for s in shapes:
    print("Shape " + str(s))
    stats = {}
    for name, l in procedures.items():
        stats[name] = time_procedure(s, count, l)
        print("{:.7f} seconds for {}".format(stats[name], name))
```
Run as `OMP_NUM_THREADS=4 ./mult_bench.py`:
```
Shape (5, 512, 512)
0.0000961 seconds for fullyconnected (DNNL)
0.0000509 seconds for dot (MKL)
Shape (5, 512, 1536)
0.0002011 seconds for fullyconnected (DNNL)
0.0000735 seconds for dot (MKL)
Shape (5, 512, 2048)
0.0002521 seconds for fullyconnected (DNNL)
0.0001027 seconds for dot (MKL)
Shape (5, 2048, 512)
0.0003569 seconds for fullyconnected (DNNL)
0.0001018 seconds for dot (MKL)
Shape (4, 512, 512)
0.0000946 seconds for fullyconnected (DNNL)
0.0000496 seconds for dot (MKL)
```
I don't really mind what the default BLAS implementation is. But choosing MKL should not require undocumented compile options.
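As for the parenthetical question of which DNNL is in master: one way to check is to ask git about the pinned submodule (a sketch, assuming a git checkout of MXNet with the `3rdparty/mkldnn` submodule):
```bash
# Print the commit the MXNet checkout pins for DNNL/MKL-DNN,
# and the nearest release tag reachable from that commit.
git submodule status 3rdparty/mkldnn
git -C 3rdparty/mkldnn describe --tags
```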