szha opened a new issue #18244: URL: https://github.com/apache/incubator-mxnet/issues/18244
# Description

Since #18146, we introduced parallel testing in CI in the hope of reducing test time. However, during that effort we noticed that the MKL and MKLDNN test jobs run slower than the equivalent jobs built without MKL or MKLDNN. This issue summarizes the setup and the current time differences.

## Setup

The results in this issue come from [this CI run](http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-18240/2/pipeline/), and the time difference is similar in master-branch validation. To show the results, we compare the following test nodes.

### [Python 3: CPU](http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-18240/2/pipeline/363)

#### [Build](https://github.com/apache/incubator-mxnet/blob/c329a524c520ec2b41aa4bff5ee113ff7221b790/ci/docker/runtime_functions.sh#L390-L405)

```
build_ubuntu_cpu_openblas() {
    set -ex
    cd /work/build
    CXXFLAGS="-Wno-error=strict-overflow" CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_TVM_OP=ON \
        -DUSE_CPP_PACKAGE=ON \
        -DUSE_MKL_IF_AVAILABLE=OFF \
        -DUSE_MKLDNN=OFF \
        -DUSE_CUDA=OFF \
        -DUSE_DIST_KVSTORE=ON \
        -DBUILD_CYTHON_MODULES=ON \
        -G Ninja /work/mxnet
    ninja
}
```

### [Python 3: MKL-CPU](http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-18240/2/pipeline/364)

#### [Build](https://github.com/apache/incubator-mxnet/blob/c329a524c520ec2b41aa4bff5ee113ff7221b790/ci/docker/runtime_functions.sh#L425-L438)

```
build_ubuntu_cpu_mkl() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKLDNN=OFF \
        -DUSE_CUDA=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DUSE_BLAS=MKL \
        -GNinja /work/mxnet
    ninja
}
```

### [Python 3: MKLDNN-CPU](http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-18240/2/pipeline/365)

#### [Build](https://github.com/apache/incubator-mxnet/blob/c329a524c520ec2b41aa4bff5ee113ff7221b790/ci/docker/runtime_functions.sh#L632-L645)

```
build_ubuntu_cpu_mkldnn() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKL_IF_AVAILABLE=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKLDNN=ON \
        -DUSE_CUDA=OFF \
        -DUSE_CPP_PACKAGE=ON \
        -G Ninja /work/mxnet
    ninja
}
```

### [Python 3: MKLDNN-MKL-CPU](http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-18240/2/pipeline/366)

#### [Build](https://github.com/apache/incubator-mxnet/blob/c329a524c520ec2b41aa4bff5ee113ff7221b790/ci/docker/runtime_functions.sh#L647-L660)

```
build_ubuntu_cpu_mkldnn_mkl() {
    set -ex
    cd /work/build
    CC=gcc-7 CXX=g++-7 cmake \
        -DCMAKE_BUILD_TYPE="RelWithDebInfo" \
        -DENABLE_TESTCOVERAGE=ON \
        -DUSE_MKLDNN=ON \
        -DUSE_CUDA=OFF \
        -DUSE_TVM_OP=ON \
        -DUSE_MKL_IF_AVAILABLE=ON \
        -DUSE_BLAS=MKL \
        -GNinja /work/mxnet
    ninja
}
```

## Tests

Each test node runs one of the following two test functions.

### [python3_ut](https://github.com/apache/incubator-mxnet/blob/c329a524c520ec2b41aa4bff5ee113ff7221b790/ci/jenkins/Jenkins_steps.groovy#L47-L51)

```
unittest_ubuntu_python3_cpu() {
    set -ex
    export PYTHONPATH=./python/
    export MXNET_MKLDNN_DEBUG=0  # Ignored if not present
    export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    export MXNET_SUBGRAPH_VERBOSE=0
    export MXNET_ENABLE_CYTHON=0
    export DMLC_LOG_STACK_TRACE_DEPTH=10
    pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
    pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
    pytest -n 4 --durations=50 --cov-report xml:tests_quantization.xml --verbose tests/python/quantization
}
```

### [python_ut_mkldnn](https://github.com/apache/incubator-mxnet/blob/c329a524c520ec2b41aa4bff5ee113ff7221b790/ci/jenkins/Jenkins_steps.groovy#L53-L57)

```
unittest_ubuntu_python3_cpu_mkldnn() {
    set -ex
    export PYTHONPATH=./python/
    export MXNET_MKLDNN_DEBUG=0  # Ignored if not present
    export MXNET_STORAGE_FALLBACK_LOG_VERBOSE=0
    export MXNET_SUBGRAPH_VERBOSE=0
    export MXNET_ENABLE_CYTHON=0
    export DMLC_LOG_STACK_TRACE_DEPTH=10
    pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest
    pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest
    pytest -n 4 --durations=50 --cov-report xml:tests_mkl.xml --verbose tests/python/mkl
}
```

### Test steps

To show fine-grained timing results, I break down the steps in each test node as follows:

- docker launch: the time it takes to prepare and launch the Docker container
- parallel unittest: `pytest -m 'not serial' -n 4 --durations=50 --cov-report xml:tests_unittest.xml --verbose tests/python/unittest`
- serial unittest: `pytest -m 'serial' --durations=50 --cov-report xml:tests_unittest.xml --cov-append --verbose tests/python/unittest`
- quantization test: `pytest -n 4 --durations=50 --cov-report xml:tests_quantization.xml --verbose tests/python/quantization`
- mkl test: `pytest -n 4 --durations=50 --cov-report xml:tests_mkl.xml --verbose tests/python/mkl`

# Results

All times below are in seconds. The `Python 3: CPU` results are the baseline.
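The per-step numbers in the table below are wall-clock seconds. As a point of reference, a minimal sketch of how such step timings could be collected; `timed_step` is a hypothetical helper for illustration, not the actual Jenkins pipeline code:

```python
# Hypothetical helper: run one test step and record its wall-clock time.
# The real per-step numbers come from the Jenkins pipeline, not this script.
import subprocess
import sys
import time

def timed_step(name, cmd):
    """Run cmd, return (return code, elapsed wall-clock seconds)."""
    start = time.monotonic()
    rc = subprocess.call(cmd)
    elapsed = time.monotonic() - start
    print(f"{name}: rc={rc}, {elapsed:.1f}s")
    return rc, elapsed

# Trivial stand-in command; in CI this would be e.g.
# ["pytest", "-m", "not serial", "-n", "4", "tests/python/unittest"]
rc, secs = timed_step("parallel unittest", [sys.executable, "-c", "pass"])
```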
| Test                     | docker launch | parallel unittest   | serial unittest     | quantization | mkl test |
|--------------------------|---------------|---------------------|---------------------|--------------|----------|
| Python 3: CPU            | 892           | 981                 | 951                 | 39           | N/A      |
| Python 3: MKL-CPU        | 874           | <b>4996 (+409%)</b> | <b>4380 (+361%)</b> | 67           | N/A      |
| Python 3: MKLDNN-CPU     | 892           | <b>5096 (+419%)</b> | <b>3314 (+248%)</b> | N/A          | 1080     |
| Python 3: MKLDNN-MKL-CPU | 901           | <b>3899 (+297%)</b> | <b>3507 (+269%)</b> | N/A          | 1210     |

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
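As a sanity check, the percentage increases can be recomputed from the raw timings against the `Python 3: CPU` baseline (981 s parallel, 951 s serial). Note that the MKLDNN-MKL-CPU parallel figure works out to +297% from the raw seconds:

```python
# Recompute the slowdown percentages in the results table from raw seconds.
baseline = {"parallel": 981, "serial": 951}  # Python 3: CPU

timings = {
    "Python 3: MKL-CPU":        {"parallel": 4996, "serial": 4380},
    "Python 3: MKLDNN-CPU":     {"parallel": 5096, "serial": 3314},
    "Python 3: MKLDNN-MKL-CPU": {"parallel": 3899, "serial": 3507},
}

def pct_increase(value, base):
    """Percentage increase over the baseline, rounded to a whole percent."""
    return round((value - base) / base * 100)

for node, t in timings.items():
    deltas = {k: pct_increase(v, baseline[k]) for k, v in t.items()}
    print(node, deltas)
# e.g. MKL-CPU parallel: (4996 - 981) / 981 * 100 ≈ 409
```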
