akarbown commented on pull request #20474: URL: https://github.com/apache/incubator-mxnet/pull/20474#issuecomment-938160202
@szha, @leezu, @barry-jin - could I ask you to review this PR? I've added a new GitHub Actions pipeline to test and verify this change with oneMKL on macOS. It revealed a few things that needed fixing for that OS:

- removed the use of `-Wl,--start-group` / `-Wl,--end-group` when linking static MKL libraries in FindBLAS.cmake (issue mentioned here: https://gitlab.kitware.com/cmake/cmake/-/issues/20548);
- set the proper threading layer at runtime. According to the [documentation](https://software.intel.com/content/dam/develop/external/us/en/documents/onemkl-developerguide-mac.pdf), it needs to be set with `mkl_set_threading_layer(MKL_THREADING_INTEL)`;
- added a GitHub action (based on the existing [os_x_staticbuild.yml](https://github.com/apache/incubator-mxnet/blob/master/.github/workflows/os_x_staticbuild.yml)) that builds against oneMKL using the static MKL libraries (to be consistent with the already existing scripts) and runs the same tests as os_x_staticbuild.yml plus the MKL tests;
- fixed hangs that appeared while running those tests. They were the result of numpy linking against OpenBLAS instead of MKL BLAS; as a consequence libgomp was linked as well, which resulted in the hang (two OpenMP runtimes in one process). Recompiling numpy (done in the numpy_mkl.sh file) resolved the issue;
- excluded the `test_bf16_operator` tests from that action pipeline, as the macOS CI does not seem to support AVX512;
- tested locally MXNet linked with static, dynamic and SDL (Single Dynamic Library) MKL on macOS; all the tests (from os_x_staticbuild.yml plus the MKL tests) seem to pass without any hang.

The change now seems to be tested and checked on macOS with MKL BLAS. Do you think keeping that new GitHub action for MKL on macOS makes sense? If so, can it stay as it is, or should it be changed somehow?

**Remark**: I see that windows-gpu fails, but that is probably not connected with this change; it may instead be related to the VS 2019 version 16.11 release.
As far as I can see, with v16.8.1 (MSVC 19.28.29333.0) it [passed](https://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-20474/runs/12/nodes/40/steps/84/log/?start=0) without any issues, while with v16.11.4 (MSVC 19.29.30136.0) it [fails](https://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-20474/runs/13/nodes/40/steps/84/log/?start=0). But I'm not 100% sure.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
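
For reference, the hang described in the comment (numpy pulling in libgomp alongside the Intel threading runtime used by MKL) comes down to two OpenMP runtimes being loaded into one process. A minimal Python sketch of a check for that situation follows. It is not part of this PR: the function names and the library-name pattern are hypothetical, and the list of library paths is assumed to have been collected separately (for example from `otool -L` output on macOS).

```python
import os
import re

# Hypothetical pattern: matches the file names of common OpenMP runtimes,
# e.g. libgomp.1.dylib (GNU), libiomp5.dylib (Intel), libomp.dylib (LLVM).
_OPENMP_LIB = re.compile(r"^lib(gomp|iomp5|omp)[.\-]")


def find_openmp_runtimes(library_paths):
    """Return the sorted, de-duplicated OpenMP runtime names found in the paths."""
    found = set()
    for path in library_paths:
        match = _OPENMP_LIB.match(os.path.basename(path))
        if match:
            found.add(match.group(1))
    return sorted(found)


def has_openmp_conflict(library_paths):
    """True when more than one distinct OpenMP runtime appears in the paths."""
    return len(find_openmp_runtimes(library_paths)) > 1


if __name__ == "__main__":
    # Illustrative paths only; real ones depend on the local installation.
    libs = [
        "/usr/local/lib/libgomp.1.dylib",
        "/opt/intel/lib/libiomp5.dylib",
        "/usr/lib/libSystem.B.dylib",
    ]
    print(find_openmp_runtimes(libs))
    print(has_openmp_conflict(libs))
```

A check along these lines could run before the test step of a CI job, so that a mixed libgomp/libiomp5 setup shows up as a clear failure rather than as a hang.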
