josephevans opened a new issue #20643: URL: https://github.com/apache/incubator-mxnet/issues/20643
## Description
The following tests fail consistently in the v1.9.x branch:
1. tests/python/unittest/test_gluon.py - test_hybrid_static_memory_switching
2. tests/cpp/operator/mkldnn_test.cc:103 - MKLDNN_UTIL_FUNC.MemFormat
3. tests/cpp/thread_safety/thread_safety_test.cc:314 - ThreadSafety.CachedOpFullModel

## Occurrences
1. test_gluon.test_hybrid_static_memory_switching examples:
   - Python3: MKLDNN-MKL-CPU stage: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-20626/16/pipeline/295
   - Python3: MKLDNN-CPU stage: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-20626/16/pipeline/294
2. MKLDNN_UTIL_FUNC.MemFormat examples:
   - Cpp: MKLDNN+GPU stage: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20626/14/pipeline
3. ThreadSafety.CachedOpFullModel examples:
   - capi-cpp-package GPU Makefile stage: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20626/14/pipeline/366

## Test Failure Log Output
1.
test_gluon.test_hybrid_static_memory_switching
```
[2021-10-07T01:27:03.835Z] ======================================================================
[2021-10-07T01:27:03.835Z] ERROR: test_gluon.test_hybrid_static_memory_switching
[2021-10-07T01:27:03.835Z] ----------------------------------------------------------------------
[2021-10-07T01:27:03.835Z] Traceback (most recent call last):
[2021-10-07T01:27:03.835Z] File "/usr/local/lib/python3.7/dist-packages/nose/case.py", line 198, in runTest
[2021-10-07T01:27:03.835Z] self.test(*self.arg)
[2021-10-07T01:27:03.835Z] File "/work/mxnet/tests/python/unittest/common.py", line 218, in test_new
[2021-10-07T01:27:03.835Z] orig_test(*args, **kwargs)
[2021-10-07T01:27:03.835Z] File "/work/mxnet/tests/python/unittest/test_gluon.py", line 1760, in test_hybrid_static_memory_switching
[2021-10-07T01:27:03.835Z] check_hybrid_static_memory_switching(static_alloc=True)
[2021-10-07T01:27:03.835Z] File "/work/mxnet/tests/python/unittest/test_gluon.py", line 1755, in check_hybrid_static_memory_switching
[2021-10-07T01:27:03.835Z] mx.nd.waitall()
[2021-10-07T01:27:03.835Z] File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 211, in waitall
[2021-10-07T01:27:03.835Z] check_call(_LIB.MXNDArrayWaitAll())
[2021-10-07T01:27:03.835Z] File "/work/mxnet/python/mxnet/base.py", line 246, in check_call
[2021-10-07T01:27:03.835Z] raise get_last_ffi_error()
[2021-10-07T01:27:03.835Z] mxnet.base.MXNetError: Traceback (most recent call last):
[2021-10-07T01:27:03.835Z] [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x147) [0x7f9183955ee7]
[2021-10-07T01:27:03.835Z] [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x2d8) [0x7f9183942738]
[2021-10-07T01:27:03.835Z] [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::engine::ThreadedEngine::BulkFlush()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&)+0x1c6) [0x7f9183940056]
[2021-10-07T01:27:03.835Z] [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1} >::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x17) [0x7f91838680d7]
[2021-10-07T01:27:03.835Z] [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x293) [0x7f9183867f43]
[2021-10-07T01:27:03.835Z] [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5658020) [0x7f91835bb020]
[2021-10-07T01:27:03.835Z] [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::MKLDNNRun(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)>, nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x264) [0x7f917f3eacf4]
[2021-10-07T01:27:03.835Z] [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::MKLDNNConvolutionForward(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x4f0) [0x7f917f3d6280]
[2021-10-07T01:27:03.835Z] [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::MKLDNNConvolutionForwardFullFeature(mxnet::op::MKLDNNConvFullParam const&, mxnet::OpContext const&, mxnet::op::MKLDNNConvForward*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x580) [0x7f917f3d5540]
[2021-10-07T01:27:03.835Z] [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72) [0x7f917ea45852]
[2021-10-07T01:27:03.835Z] File "src/operator/nn/mkldnn/mkldnn_convolution.cc", line 434
[2021-10-07T01:27:03.835Z] MXNetError: Check failed: weight_mem->get_desc() == fwd->GetPd().weights_desc():
[2021-10-07T01:27:03.835Z] -------------------- >> begin captured logging << --------------------
[2021-10-07T01:27:03.835Z] common: WARNING: Error seen with seeded test, use MXNET_TEST_SEED=1188622132 to reproduce.
[2021-10-07T01:27:03.835Z] --------------------- >> end captured logging << ---------------------
```
2. MKLDNN_UTIL_FUNC.MemFormat
```
[2021-10-07T02:39:35.601Z] [----------] 2 tests from MKLDNN_UTIL_FUNC
[2021-10-07T02:39:35.601Z] [ RUN ] MKLDNN_UTIL_FUNC.AlignMem
[2021-10-07T02:39:35.601Z] [ OK ] MKLDNN_UTIL_FUNC.AlignMem (1 ms)
[2021-10-07T02:39:35.601Z] [ RUN ] MKLDNN_UTIL_FUNC.MemFormat
[2021-10-07T02:39:35.601Z] unknown file: Failure
[2021-10-07T02:39:35.601Z] C++ exception with description "[02:39:59] /work/mxnet/tests/cpp/operator/mkldnn_test.cc:103: Check failed: (dnnl_format_tag_last) == (222)
[2021-10-07T02:39:35.601Z]
[2021-10-07T02:39:35.601Z] " thrown in the test body.
[2021-10-07T02:39:35.601Z] [ FAILED ] MKLDNN_UTIL_FUNC.MemFormat (0 ms)
[2021-10-07T02:39:35.601Z] [----------] 2 tests from MKLDNN_UTIL_FUNC (1 ms total)
```
3. ThreadSafety.CachedOpFullModel
```
[2021-10-07T02:32:31.459Z] [ RUN ] ThreadSafety.CachedOpFullModel
[2021-10-07T02:32:31.459Z] [02:32:53] src/nnvm/legacy_json_util.cc:208: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[2021-10-07T02:32:31.459Z] [02:32:53] src/nnvm/legacy_json_util.cc:216: Symbol successfully upgraded!
[2021-10-07T02:32:34.725Z] terminate called after throwing an instance of 'dmlc::Error'
[2021-10-07T02:32:34.725Z] what(): [02:32:57] tests/cpp/thread_safety/thread_safety_test.cc:314: MXNetError: Check failed: weight_mem->get_desc() == fwd->GetPd().weights_desc():
[2021-10-07T02:32:34.725Z] Stack trace:
[2021-10-07T02:32:34.725Z] File "src/operator/nn/mkldnn/mkldnn_convolution.cc", line 434
[2021-10-07T02:32:34.725Z]
[2021-10-07T02:32:34.725Z]
[2021-10-07T02:32:34.725Z]
[2021-10-07T02:32:34.725Z] /work/runtime_functions.sh: line 1306: 1730 Aborted (core dumped) build/tests/cpp/mxnet_unit_tests --gtest_filter="ThreadSafety.*"
```
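
For local reproduction, the commands implied by the logs above can be assembled roughly as follows. This is a minimal sketch, not a verified recipe: it assumes `nosetests` is the Python test runner (the traceback goes through nose), that the C++ test binary sits at `build/tests/cpp/mxnet_unit_tests` as in the CI log, and that everything is run from the repository root. The helper names `repro_commands` and `format_command` are hypothetical and exist only for this illustration.

```python
import shlex

# Seed printed by the CI log for the failing Python test.
SEED = "1188622132"


def repro_commands(build_dir="build"):
    """Return (env, argv) pairs for re-running the three failing tests.

    Assumptions (not confirmed by the issue): nose is the runner for the
    Python test, and the gtest binary is <build_dir>/tests/cpp/mxnet_unit_tests.
    """
    gtest_bin = f"{build_dir}/tests/cpp/mxnet_unit_tests"
    return [
        # 1. Python test, pinned to the seed reported in the captured logging.
        ({"MXNET_TEST_SEED": SEED},
         ["nosetests", "-v",
          "tests/python/unittest/test_gluon.py:test_hybrid_static_memory_switching"]),
        # 2. and 3. C++ tests, narrowed to a single test with --gtest_filter.
        ({}, [gtest_bin, "--gtest_filter=MKLDNN_UTIL_FUNC.MemFormat"]),
        ({}, [gtest_bin, "--gtest_filter=ThreadSafety.CachedOpFullModel"]),
    ]


def format_command(env, argv):
    """Render one (env, argv) pair as a shell command line."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    command = " ".join(shlex.quote(arg) for arg in argv)
    return f"{prefix} {command}".strip()


if __name__ == "__main__":
    for env, argv in repro_commands():
        print(format_command(env, argv))
```

The script only prints the command lines; whether they reproduce the failures depends on building with MKLDNN enabled, as in the failing CI stages.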