josephevans opened a new issue #20643:
URL: https://github.com/apache/incubator-mxnet/issues/20643


   ## Description
   The following tests fail consistently on the v1.9.x branch:
   
   1. tests/python/unittest/test_gluon.py - test_hybrid_static_memory_switching
   2. tests/cpp/operator/mkldnn_test.cc:103 - MKLDNN_UTIL_FUNC.MemFormat
   3. tests/cpp/thread_safety/thread_safety_test.cc:314 - ThreadSafety.CachedOpFullModel
   
   ## Occurrences
   
   1. test_gluon.test_hybrid_static_memory_switching examples:
     - Python3: MKLDNN-MKL-CPU stage: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-20626/16/pipeline/295
     - Python3: MKLDNN-CPU stage: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-20626/16/pipeline/294
   2. MKLDNN_UTIL_FUNC.MemFormat examples:
     - Cpp: MKLDNN+GPU stage: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20626/14/pipeline
   3. ThreadSafety.CachedOpFullModel examples:
     - capi-cpp-package GPU Makefile stage: https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-20626/14/pipeline/366
   
   ## Test Failure Log Output
   
   1. test_gluon.test_hybrid_static_memory_switching
   ```
   [2021-10-07T01:27:03.835Z] ======================================================================
   [2021-10-07T01:27:03.835Z] ERROR: test_gluon.test_hybrid_static_memory_switching
   [2021-10-07T01:27:03.835Z] ----------------------------------------------------------------------
   [2021-10-07T01:27:03.835Z] Traceback (most recent call last):
   [2021-10-07T01:27:03.835Z]   File "/usr/local/lib/python3.7/dist-packages/nose/case.py", line 198, in runTest
   [2021-10-07T01:27:03.835Z]     self.test(*self.arg)
   [2021-10-07T01:27:03.835Z]   File "/work/mxnet/tests/python/unittest/common.py", line 218, in test_new
   [2021-10-07T01:27:03.835Z]     orig_test(*args, **kwargs)
   [2021-10-07T01:27:03.835Z]   File "/work/mxnet/tests/python/unittest/test_gluon.py", line 1760, in test_hybrid_static_memory_switching
   [2021-10-07T01:27:03.835Z]     check_hybrid_static_memory_switching(static_alloc=True)
   [2021-10-07T01:27:03.835Z]   File "/work/mxnet/tests/python/unittest/test_gluon.py", line 1755, in check_hybrid_static_memory_switching
   [2021-10-07T01:27:03.835Z]     mx.nd.waitall()
   [2021-10-07T01:27:03.835Z]   File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 211, in waitall
   [2021-10-07T01:27:03.835Z]     check_call(_LIB.MXNDArrayWaitAll())
   [2021-10-07T01:27:03.835Z]   File "/work/mxnet/python/mxnet/base.py", line 246, in check_call
   [2021-10-07T01:27:03.835Z]     raise get_last_ffi_error()
   [2021-10-07T01:27:03.835Z] mxnet.base.MXNetError: Traceback (most recent call last):
   [2021-10-07T01:27:03.835Z]   [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x147) [0x7f9183955ee7]
   [2021-10-07T01:27:03.835Z]   [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x2d8) [0x7f9183942738]
   [2021-10-07T01:27:03.835Z]   [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::engine::ThreadedEngine::BulkFlush()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&)+0x1c6) [0x7f9183940056]
   [2021-10-07T01:27:03.835Z]   [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1} >::_M_invoke(std::_Any_data const&, mxnet::RunContext&&)+0x17) [0x7f91838680d7]
   [2021-10-07T01:27:03.835Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x293) [0x7f9183867f43]
   [2021-10-07T01:27:03.835Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5658020) [0x7f91835bb020]
   [2021-10-07T01:27:03.835Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::MKLDNNRun(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)>, nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x264) [0x7f917f3eacf4]
   [2021-10-07T01:27:03.835Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::MKLDNNConvolutionForward(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x4f0) [0x7f917f3d6280]
   [2021-10-07T01:27:03.835Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::MKLDNNConvolutionForwardFullFeature(mxnet::op::MKLDNNConvFullParam const&, mxnet::OpContext const&, mxnet::op::MKLDNNConvForward*, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&)+0x580) [0x7f917f3d5540]
   [2021-10-07T01:27:03.835Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x72) [0x7f917ea45852]
   [2021-10-07T01:27:03.835Z]   File "src/operator/nn/mkldnn/mkldnn_convolution.cc", line 434
   [2021-10-07T01:27:03.835Z] MXNetError: Check failed: weight_mem->get_desc() == fwd->GetPd().weights_desc(): 
   [2021-10-07T01:27:03.835Z] -------------------- >> begin captured logging << --------------------
   [2021-10-07T01:27:03.835Z] common: WARNING: Error seen with seeded test, use MXNET_TEST_SEED=1188622132 to reproduce.
   [2021-10-07T01:27:03.835Z] --------------------- >> end captured logging << ---------------------
   ```
   2. MKLDNN_UTIL_FUNC.MemFormat
   ```
   [2021-10-07T02:39:35.601Z] [----------] 2 tests from MKLDNN_UTIL_FUNC
   [2021-10-07T02:39:35.601Z] [ RUN      ] MKLDNN_UTIL_FUNC.AlignMem
   [2021-10-07T02:39:35.601Z] [       OK ] MKLDNN_UTIL_FUNC.AlignMem (1 ms)
   [2021-10-07T02:39:35.601Z] [ RUN      ] MKLDNN_UTIL_FUNC.MemFormat
   [2021-10-07T02:39:35.601Z] unknown file: Failure
   [2021-10-07T02:39:35.601Z] C++ exception with description "[02:39:59] /work/mxnet/tests/cpp/operator/mkldnn_test.cc:103: Check failed: (dnnl_format_tag_last) == (222) 
   [2021-10-07T02:39:35.601Z] 
   [2021-10-07T02:39:35.601Z] " thrown in the test body.
   [2021-10-07T02:39:35.601Z] [  FAILED  ] MKLDNN_UTIL_FUNC.MemFormat (0 ms)
   [2021-10-07T02:39:35.601Z] [----------] 2 tests from MKLDNN_UTIL_FUNC (1 ms total)
   ```
   
   3. ThreadSafety.CachedOpFullModel
   ```
   [2021-10-07T02:32:31.459Z] [ RUN      ] ThreadSafety.CachedOpFullModel
   [2021-10-07T02:32:31.459Z] [02:32:53] src/nnvm/legacy_json_util.cc:208: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
   [2021-10-07T02:32:31.459Z] [02:32:53] src/nnvm/legacy_json_util.cc:216: Symbol successfully upgraded!
   [2021-10-07T02:32:34.725Z] terminate called after throwing an instance of 'dmlc::Error'
   [2021-10-07T02:32:34.725Z]   what():  [02:32:57] tests/cpp/thread_safety/thread_safety_test.cc:314: MXNetError: Check failed: weight_mem->get_desc() == fwd->GetPd().weights_desc(): 
   [2021-10-07T02:32:34.725Z] Stack trace:
   [2021-10-07T02:32:34.725Z]   File "src/operator/nn/mkldnn/mkldnn_convolution.cc", line 434
   [2021-10-07T02:32:34.725Z] 
   [2021-10-07T02:32:34.725Z] 
   [2021-10-07T02:32:34.725Z] 
   [2021-10-07T02:32:34.725Z] /work/runtime_functions.sh: line 1306:  1730 Aborted                 (core dumped) build/tests/cpp/mxnet_unit_tests --gtest_filter="ThreadSafety.*"
   ```
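
Failures 1 and 3 above stop at the same check in `src/operator/nn/mkldnn/mkldnn_convolution.cc`, line 434, while failure 2 hits a different one. For triaging more Jenkins logs like these, a small script can group tests by the failed condition and pick up any `MXNET_TEST_SEED` hint. This is only a rough sketch and not part of the MXNet tooling: the helper name `summarize`, the `SAMPLE_LOGS` layout, and the regular expressions are assumptions for illustration, and it expects each log record on a single line.

```python
import re
from collections import defaultdict

# Hypothetical patterns: they match the message shapes seen in the logs above.
CHECK_FAILED = re.compile(r"Check failed: (?P<cond>.*?)\s*:?\s*$")
SEED = re.compile(r"MXNET_TEST_SEED=(\d+)")

# A few lines copied from the logs in this issue, one record per line.
SAMPLE_LOGS = {
    "test_gluon.test_hybrid_static_memory_switching": (
        "[2021-10-07T01:27:03.835Z] MXNetError: Check failed: "
        "weight_mem->get_desc() == fwd->GetPd().weights_desc(): \n"
        "[2021-10-07T01:27:03.835Z] common: WARNING: Error seen with seeded "
        "test, use MXNET_TEST_SEED=1188622132 to reproduce.\n"
    ),
    "MKLDNN_UTIL_FUNC.MemFormat": (
        '[2021-10-07T02:39:35.601Z] C++ exception with description "[02:39:59] '
        "/work/mxnet/tests/cpp/operator/mkldnn_test.cc:103: Check failed: "
        "(dnnl_format_tag_last) == (222) \n"
    ),
    "ThreadSafety.CachedOpFullModel": (
        "[2021-10-07T02:32:34.725Z]   what():  [02:32:57] "
        "tests/cpp/thread_safety/thread_safety_test.cc:314: MXNetError: "
        "Check failed: weight_mem->get_desc() == fwd->GetPd().weights_desc(): \n"
    ),
}


def summarize(logs):
    """Group test names by failed check and collect reported test seeds.

    `logs` maps a test name to its raw log text. Returns a pair
    (checks, seeds): checks maps each failed condition to the tests that
    hit it, and seeds maps a test name to its reported seed.
    """
    checks = defaultdict(list)
    seeds = {}
    for test_name, text in logs.items():
        for line in text.splitlines():
            failed = CHECK_FAILED.search(line)
            if failed:
                tests = checks[failed.group("cond")]
                if test_name not in tests:
                    tests.append(test_name)
            seed = SEED.search(line)
            if seed:
                seeds[test_name] = int(seed.group(1))
    return dict(checks), seeds


if __name__ == "__main__":
    checks, seeds = summarize(SAMPLE_LOGS)
    for condition, tests in checks.items():
        print(condition, "->", ", ".join(tests))
    print("seeds:", seeds)
```

A reported seed can then be exported as `MXNET_TEST_SEED` before rerunning the Python test, as the captured warning in the first log suggests.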


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@mxnet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


