josephevans opened a new issue #19877:
URL: https://github.com/apache/incubator-mxnet/issues/19877


   ## Description
   On the v1.x pipeline, the following tests in tests/python/unittest/test_gluon_data.py fail consistently:
   
   - test_multi_worker_dataloader_release_pool
   - test_multi_worker_forked_data_loader
   
   ## Occurrences
   
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-19872/7/pipeline/293/#step-776-log-1725
   
https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-19872/4/pipeline/296
   
   Test failure logs:
   ```
   [2021-02-10T01:39:46.205Z] test_gluon_data.test_multi_worker_dataloader_release_pool ... terminate called after throwing an instance of 'dmlc::Error'
   [2021-02-10T01:39:46.205Z]   what():  [01:39:41] src/storage/./cpu_shared_storage_manager.h:218: Check failed: count >= 0 (-2 vs. 0) : 
   [2021-02-10T01:39:46.205Z] Stack trace:
   [2021-02-10T01:39:46.205Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x61) [0x7f191fc63b61]
   [2021-02-10T01:39:46.205Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::FreeImpl(mxnet::Storage::Handle const&)+0xd3) [0x7f192522fdf3]
   [2021-02-10T01:39:46.205Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::Free(mxnet::Storage::Handle)+0x98) [0x7f1925237348]
   [2021-02-10T01:39:46.205Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::StorageImpl::Free(mxnet::Storage::Handle)+0x69) [0x7f1925232ce9]
   [2021-02-10T01:39:46.205Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5ade409) [0x7f1924b21409]
   [2021-02-10T01:39:46.205Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x61d3c50) [0x7f1925216c50]
   [2021-02-10T01:39:46.205Z]   [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xa50) [0x7f1925210440]
   [2021-02-10T01:39:46.205Z]   [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x349) [0x7f192522c9d9]
   [2021-02-10T01:39:46.205Z]   [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x42b) [0x7f1925219f5b]
   [2021-02-10T01:39:46.205Z]   [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0xd8) [0x7f1925216948]
   [2021-02-10T01:39:46.461Z] /work/runtime_functions.sh: line 1008:     6 Aborted                 (core dumped) nosetests-3.4 $NOSE_COVERAGE_ARGUMENTS $NOSE_TIMER_ARGUMENTS --with-xunit --xunit-file nosetests_unittest.xml --verbose 
   ```
   
   ```
   [2021-02-09T22:11:59.574Z] ======================================================================
   [2021-02-09T22:11:59.574Z] ERROR: test_gluon_data.test_multi_worker_forked_data_loader
   [2021-02-09T22:11:59.574Z] ----------------------------------------------------------------------
   [2021-02-09T22:11:59.574Z] Traceback (most recent call last):
   [2021-02-09T22:11:59.574Z]   File "/usr/local/lib/python3.7/dist-packages/nose/case.py", line 198, in runTest
   [2021-02-09T22:11:59.574Z]     self.test(*self.arg)
   [2021-02-09T22:11:59.574Z]   File "/work/mxnet/tests/python/unittest/common.py", line 226, in test_new
   [2021-02-09T22:11:59.574Z]     mx.nd.waitall()
   [2021-02-09T22:11:59.574Z]   File "/work/mxnet/python/mxnet/ndarray/ndarray.py", line 211, in waitall
   [2021-02-09T22:11:59.574Z]     check_call(_LIB.MXNDArrayWaitAll())
   [2021-02-09T22:11:59.574Z]   File "/work/mxnet/python/mxnet/base.py", line 246, in check_call
   [2021-02-09T22:11:59.574Z]     raise get_last_ffi_error()
   [2021-02-09T22:11:59.574Z] mxnet.base.MXNetError: Traceback (most recent call last):
   [2021-02-09T22:11:59.574Z]   [bt] (9) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*, bool)+0xd8) [0x7f0df6da1c48]
   [2021-02-09T22:11:59.574Z]   [bt] (8) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::Push(mxnet::engine::Opr*, mxnet::Context, int, bool)+0x42b) [0x7f0df6da525b]
   [2021-02-09T22:11:59.574Z]   [bt] (7) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)+0x349) [0x7f0df6db7e69]
   [2021-02-09T22:11:59.574Z]   [bt] (6) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xa50) [0x7f0df6d9b740]
   [2021-02-09T22:11:59.574Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x63dbf50) [0x7f0df6da1f50]
   [2021-02-09T22:11:59.574Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x5cde545) [0x7f0df66a4545]
   [2021-02-09T22:11:59.574Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::StorageImpl::Free(mxnet::Storage::Handle)+0x69) [0x7f0df6dbe0b9]
   [2021-02-09T22:11:59.574Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::Free(mxnet::Storage::Handle)+0x98) [0x7f0df6dc2718]
   [2021-02-09T22:11:59.574Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::storage::CPUSharedStorageManager::FreeImpl(mxnet::Storage::Handle const&)+0xcf) [0x7f0df6dbb27f]
   [2021-02-09T22:11:59.574Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x61) [0x7f0df16c59e1]
   [2021-02-09T22:11:59.574Z]   File "src/storage/./cpu_shared_storage_manager.h", line 218
   [2021-02-09T22:11:59.574Z] MXNetError: Check failed: count >= 0 (-1 vs. 0) : 
   ```
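
Both logs fail on the same check, `count >= 0`, raised from `CPUSharedStorageManager::FreeImpl`. One possible reading, offered as an assumption and not confirmed by the logs, is that `count` is a reference counter on a shared-memory block and that the block is being released more often than it was acquired. The toy model below only illustrates that failure mode; `SharedSegment`, `share`, and `free` are hypothetical names and this is not MXNet's implementation.

```python
class SharedSegment:
    """Hypothetical stand-in for a reference-counted shared-memory block."""

    def __init__(self):
        self.refcount = 1  # the creator holds the first reference

    def share(self):
        # Another holder (for example a worker process) takes a reference.
        self.refcount += 1

    def free(self):
        # Drop one reference, then enforce the same invariant as the
        # failing check in the logs: the count must never go below zero.
        self.refcount -= 1
        if self.refcount < 0:
            raise RuntimeError(
                "Check failed: count >= 0 ({} vs. 0)".format(self.refcount))
        # True means the last reference is gone and memory could be released.
        return self.refcount == 0


seg = SharedSegment()
seg.share()   # count is now 2
seg.free()    # count is now 1
seg.free()    # count is now 0, the block would be released here
try:
    seg.free()  # one release too many, count drops to -1
except RuntimeError as err:
    message = str(err)
    print(message)
```

If this reading is right, the `-1` and `-2` values in the two logs would correspond to one and two surplus releases of the same block, which would point at how handles are shared between the data loader's worker processes and the parent.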
   
   

