[GitHub] [incubator-mxnet] TaoLv opened a new issue #17341: Segfault of test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker

GitBox Thu, 16 Jan 2020 06:56:59 -0800

TaoLv opened a new issue #17341: Segfault of test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker
URL: https://github.com/apache/incubator-mxnet/issues/17341
 
 
   ## Description
   This may not be plain flakiness. I hit the crash in my MKL-DNN upgrade PR (#17313), which does not appear to be related to this test.
   I am filing this issue to see whether anyone else has hit the same problem, and I hope someone familiar with the threaded engine can take a look.
   
   ## Occurrences
   
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-17313/2/pipeline/299
   
   ## What have you tried to solve it?
   
   Back trace:
   ```
   (gdb) bt
   #0  0x00007f68610b898d in pthread_join (threadid=140079671015168, thread_return=0x0) at pthread_join.c:90
   #1  0x00007f68575f4793 in std::thread::join() () from target:/usr/lib/x86_64-linux-gnu/libstdc++.so.6
   #2  0x00007f6853460407 in mxnet::engine::ThreadPool::~ThreadPool (this=0x20b8ce0, __in_chrg=<optimized out>) at src/engine/./thread_pool.h:84
   #3  std::default_delete<mxnet::engine::ThreadPool>::operator() (this=<optimized out>, __ptr=0x20b8ce0) at /usr/include/c++/5/bits/unique_ptr.h:76
   #4  std::unique_ptr<mxnet::engine::ThreadPool, std::default_delete<mxnet::engine::ThreadPool> >::~unique_ptr (this=0x2c20bf8, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/unique_ptr.h:236
   #5  mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>::~ThreadWorkerBlock (this=0x2c20b30, __in_chrg=<optimized out>) at src/engine/threaded_engine_perdevice.cc:214
   #6  std::_Sp_counted_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:374
   #7  0x00007f684f65a3ea in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x21e2ce0) at /usr/include/c++/5/bits/shared_ptr_base.h:150
   #8  0x00007f685345c50b in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:659
   #9  std::__shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/shared_ptr_base.h:925
   #10 std::__shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, (__gnu_cxx::_Lock_policy)2>::operator=(std::__shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>, (__gnu_cxx::_Lock_policy)2>&&) (__r=<optimized out>, this=<synthetic pointer>) at /usr/include/c++/5/bits/shared_ptr_base.h:1000
   #11 std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >::operator=(std::shared_ptr<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >&&) (__r=<optimized out>, this=<synthetic pointer>) at /usr/include/c++/5/bits/shared_ptr.h:294
   #12 mxnet::common::LazyAllocArray<mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0> >::Clear (this=this@entry=0x1f230f8) at src/engine/../common/lazy_alloc_array.h:149
   #13 0x00007f685345fb2c in mxnet::engine::ThreadedEnginePerDevice::StopNoWait (this=0x1f22ff0) at src/engine/threaded_engine_perdevice.cc:67
   #14 mxnet::engine::ThreadedEnginePerDevice::Stop (this=0x1f22ff0) at src/engine/threaded_engine_perdevice.cc:74
   #15 0x00007f685357dfb6 in mxnet::LibraryInitializer::atfork_prepare (this=<optimized out>) at src/initialize.cc:196
   ```
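
Read bottom-up, the trace suggests that a fork-preparation hook (frame #15) stops the engine, which destroys the worker blocks and ends in a thread join (frames #0 to #2). For readers unfamiliar with that pattern, here is a minimal Python sketch of "join all workers before fork". It is not MXNet code and does not reproduce the segfault; `WorkerPool` and `fork_and_wait` are hypothetical names, and it assumes a Unix system with Python 3.7+ for `os.register_at_fork`.

```python
import os
import queue
import threading


class WorkerPool:
    """Toy stand-in for a thread pool whose shutdown joins every worker."""

    def __init__(self, num_workers):
        self._tasks = queue.Queue()
        self._stopped = False
        self._threads = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]
        for t in self._threads:
            t.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: this worker should exit
                return
            task()

    def stop(self):
        # Loosely mirrors frames #0-#2: block until every worker has exited.
        if self._stopped:
            return
        self._stopped = True
        for _ in self._threads:
            self._tasks.put(None)  # one sentinel per worker
        for t in self._threads:
            t.join()

    def alive_workers(self):
        return sum(t.is_alive() for t in self._threads)


def fork_and_wait():
    """Fork, let the child exit immediately, and return its exit code."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: leave without running any cleanup handlers
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


pool = WorkerPool(num_workers=2)

# Loosely mirrors frame #15: a hook that runs in the parent just before fork().
os.register_at_fork(before=pool.stop)

exit_code = fork_and_wait()
```

In this sketch the hook runs synchronously inside `os.fork()`, so the fork cannot proceed until every worker thread has been joined, which is the same place the trace above ends up.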
   
   1. Adding `DEBUG=1` to the make line gets rid of the problem.
   2. I did not observe the problem when running the test on its own or when running only the test file test_gluon_data.py.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
