samskalicky opened a new pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986


   I've been hitting this issue when running tests: all tests pass, but as the 
process starts to exit it fails with a core dump:
   ```
   pure virtual method called
   terminate called without an active exception
   Aborted (core dumped)
   
   #5  0x00007ffff11d9988 in __cxxabiv1::__cxa_pure_virtual ()
       at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/pure.cc:50
   #6  0x00007fff45589a82 in tvm::runtime::NDArray::Internal::DefaultDeleter (ptr_obj=0x55555754ece0)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/src/runtime/ndarray.cc:97
   #7  0x00007fff4557d439 in tvm::runtime::Object::DecRef (this=0x55555754ece0)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:833
   #8  0x00007fff455b2815 in tvm::runtime::ObjectPtr<tvm::runtime::Object>::reset (this=0x5555571c8c00)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:439
   #9  0x00007fff45598698 in tvm::runtime::ObjectPtr<tvm::runtime::Object>::~ObjectPtr (this=0x5555571c8c00, __in_chrg=<optimized out>)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:388
   #10 0x00007fff4557d4aa in tvm::runtime::ObjectRef::~ObjectRef (this=0x5555571c8c00, __in_chrg=<optimized out>)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:511
   #11 0x00007fff4557df1e in tvm::runtime::NDArray::~NDArray (this=0x5555571c8c00, __in_chrg=<optimized out>)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/ndarray.h:42
   #12 0x00007fff455fafb3 in std::_Destroy<tvm::runtime::NDArray> (__pointer=0x5555571c8c00)
       at /usr/include/c++/5/bits/stl_construct.h:93
   #13 0x00007fff455edee1 in std::_Destroy_aux<false>::__destroy<tvm::runtime::NDArray*> (__first=0x5555571c8c00, __last=0x5555571c8c10)
       at /usr/include/c++/5/bits/stl_construct.h:103
   #14 0x00007fff455dfa22 in std::_Destroy<tvm::runtime::NDArray*> (__first=0x5555571c8c00, __last=0x5555571c8c10)
       at /usr/include/c++/5/bits/stl_construct.h:126
   #15 0x00007fff455cd124 in std::_Destroy<tvm::runtime::NDArray*, tvm::runtime::NDArray> (__first=0x5555571c8c00, __last=0x5555571c8c10)
       at /usr/include/c++/5/bits/stl_construct.h:151
   #16 0x00007fff455e0d81 in std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >::~vector (this=0x55555752d2e8, __in_chrg=<optimized out>)
       at /usr/include/c++/5/bits/stl_vector.h:424
   #17 0x00007fff455e0ec8 in tvm::runtime::GraphRuntime::~GraphRuntime (this=0x55555752d130, __in_chrg=<optimized out>)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/src/runtime/graph/graph_runtime.h:73
   #18 0x00007fff455e0fb8 in tvm::runtime::GraphRuntime::~GraphRuntime (this=0x55555752d130, __in_chrg=<optimized out>)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/src/runtime/graph/graph_runtime.h:73
   ```
   
   It looks like there's a race condition in TVM's shutdown sequence: an 
NDArray is being destructed after the DeviceAPI object has already been 
destructed, so when the NDArray calls FreeDataSpace to free its memory it 
runs into the “pure virtual method called” error.
   
   I added a destructor to the CUDADeviceAPI class 
(https://github.com/neo-ai/tvm/blob/dev/src/runtime/cuda/cuda_device_api.cc#L37)
 with a print statement and was able to confirm that the destructor was being 
called before the NDArray was destructed. This confirms the root cause, that 
the CUDA DeviceAPI was destructed before all the NDArrays were destructed (and 
their underlying memory freed).
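
   The ordering observed above is consistent with the C++ rule that objects 
are destroyed in the reverse order of their construction. The following is a 
minimal sketch of that rule, using block-scope objects as a stand-in for 
static ones so the order can be observed without exiting the process. Tracer, 
NDArrayHolder and DestructionOrder are hypothetical names for illustration, 
not TVM code, and it is only an assumption that the object holding the 
NDArrays was constructed before the CUDADeviceAPI singleton was first used.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not TVM code): records its name when destroyed.
class Tracer {
 public:
  Tracer(std::string name, std::vector<std::string>* log)
      : name_(std::move(name)), log_(log) {
    assert(log_ != nullptr);
  }
  ~Tracer() { log_->push_back("~" + name_); }

 private:
  std::string name_;
  std::vector<std::string>* log_;
};

// Returns the destructor names in the order the destructors ran.
std::vector<std::string> DestructionOrder() {
  std::vector<std::string> log;
  {
    Tracer holder("NDArrayHolder", &log);  // constructed first (assumed)
    Tracer api("CUDADeviceAPI", &log);     // constructed second (assumed)
  }  // scope ends: api is destroyed first, holder second
  return log;
}
```

   In this model the later-constructed object is torn down first, so anything 
the earlier-constructed holder does in its own destructor runs after the API 
object is already gone.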
   
   Basically, the issue is that the CUDADeviceAPI singleton is destructed 
before all GPU NDArrays are freed. The quick fix is to allow the 
CUDADeviceAPI singleton to be reconstructed after it has been destructed, so 
that it can be used to free the remaining GPU NDArrays.
   
   The DeviceAPIManager class 
(https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L91)
 is a singleton that maintains a map of DeviceAPI objects, one for each 
context type (CPU, GPU, etc.). The Global API 
(https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L107)
 is the static singleton “get_instance” function. The GetAPI API 
(https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L112)
 is used to get the DeviceAPI object for a particular context type, which is 
looked up in api_. If we reset the entries of api_ to nullptr when the 
DeviceAPIManager is destructed (e72b64b), each DeviceAPI object will be 
reconstructed on its next lookup. When the singleton CUDADeviceAPI class is 
reconstructed, we also need to reset its static shared_ptr (3e50586).
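
   The idea of clearing the slots and reconstructing on the next lookup can 
be sketched as follows. This is a simplified model under stated assumptions, 
not the code in this PR: FakeGpuAPI, g_constructed, Shutdown and 
ConstructionsAcrossShutdown are hypothetical, the slots own their objects 
through unique_ptr for brevity, and teardown is simulated with an explicit 
Shutdown() call rather than by touching an object after its destructor has 
run.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical, simplified stand-ins for illustration; not TVM's real classes.
constexpr std::size_t kMaxDeviceAPI = 4;
int g_constructed = 0;  // counts FakeGpuAPI constructions

struct DeviceAPI {
  virtual ~DeviceAPI() = default;
  virtual void FreeDataSpace(void* ptr) = 0;
};

struct FakeGpuAPI : DeviceAPI {
  FakeGpuAPI() { ++g_constructed; }
  void FreeDataSpace(void* /*ptr*/) override {}  // no real memory to free here
};

class DeviceAPIManager {
 public:
  // Returns the API for a slot, constructing it lazily if the slot is empty.
  DeviceAPI* GetAPI(std::size_t type) {
    assert(type < kMaxDeviceAPI);
    if (api_[type] == nullptr) {
      api_[type].reset(new FakeGpuAPI());
    }
    return api_[type].get();
  }

  // Simulated teardown: empty every slot so a later lookup reconstructs.
  void Shutdown() {
    for (auto& slot : api_) slot.reset();
  }

 private:
  std::array<std::unique_ptr<DeviceAPI>, kMaxDeviceAPI> api_{};
};

// Counts constructions across a lookup, a simulated shutdown, and a late free.
int ConstructionsAcrossShutdown() {
  g_constructed = 0;
  DeviceAPIManager manager;
  manager.GetAPI(1)->FreeDataSpace(nullptr);  // first use: constructs
  manager.GetAPI(1);                          // cached: no new construction
  manager.Shutdown();                         // simulated teardown
  manager.GetAPI(1)->FreeDataSpace(nullptr);  // late free: reconstructs
  return g_constructed;
}
```

   Because a cleared slot is indistinguishable from one that was never 
filled, the late FreeDataSpace call takes the same lazy-construction path as 
the first lookup and does not reach a destructed object.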

