samskalicky opened a new pull request #5986: URL: https://github.com/apache/incubator-tvm/pull/5986
I've been hitting this issue when running tests: all tests pass, and then the process fails with a core dump as it starts to exit:

```
pure virtual method called
terminate called without an active exception
Aborted (core dumped)

#5  0x00007ffff11d9988 in __cxxabiv1::__cxa_pure_virtual () at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/pure.cc:50
#6  0x00007fff45589a82 in tvm::runtime::NDArray::Internal::DefaultDeleter (ptr_obj=0x55555754ece0) at /home/ubuntu/NeoMXNet/3rdparty/tvm/src/runtime/ndarray.cc:97
#7  0x00007fff4557d439 in tvm::runtime::Object::DecRef (this=0x55555754ece0) at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:833
#8  0x00007fff455b2815 in tvm::runtime::ObjectPtr<tvm::runtime::Object>::reset (this=0x5555571c8c00) at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:439
#9  0x00007fff45598698 in tvm::runtime::ObjectPtr<tvm::runtime::Object>::~ObjectPtr (this=0x5555571c8c00, __in_chrg=<optimized out>) at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:388
#10 0x00007fff4557d4aa in tvm::runtime::ObjectRef::~ObjectRef (this=0x5555571c8c00, __in_chrg=<optimized out>) at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:511
#11 0x00007fff4557df1e in tvm::runtime::NDArray::~NDArray (this=0x5555571c8c00, __in_chrg=<optimized out>) at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/ndarray.h:42
#12 0x00007fff455fafb3 in std::_Destroy<tvm::runtime::NDArray> (__pointer=0x5555571c8c00) at /usr/include/c++/5/bits/stl_construct.h:93
#13 0x00007fff455edee1 in std::_Destroy_aux<false>::__destroy<tvm::runtime::NDArray*> (__first=0x5555571c8c00, __last=0x5555571c8c10) at /usr/include/c++/5/bits/stl_construct.h:103
#14 0x00007fff455dfa22 in std::_Destroy<tvm::runtime::NDArray*> (__first=0x5555571c8c00, __last=0x5555571c8c10) at /usr/include/c++/5/bits/stl_construct.h:126
#15 0x00007fff455cd124 in std::_Destroy<tvm::runtime::NDArray*, tvm::runtime::NDArray> (__first=0x5555571c8c00, __last=0x5555571c8c10) at /usr/include/c++/5/bits/stl_construct.h:151
#16 0x00007fff455e0d81 in std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >::~vector (this=0x55555752d2e8, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/stl_vector.h:424
#17 0x00007fff455e0ec8 in tvm::runtime::GraphRuntime::~GraphRuntime (this=0x55555752d130, __in_chrg=<optimized out>) at /home/ubuntu/NeoMXNet/3rdparty/tvm/src/runtime/graph/graph_runtime.h:73
#18 0x00007fff455e0fb8 in tvm::runtime::GraphRuntime::~GraphRuntime (this=0x55555752d130, __in_chrg=<optimized out>) at /home/ubuntu/NeoMXNet/3rdparty/tvm/src/runtime/graph/graph_runtime.h:73
```

It looks like there's a race condition in the shutdown sequence in TVM: an NDArray is being destructed after the DeviceAPI object has already been destructed, so when the NDArray deleter calls FreeDataSpace to free its memory, it runs into the “pure virtual method called” error.

I added a destructor with a print statement to the CUDADeviceAPI class (https://github.com/neo-ai/tvm/blob/dev/src/runtime/cuda/cuda_device_api.cc#L37) and confirmed that this destructor runs before the NDArray is destructed. This confirms the root cause: the CUDADeviceAPI singleton is destructed before all GPU NDArrays are destructed and their underlying memory freed.

The quick fix is to allow the CUDADeviceAPI singleton to be re-constructed after it has been destructed, so that it can be used to free the remaining GPU NDArrays.

The DeviceAPIManager class (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L91) is a singleton that maintains a map of DeviceAPI objects for each context (CPU, GPU, etc).
The Global API (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L107) is the static singleton “get_instance” function. The GetAPI API (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L112) is used to get the DeviceAPI object for a particular context type, which is looked up in api_.

If we reset the entries of api_ to nullptr upon destruction (e72b64b), each DeviceAPI object will be reconstructed on its next lookup. Upon reconstruction of the singleton CUDADeviceAPI class, we need to reset the static shared_ptr (3e50586) too.
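
The idea behind the fix can be sketched in isolation. The snippet below is only an illustration and not the code in this PR: `FakeDeviceAPI`, `FakeManager`, `Shutdown`, and `RunScenario` are hypothetical names, and teardown is modelled with an explicit `Shutdown()` call rather than real static destruction, so that the behaviour stays well defined. It shows a lookup that re-creates the API object when its slot has been cleared to nullptr, so a late free still has a live object to call.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a device API; not TVM's actual class.
struct FakeDeviceAPI {
  static int constructed;  // number of instances ever created
  static int freed;        // number of buffers released
  FakeDeviceAPI() { ++constructed; }
  void FreeDataSpace(void* /*ptr*/) { ++freed; }
};
int FakeDeviceAPI::constructed = 0;
int FakeDeviceAPI::freed = 0;

// Hypothetical registry modelling the manager's lookup-or-create behaviour.
class FakeManager {
 public:
  // Return the cached API, creating it again if the slot is empty.
  FakeDeviceAPI* GetAPI() {
    if (!api_) api_.reset(new FakeDeviceAPI());
    return api_.get();
  }
  // Models teardown: destroy the API and leave the slot as nullptr,
  // so a later GetAPI() re-creates it instead of using a dead object.
  void Shutdown() { api_.reset(); }

 private:
  std::unique_ptr<FakeDeviceAPI> api_;
};

// Runs the "API torn down before the last buffer is freed" scenario and
// returns how many API instances were constructed in total.
int RunScenario() {
  FakeManager mgr;
  int buffer = 0;                 // stands in for GPU memory still in use
  mgr.GetAPI();                   // first use constructs the API
  mgr.Shutdown();                 // API destroyed while the buffer is alive
  FakeDeviceAPI* api = mgr.GetAPI();  // late lookup re-creates the API
  assert(api != nullptr);
  api->FreeDataSpace(&buffer);    // the late free now has a live object
  return FakeDeviceAPI::constructed;
}
```

Calling `RunScenario()` once constructs the API twice: once on first use and once more for the free that arrives after teardown. The real change relies on the same lookup-or-create path, with the additional step of resetting the static shared_ptr that backs the CUDADeviceAPI singleton.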