Lizonghang opened a new issue #16186: Segmentation fault when calling "auto &updates = update_buf_[key];"
URL: https://github.com/apache/incubator-mxnet/issues/16186
 
 
   ## Description
   Hello, I split the standard `DataHandleDefault` interface in 
`kvstore_dist_server.h` into `DataHandleSyncDefault` and 
`DataHandleAsyncDefault` to support synchronous and asynchronous mode, 
respectively. `DataHandleSyncDefault` works fine, but `DataHandleAsyncDefault` 
crashes with a segmentation fault at `auto &updates = update_buf_[key];`. What 
could cause this fault, and how can I fix it? Thanks!
   
   ## Environment info (Required)
   
   The code of the `DataHandleAsyncDefault` interface follows; most of it is 
the same as the original `DataHandleDefault` interface:
   
   ```
   void DataHandleAsyncDefault(const DataHandleType type, const ps::KVMeta& req_meta,
                               const ps::KVPairs<char> &req_data,
                               ps::KVServer<char>* server) {
       // do some check
       CHECK_EQ(req_data.keys.size(), (size_t)1);
       if (req_meta.push) {
         CHECK_EQ(req_data.lens.size(), (size_t)1);
         CHECK_EQ(req_data.vals.size(), (size_t)req_data.lens[0]);
       }
       CHECK(ps::IsGlobalServer() && !sync_global_mode_);
       int key = DecodeKey(req_data.keys[0]);
       auto& stored = has_multi_precision_copy(type) ? store_realt_[key] : store_[key];
       if (req_meta.push) {
         size_t ds[] = {(size_t) req_data.lens[0] / mshadow::mshadow_sizeof(type.dtype)};
         TShape dshape(ds, ds + 1);
         TBlob recv_blob;
         MSHADOW_REAL_TYPE_SWITCH(type.dtype, DType, {
           recv_blob = TBlob(reinterpret_cast<DType*>(req_data.vals.data()), dshape, cpu::kDevMask);
         });
         NDArray recved = NDArray(recv_blob, 0);
         if (stored.is_none()) {
           // initialization by master worker
           stored = NDArray(dshape, Context(), false,
                            has_multi_precision_copy(type) ? mshadow::kFloat32 : type.dtype);
           CopyFromTo(recved, &stored, 0);
           if (has_multi_precision_copy(type)) {
             auto &stored_dtype = store_[key];
             stored_dtype = NDArray(dshape, Context(), false, type.dtype);
             CopyFromTo(stored, stored_dtype);
             stored_dtype.WaitToRead();
           }
           stored.WaitToRead();
           server->Response(req_meta);
           auto len = stored.shape().Size() * mshadow::mshadow_sizeof(stored.dtype());
           ps::KVPairs<char>broadcast_data;
           broadcast_data.keys.push_back(req_data.keys[0]);
           broadcast_data.lens = {len};
           broadcast_data.vals.CopyFrom(static_cast<const char*>(stored.data().dptr_), len);
           server->Broadcast(broadcast_data);
         } else {
           auto &updates = update_buf_[key];
           if (has_multi_precision_copy(type) && updates.temp_array.is_none()) {
             updates.temp_array = NDArray(dshape, Context(), false, mshadow::kFloat32);
           }
           if (updates.request.empty()) {
             if (has_multi_precision_copy(type)) {
               CopyFromTo(recved, updates.temp_array);
             } else {
               updates.temp_array = recved;
             }
           updates.request.push_back(req_meta);
           ApplyUpdates(type, key, req_data, &updates, server);
         }
       }
     }
   }
   
   inline void ApplyUpdates(const DataHandleType type, const int key,
                            const ps::KVPairs<char>& req_data, UpdateBuf *update_buf,
                            ps::KVServer<char>* server) {
       auto& stored = has_multi_precision_copy(type) ? store_realt_[key] : store_[key];
       auto& update = update_buf->temp_array;
       exec_.Exec([this, key, &update, &stored](){
         CHECK(updater_);
         updater_(key, update, &stored);
       });
       for (const auto& req : update_buf->request) {
         DefaultStorageResponse(type, key, req, req_data, server, true);
       }
       update_buf->request.clear();
       if (has_multi_precision_copy(type)) CopyFromTo(stored, store_[key]);
       stored.WaitToRead();
   }
   ```
   
   Initialization works fine; the segmentation fault occurs in the push stage 
when calling `auto &updates = update_buf_[key];`. Strangely, the fault usually 
occurs in the 2nd or 3rd training round (after `DataHandleAsyncDefault` has 
been called about 60 times), and sometimes it does not occur at all.
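
One cause that would be consistent with these intermittent symptoms, though it is not confirmed here, is a data race: `std::unordered_map::operator[]` can insert an element and rehash the table, so calling it from more than one thread without synchronization is undefined behavior. The following is a minimal sketch, not MXNet code, of serializing access to such a map with a mutex. `UpdateBuf` is a simplified stand-in for the real struct, and `GuardedUpdateBufs`, `WithEntry`, and `SimulateConcurrentPushes` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for KVStoreDistServer::UpdateBuf.
struct UpdateBuf {
  std::vector<int> request;
};

// Hypothetical wrapper: every access to the map goes through one mutex.
class GuardedUpdateBufs {
 public:
  // Runs fn(UpdateBuf&) for `key` while holding the lock.
  // operator[] creates the entry if it does not exist yet.
  template <typename Fn>
  void WithEntry(int key, Fn fn) {
    std::lock_guard<std::mutex> lock(mu_);
    fn(bufs_[key]);
  }

 private:
  std::mutex mu_;
  std::unordered_map<int, UpdateBuf> bufs_;
};

// Simulates several receiving threads pushing requests for overlapping keys
// and returns the total number of requests recorded across all keys.
std::size_t SimulateConcurrentPushes(int num_threads, int pushes_per_thread,
                                     int num_keys) {
  GuardedUpdateBufs bufs;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&bufs, t, pushes_per_thread, num_keys]() {
      for (int i = 0; i < pushes_per_thread; ++i) {
        bufs.WithEntry(i % num_keys,
                       [t](UpdateBuf& buf) { buf.request.push_back(t); });
      }
    });
  }
  for (auto& w : workers) w.join();

  std::size_t total = 0;
  for (int k = 0; k < num_keys; ++k) {
    bufs.WithEntry(k, [&total](UpdateBuf& buf) { total += buf.request.size(); });
  }
  return total;
}
```

If the handler really is entered from more than one thread, removing the lock in a sketch like this and running it under ThreadSanitizer should report the race; if the handler is only ever called from a single thread, this hypothesis does not apply.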
   
   ## Error Message:
   I ran the server process under gdb (in a Docker container with 24GB memory, 
12 CPUs, 6GB swap, and unlimited shm size). gdb reported the following:
   
   ```
   (gdb) ......
   Thread 70 "python" received signal SIGSEGV, Segmentation fault.
   [Switching to Thread 0x7fff78911700 (LWP 22644)]
   0x00007fffd7cddb2f in std::__detail::_Map_base<int, std::pair<int const, 
mxnet::kvstore::KVStoreDistServer::UpdateBuf>, std::allocator<std::pair<int 
const, mxnet::kvstore::KVStoreDistServer::UpdateBuf> >, 
std::__detail::_Select1st, std::equal_to<int>, std::hash<int>, 
std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, 
std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, 
false, true>, true>::operator[](int const&) () from /root/HiPS/lib/libmxnet.so
   (gdb) bt
   #0  0x00007fffd7cddb2f in std::__detail::_Map_base<int, std::pair<int const, 
mxnet::kvstore::KVStoreDistServer::UpdateBuf>, std::allocator<std::pair<int 
const, mxnet::kvstore::KVStoreDistServer::UpdateBuf> >, 
std::__detail::_Select1st, std::equal_to<int>, std::hash<int>, 
std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, 
std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, 
false, true>, true>::operator[](int const&) () from /root/HiPS/lib/libmxnet.so
   #1  0x00007fffd7d14060 in 
mxnet::kvstore::KVStoreDistServer::DataHandleAsyncDefault(mxnet::kvstore::DataHandleType,
 ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) () from 
/root/HiPS/lib/libmxnet.so
   #2  0x00007fffd7d16d31 in 
mxnet::kvstore::KVStoreDistServer::DataHandleEx(ps::KVMeta const&, 
ps::KVPairs<char> const&, ps::KVServer<char>*) () from 
/root/HiPS/lib/libmxnet.so
   #3  0x00007fffd7d1103a in ps::KVServer<char>::Process(ps::Message const&) () 
from /root/HiPS/lib/libmxnet.so
   #4  0x00007fffd7d53cfc in std::function<void (ps::Message 
const&)>::operator()(ps::Message const&) const (__args#0=...,
       this=0x555556a462b8) at /usr/include/c++/5/functional:2267
   #5  ps::Customer::Receiving (this=0x555556a462b0) at src/customer.cc:62
   #6  0x00007fffa6d6f421 in std::execute_native_thread_routine_compat 
(__p=<optimized out>)
       at 
/home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:94
   #7  0x00007ffff7bc16ba in start_thread (arg=0x7fff78911700) at 
pthread_create.c:333
   #8  0x00007ffff78f741d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:109
   Warning: the current language does not match this frame.
   ```
   
   ## What have you tried to solve it?
   
   1. I tried discarding `update_buf_` and applying `recved` directly to update 
`store_`, but a similar fault occurred when calling `auto& stored = 
has_multi_precision_copy(type) ? store_realt_[key] : store_[key];`.
   2. I suspected Docker might be causing the faults and tried physical 
machines, but the faults still occur.
   3. Sometimes other faults occur as well, which makes me suspect that 
something is wrong with memory allocation. I have spent 5 days looking for the 
cause without success. The other crash sites were:
   
   a) `updates.request.push_back(req_meta);`
   ```
   Thread 70 "python" received signal SIGSEGV, Segmentation fault.
   [Switching to Thread 0x7fff7390f700 (LWP 23473)]
   0x00007fffd7cd34e2 in std::vector<ps::KVMeta, std::allocator<ps::KVMeta> 
>::push_back(ps::KVMeta const&) ()
      from /root/HiPS/lib/libmxnet.so
   (gdb) bt
   #0  0x00007fffd7cd34e2 in std::vector<ps::KVMeta, std::allocator<ps::KVMeta> 
>::push_back(ps::KVMeta const&) ()
      from /root/HiPS/lib/libmxnet.so
   #1  0x00007fffd7d14d30 in 
mxnet::kvstore::KVStoreDistServer::DataHandleAsyncDefault(mxnet::kvstore::DataHandleType,
 ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) () from 
/root/HiPS/lib/libmxnet.so
   #2  0x00007fffd7d16d31 in 
mxnet::kvstore::KVStoreDistServer::DataHandleEx(ps::KVMeta const&, 
ps::KVPairs<char> const&, ps::KVServer<char>*) () from 
/root/HiPS/lib/libmxnet.so
   #3  0x00007fffd7d1103a in ps::KVServer<char>::Process(ps::Message const&) () 
from /root/HiPS/lib/libmxnet.so
   #4  0x00007fffd7d53cfc in std::function<void (ps::Message 
const&)>::operator()(ps::Message const&) const (__args#0=...,
       this=0x555556b63e38) at /usr/include/c++/5/functional:2267
   #5  ps::Customer::Receiving (this=0x555556b63e30) at src/customer.cc:62
   #6  0x00007fffa6d6f421 in std::execute_native_thread_routine_compat 
(__p=<optimized out>)
       at 
/home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:94
   #7  0x00007ffff7bc16ba in start_thread (arg=0x7fff7390f700) at 
pthread_create.c:333
   #8  0x00007ffff78f741d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:109
   ```
   
   b) `updates.temp_array = recved;`
   ```
   Thread 70 "python" received signal SIGSEGV, Segmentation fault.
   [Switching to Thread 0x7fff7390b700 (LWP 23591)]
   0x00007fffd4a301d3 in 
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from 
/root/HiPS/lib/libmxnet.so
   (gdb) bt
   #0  0x00007fffd4a301d3 in 
std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from 
/root/HiPS/lib/libmxnet.so
   #1  0x00007fffd747d6e5 in mxnet::NDArray::operator=(mxnet::NDArray const&) 
() from /root/HiPS/lib/libmxnet.so
   #2  0x00007fffd7d14e68 in 
mxnet::kvstore::KVStoreDistServer::DataHandleAsyncDefault(mxnet::kvstore::DataHandleType,
 ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) () from 
/root/HiPS/lib/libmxnet.so
   #3  0x00007fffd7d16d31 in 
mxnet::kvstore::KVStoreDistServer::DataHandleEx(ps::KVMeta const&, 
ps::KVPairs<char> const&, ps::KVServer<char>*) () from 
/root/HiPS/lib/libmxnet.so
   #4  0x00007fffd7d1103a in ps::KVServer<char>::Process(ps::Message const&) () 
from /root/HiPS/lib/libmxnet.so
   #5  0x00007fffd7d53cfc in std::function<void (ps::Message 
const&)>::operator()(ps::Message const&) const (__args#0=...,
       this=0x555556a462b8) at /usr/include/c++/5/functional:2267
   #6  ps::Customer::Receiving (this=0x555556a462b0) at src/customer.cc:62
   #7  0x00007fffa6d6f421 in std::execute_native_thread_routine_compat 
(__p=<optimized out>)
       at 
/home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:94
   #8  0x00007ffff7bc16ba in start_thread (arg=0x7fff7390b700) at 
pthread_create.c:333
   #9  0x00007ffff78f741d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:109
   ```
   
   c) `DefaultStorageResponse(type, key, req, req_data, server, true);`
   ```
   Thread 70 "python" received signal SIGSEGV, Segmentation fault.
   [Switching to Thread 0x7fff78911700 (LWP 24459)]
   __memcpy_avx_unaligned () at 
../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:238
   238  ../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S: No such file or 
directory.
   (gdb) bt
   #0  __memcpy_avx_unaligned () at 
../sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S:238
   #1  0x00007fffd7cf94f1 in 
mxnet::kvstore::KVStoreDistServer::DefaultStorageResponse(mxnet::kvstore::DataHandleType,
 int, ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*, bool) 
() from /root/HiPS/lib/libmxnet.so
   #2  0x00007fffd7cf97e7 in 
mxnet::kvstore::KVStoreDistServer::ApplyUpdates(mxnet::kvstore::DataHandleType, 
int, ps::KVPairs<char> const&, mxnet::kvstore::KVStoreDistServer::UpdateBuf*, 
ps::KVServer<char>*) () from /root/HiPS/lib/libmxnet.so
   #3  0x00007fffd7d14d9f in 
mxnet::kvstore::KVStoreDistServer::DataHandleAsyncDefault(mxnet::kvstore::DataHandleType,
 ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) () from 
/root/HiPS/lib/libmxnet.so
   #4  0x00007fffd7d16d31 in 
mxnet::kvstore::KVStoreDistServer::DataHandleEx(ps::KVMeta const&, 
ps::KVPairs<char> const&, ps::KVServer<char>*) () from 
/root/HiPS/lib/libmxnet.so
   #5  0x00007fffd7d1103a in ps::KVServer<char>::Process(ps::Message const&) () 
from /root/HiPS/lib/libmxnet.so
   #6  0x00007fffd7d53cfc in std::function<void (ps::Message 
const&)>::operator()(ps::Message const&) const (__args#0=...,
       this=0x555556a462b8) at /usr/include/c++/5/functional:2267
   #7  ps::Customer::Receiving (this=0x555556a462b0) at src/customer.cc:62
   #8  0x00007fffa6d6f421 in std::execute_native_thread_routine_compat 
(__p=<optimized out>)
       at 
/home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/src/c++11/thread.cc:94
   #9  0x00007ffff7bc16ba in start_thread (arg=0x7fff78911700) at 
pthread_create.c:333
   #10 0x00007ffff78f741d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:109
   ```
   
   I am stuck at this point and would greatly appreciate any help.

