BlakeLazarine opened a new issue, #21118: URL: https://github.com/apache/incubator-mxnet/issues/21118
## Description

When running distributed training (multi-instance, with a single GPU per instance) with sparse gradients (produced by negative sampling), MXNet crashes. I have implementations that use either a synchronous or an asynchronous parameter server, or Horovod, for distributed training. All of these implementations train successfully on datasets that do not produce sparse gradients, and the Horovod implementation trains successfully when there is only one instance. The async PS case produces the most descriptive error, but I am still unable to make use of it.

### Error Message

Error message from the async PS implementation:

```
mxnet.base.MXNetError: Traceback (most recent call last):
  [bt] (8) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fa4c8b95133]
  [bt] (7) /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fa4c8a5b609]
  [bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd7172) [0x7fa447ade172]
  [bt] (5) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x3b) [0x7fa4813459fb]
  [bt] (4) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x104) [0x7fa481348624]
  [bt] (3) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x48b) [0x7fa48134694b]
  [bt] (2) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(mxnet::kvstore::KVStoreDist::PullDefault(int, mxnet::NDArray const&, int)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x5c) [0x7fa48151382c]
  [bt] (1) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(mxnet::kvstore::KVStoreDist::EncodeDefaultKey(int, unsigned long, int)+0x159) [0x7fa4814e86a9]
  [bt] (0) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fa4812077ef]
  File "../src/kvstore/./kvstore_dist.h", line 627
```

```
MXNetError: Check failed: static_cast<size_t>(pskv.size) == pskv_size (172770864 vs. 447000596) : The value size cannot be changed 447000596.
Key is 3
```
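For illustration only (this is a guess at the mechanism on my part, not a confirmed diagnosis): the payload of a `row_sparse` gradient depends on how many unique rows a batch touches, so its stored size naturally varies from step to step, which may be related to the "value size cannot be changed" check above. A tiny standalone snippet showing the varying payload size:

```python
import mxnet as mx

# Illustration only: two row_sparse "gradients" for the same (vocab, dim) parameter,
# touching a different number of unique rows, carry payloads of different sizes.
vocab, dim = 1000, 16
g1 = mx.nd.sparse.row_sparse_array(
    (mx.nd.ones((3, dim)), mx.nd.array([1, 7, 42], dtype='int64')),
    shape=(vocab, dim))
g2 = mx.nd.sparse.row_sparse_array(
    (mx.nd.ones((5, dim)), mx.nd.array([1, 7, 42, 99, 500], dtype='int64')),
    shape=(vocab, dim))
print(g1.data.shape, g2.data.shape)  # (3, 16) vs. (5, 16): different payload sizes
```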
Error message from the Horovod implementation:

```
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:terminate called after throwing an instance of 'std::logic_error'
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:  what():  cudaEventSynchronize failed: an illegal memory access was encountered
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] *** Process received signal ***
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] Signal: Aborted (6)
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] Signal code: (-6)
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f2add200090]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f2add20000b]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f2add1df859]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f2a5c1ad911]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f2a5c1b938c]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7f2a5c1b93f7]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7f2a5c1b96a9]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 7] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common10GPUContext4impl13WaitForEventsERSt5queueISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_5EventEESt5dequeISC_SaISC_EEERKSt6vectorINS0_16TensorTableEntryESaISJ_EERNS0_8TimelineERKSt8functionIFvvEE+0x8a1)[0x7f29f2f94b61]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 8] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x1317a7)[0x7f29f2f957a7]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 9] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common10ThreadPool4loopEv+0x170)[0x7f29f2f52250]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f2a5c1e5de4]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [11] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f2add1a2609]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f2add2dc133]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] *** End of error message ***
2022-08-08T18:52:00.918-07:00 --------------------------------------------------------------------------
2022-08-08T18:52:00.918-07:00 Primary job terminated normally, but 1 process returned
2022-08-08T18:52:00.918-07:00 a non-zero exit code. Per user-direction, the job has been aborted.
2022-08-08T18:52:00.918-07:00 --------------------------------------------------------------------------
2022-08-08T18:52:01.918-07:00 [1,0]<stderr>:/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
2022-08-08T18:52:01.919-07:00 [1,0]<stderr>:  warnings.warn('resource_tracker: There appear to be %d '
2022-08-08T18:52:02.919-07:00 --------------------------------------------------------------------------
2022-08-08T18:52:02.919-07:00 mpirun.real noticed that process rank 1 with PID 43 on node algo-2 exited on signal 6 (Aborted).
```

The sync PS implementation does not crash, but the loss is very high.

## To Reproduce

PS async:

```python
trainer = gluon.Trainer(model.collect_params(), optimizer, optimizer_params,
                        kvstore="dist_async", update_on_kvstore=True)
```

Horovod (the only difference is in how the trainer is created):

```python
opt = mx.optimizer.create(optimizer, **optimizer_params)
hvd.init()
assert hvd.mpi_threads_supported()
from mpi4py import MPI
comm = MPI.COMM_WORLD
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)
trainer = hvd.DistributedTrainer(params, opt)
```

The stepping is done as:

```python
grads = [i.grad(ctx) for i in model.collect_params().values() if i.grad_req != 'null']
trainer.step(batch_size, ignore_stale_grad=True)
model.collect_params().zero_grad()  # for dangling nodes
```

The sparse nature of the data is introduced by the nn.Embedding block and flows through into the loss function. I am withholding parts of the code for data-security reasons; a minimal placeholder sketch of the setup is included at the end of this report.

### Steps to reproduce

1. Use AWS SageMaker to run multi-instance training, launched with the toolkit at https://github.com/aws/sagemaker-mxnet-training-toolkit/tree/2f26babd9ba72f48d2336f7817d8255b6b2a2adc/src/sagemaker_mxnet_container
2. Use a dataset with sparse gradients (very large vocabulary size). Note that training on this dataset works fine when only a single instance is used.

## What have you tried to solve it?

1. Changed MXNet versions (downgraded to 1.8).
2. Tried three different approaches to distributed training.
3. Used logging to identify the breaking point.
4. Attempted to understand the back-end implementation of kvstore_dist.h, but was unable to make sense of the line referenced by the error message.

## Environment

MXNet 1.9.0, Horovod 0.19.0, OpenMPI 4.0.1, CUDA 11.2.2

Python packages:

```
protobuf==3.20.1 \
h5py==2.10.0 \
onnx==1.8.1 \
"numpy<1.20" \
pandas==1.3.0 \
"Pillow>=9.0,<10.0" \
"requests<3" \
scikit-learn \
# disabling DGL until a release is built for MXNet 1.9 and CUDA 11.2
# dgl-cu112 \
scipy==1.7.0 \
gluonnlp==0.10.0 \
gluoncv==0.8.0 \
# Putting a cap in versions number to avoid potential issues with a new major version
"urllib3<2" \
# python-dateutil==2.8.0 to satisfy botocore associated with latest awscli
python-dateutil==2.8.0 \
tqdm==4.39.0 \
# install PyYAML>=5.4,<5.5 to avoid conflict with latest awscli
"PyYAML>=5.4,<5.5" \
mpi4py==3.0.3 \
${MX_URL} \
awscli \
s3fs==0.4.2 \
opencv-python
```

Instance type: AWS ml.g4dn.xlarge
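### Minimal sketch of the setup (placeholder code)

Since parts of the actual training code are withheld, the following is a rough, hypothetical sketch of the kind of setup described above. The model, vocabulary size, optimizer settings, data, and loss are placeholders rather than the real code, and running it with `kvstore='dist_async'` assumes the usual distributed launcher environment (here, the SageMaker toolkit linked above) and a GPU per worker.

```python
import mxnet as mx
from mxnet import gluon, autograd

# Placeholder sizes, not the real ones.
vocab_size, embed_dim, batch_size = 1_000_000, 128, 256
ctx = mx.gpu(0)

# Stand-in for the withheld model: an embedding with sparse (row_sparse) gradients.
net = gluon.nn.Embedding(vocab_size, embed_dim, sparse_grad=True)
net.initialize(mx.init.Uniform(0.1), ctx=ctx)

# dist_async PS trainer, as in the report; requires the distributed launcher.
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-3},
                        kvstore='dist_async', update_on_kvstore=True)

loss_fn = gluon.loss.SigmoidBinaryCrossEntropyLoss()

for step in range(10):
    # Random ids/labels stand in for the real (withheld) negative-sampling pipeline.
    ids = mx.nd.random.randint(0, vocab_size, shape=(batch_size, 2), ctx=ctx).astype('float32')
    labels = mx.nd.random.randint(0, 2, shape=(batch_size,), ctx=ctx).astype('float32')
    with autograd.record():
        emb = net(ids)                                       # (batch, 2, embed_dim)
        scores = (emb[:, 0, :] * emb[:, 1, :]).sum(axis=1)   # dot-product score
        loss = loss_fn(scores, labels)
    loss.backward()
    trainer.step(batch_size, ignore_stale_grad=True)
    net.collect_params().zero_grad()
```

The ingredients that seem relevant to the crash are `sparse_grad=True` on the embedding together with `update_on_kvstore=True` and a `dist_*` kvstore.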