BlakeLazarine opened a new issue, #21118: URL: https://github.com/apache/incubator-mxnet/issues/21118
## Description

When running distributed training (multi-instance, with a single GPU per instance) with sparse gradients (produced by negative sampling), MXNet crashes. I have implementations that use either a synchronous or an asynchronous parameter server, or Horovod, for distributed training. All of these implementations train successfully on datasets that do not produce sparse gradients, and the Horovod implementation trains successfully when there is only one instance. The async PS case produces the most descriptive error, but I am still unable to make use of it.

### Error Message

Error message from the async PS implementation:

```
mxnet.base.MXNetError: Traceback (most recent call last):
  [bt] (8) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fa4c8b95133]
  [bt] (7) /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fa4c8a5b609]
  [bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd7172) [0x7fa447ade172]
  [bt] (5) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x3b) [0x7fa4813459fb]
  [bt] (4) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x104) [0x7fa481348624]
  [bt] (3) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x48b) [0x7fa48134694b]
  [bt] (2) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(mxnet::kvstore::KVStoreDist::PullDefault(int, mxnet::NDArray const&, int)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const+0x5c) [0x7fa48151382c]
  [bt] (1) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(mxnet::kvstore::KVStoreDist::EncodeDefaultKey(int, unsigned long, int)+0x159) [0x7fa4814e86a9]
  [bt] (0) /usr/local/lib/python3.8/dist-packages/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7fa4812077ef]
  File "../src/kvstore/./kvstore_dist.h", line 627
```

```
MXNetError: Check failed: static_cast<size_t>(pskv.size) == pskv_size (172770864 vs. 447000596) : The value size cannot be changed 447000596.
Key is 3
```
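For illustration only (this is a guess at the mechanism on my part, not a confirmed diagnosis): the payload of a `row_sparse` gradient depends on how many unique rows a batch touches, so its stored size naturally varies from step to step, which may be related to the "value size cannot be changed" check above. A tiny standalone snippet showing the varying payload size:

```python
import mxnet as mx

# Illustration only: two row_sparse "gradients" for the same (vocab, dim) parameter,
# touching a different number of unique rows, carry payloads of different sizes.
vocab, dim = 1000, 16
g1 = mx.nd.sparse.row_sparse_array(
    (mx.nd.ones((3, dim)), mx.nd.array([1, 7, 42], dtype='int64')),
    shape=(vocab, dim))
g2 = mx.nd.sparse.row_sparse_array(
    (mx.nd.ones((5, dim)), mx.nd.array([1, 7, 42, 99, 500], dtype='int64')),
    shape=(vocab, dim))
print(g1.data.shape, g2.data.shape)  # (3, 16) vs. (5, 16): different payload sizes
```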
Error message from the Horovod implementation:

```
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:terminate called after throwing an instance of 'std::logic_error'
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:  what():  cudaEventSynchronize failed: an illegal memory access was encountered
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] *** Process received signal ***
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] Signal: Aborted (6)
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] Signal code: (-6)
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f2add200090]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f2add20000b]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f2add1df859]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7f2a5c1ad911]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7f2a5c1b938c]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7f2a5c1b93f7]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7f2a5c1b96a9]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 7] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common10GPUContext4impl13WaitForEventsERSt5queueISt4pairINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_5EventEESt5dequeISC_SaISC_EEERKSt6vectorINS0_16TensorTableEntryESaISJ_EERNS0_8TimelineERKSt8functionIFvvEE+0x8a1)[0x7f29f2f94b61]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 8] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x1317a7)[0x7f29f2f957a7]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [ 9] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common10ThreadPool4loopEv+0x170)[0x7f29f2f52250]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f2a5c1e5de4]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [11] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f2add1a2609]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] [12] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f2add2dc133]
2022-08-08T18:52:00.918-07:00 [1,0]<stderr>:[algo-1:00036] *** End of error message ***
2022-08-08T18:52:00.918-07:00 --------------------------------------------------------------------------
2022-08-08T18:52:00.918-07:00 Primary job terminated normally, but 1 process returned
2022-08-08T18:52:00.918-07:00 a non-zero exit code. Per user-direction, the job has been aborted.
2022-08-08T18:52:00.918-07:00 --------------------------------------------------------------------------
2022-08-08T18:52:01.918-07:00 [1,0]<stderr>:/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown
2022-08-08T18:52:01.919-07:00 [1,0]<stderr>:  warnings.warn('resource_tracker: There appear to be %d '
2022-08-08T18:52:02.919-07:00 --------------------------------------------------------------------------
2022-08-08T18:52:02.919-07:00 mpirun.real noticed that process rank 1 with PID 43 on node algo-2 exited on signal 6 (Aborted).
```

The sync PS implementation does not crash, but the loss is very high.

## To Reproduce

PS async:

```python
trainer = gluon.Trainer(model.collect_params(), optimizer, optimizer_params,
                        kvstore="dist_async", update_on_kvstore=True)
```

Horovod (the only difference is in how the trainer is created):

```python
opt = mx.optimizer.create(optimizer, **optimizer_params)
hvd.init()
assert hvd.mpi_threads_supported()
from mpi4py import MPI
comm = MPI.COMM_WORLD
params = model.collect_params()
if params is not None:
    hvd.broadcast_parameters(params, root_rank=0)
trainer = hvd.DistributedTrainer(params, opt)
```

The stepping is done as:

```python
grads = [i.grad(ctx) for i in model.collect_params().values() if i.grad_req != 'null']
trainer.step(batch_size, ignore_stale_grad=True)
model.collect_params().zero_grad()  # for dangling nodes
```

The sparse nature of the data is introduced by the nn.Embedding block and flows through into the loss function. I am withholding parts of the code for data-security reasons; a minimal placeholder sketch of the setup is included at the end of this report.

### Steps to reproduce

1. Use AWS SageMaker to run multi-instance training, launched with the toolkit at https://github.com/aws/sagemaker-mxnet-training-toolkit/tree/2f26babd9ba72f48d2336f7817d8255b6b2a2adc/src/sagemaker_mxnet_container
2. Use a dataset with sparse gradients (very large vocabulary size). Note that training on this dataset works fine when only a single instance is used.

## What have you tried to solve it?

1. Changed MXNet versions (downgraded to 1.8).
2. Tried three different approaches to distributed training.
3. Used logging to identify the breaking point.
4. Attempted to understand the back-end implementation of kvstore_dist.h, but was unable to make sense of the line referenced by the error message.

## Environment

MXNet 1.9.0, Horovod 0.19.0, OpenMPI 4.0.1, CUDA 11.2.2

Python packages:

```
protobuf==3.20.1 \
h5py==2.10.0 \
onnx==1.8.1 \
"numpy<1.20" \
pandas==1.3.0 \
"Pillow>=9.0,<10.0" \
"requests<3" \
scikit-learn \
# disabling DGL until a release is built for MXNet 1.9 and CUDA 11.2
# dgl-cu112 \
scipy==1.7.0 \
gluonnlp==0.10.0 \
gluoncv==0.8.0 \
# Putting a cap in versions number to avoid potential issues with a new major version
"urllib3<2" \
# python-dateutil==2.8.0 to satisfy botocore associated with latest awscli
python-dateutil==2.8.0 \
tqdm==4.39.0 \
# install PyYAML>=5.4,<5.5 to avoid conflict with latest awscli
"PyYAML>=5.4,<5.5" \
mpi4py==3.0.3 \
${MX_URL} \
awscli \
s3fs==0.4.2 \
opencv-python
```

Instance type: AWS ml.g4dn.xlarge
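### Minimal sketch of the setup (placeholder code)

Since parts of the actual training code are withheld, the following is a rough, hypothetical sketch of the kind of setup described above. The model, vocabulary size, optimizer settings, data, and loss are placeholders rather than the real code, and running it with `kvstore='dist_async'` assumes the usual distributed launcher environment (here, the SageMaker toolkit linked above) and a GPU per worker.

```python
import mxnet as mx
from mxnet import gluon, autograd

# Placeholder sizes, not the real ones.
vocab_size, embed_dim, batch_size = 1_000_000, 128, 256
ctx = mx.gpu(0)

# Stand-in for the withheld model: an embedding with sparse (row_sparse) gradients.
net = gluon.nn.Embedding(vocab_size, embed_dim, sparse_grad=True)
net.initialize(mx.init.Uniform(0.1), ctx=ctx)

# dist_async PS trainer, as in the report; requires the distributed launcher.
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-3},
                        kvstore='dist_async', update_on_kvstore=True)

loss_fn = gluon.loss.SigmoidBinaryCrossEntropyLoss()

for step in range(10):
    # Random ids/labels stand in for the real (withheld) negative-sampling pipeline.
    ids = mx.nd.random.randint(0, vocab_size, shape=(batch_size, 2), ctx=ctx).astype('float32')
    labels = mx.nd.random.randint(0, 2, shape=(batch_size,), ctx=ctx).astype('float32')
    with autograd.record():
        emb = net(ids)                                       # (batch, 2, embed_dim)
        scores = (emb[:, 0, :] * emb[:, 1, :]).sum(axis=1)   # dot-product score
        loss = loss_fn(scores, labels)
    loss.backward()
    trainer.step(batch_size, ignore_stale_grad=True)
    net.collect_params().zero_grad()
```

The ingredients that seem relevant to the crash are `sparse_grad=True` on the embedding together with `update_on_kvstore=True` and a `dist_*` kvstore.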