szhengac commented on issue #19155: URL: https://github.com/apache/incubator-mxnet/issues/19155#issuecomment-700416738
@ptrendx here is more of the error message:

`[1,3]<stderr>:[ip-172-31-3-104:27783] *** Process received signal ***
[1,3]<stderr>:[ip-172-31-3-104:27783] Signal: Segmentation fault (11)
[1,3]<stderr>:[ip-172-31-3-104:27783] Signal code: Address not mapped (1)
[1,3]<stderr>:[ip-172-31-3-104:27783] Failing at address: 0xfffffffffffffffc
[1,3]<stderr>:[ip-172-31-3-104:27783] [ 0] /lib64/libpthread.so.0(+0x117e0)[0x7f16244fc7e0]
[1,3]<stderr>:[ip-172-31-3-104:27783] [ 1]
[1,0]<stderr>:[ip-172-31-3-104:27780] *** Process received signal ***
[1,0]<stderr>:[ip-172-31-3-104:27780] Signal: Segmentation fault (11)
[1,0]<stderr>:[ip-172-31-3-104:27780] Signal code: Address not mapped (1)
[1,0]<stderr>:[ip-172-31-3-104:27780] Failing at address: 0x10e11781c
[1,0]<stderr>:[ip-172-31-3-104:27780] [ 0]
[1,0]<stderr>:/lib64/libpthread.so.0(+0x117e0)[0x7f10182ee7e0]
[1,0]<stderr>:[ip-172-31-3-104:27780] [ 1]
[1,3]<stderr>:/home/ec2-user/mxnet-private/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage30GPUPooledRoundedStorageManager5AllocEPNS_7Storage6HandleE+0x9d)[ 0x7f15ce34d61d]
[1,3]<stderr>:[ip-172-31-3-104:27783] [ 2]
[1,0]<stderr>:/home/ec2-user/mxnet-private/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7storage30GPUPooledRoundedStorageManager5AllocEPNS_7Storage6HandleE+0x9d)[ 0x7f0fc217f61d]
[1,0]<stderr>:[ip-172-31-3-104:27780] [ 2]
[1,3]<stderr>:/home/ec2-user/mxnet-private/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEPNS_7Storage6HandleE+0x4a)[0x7f15ce34fd9a]
[1,3]<stderr>:[ip-172-31-3-104:27783] [ 3]
[1,0]<stderr>:/home/ec2-user/mxnet-private/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEPNS_7Storage6HandleE+0x4a)[0x7f0fc2181d9a]
[1,0]<stderr>:[ip-172-31-3-104:27780] [ 3]
[1,5]<stderr>:[ip-172-31-3-104:27785] *** Process received signal ***`

Reproducing it takes a few steps. You need to use gluon-nlp 0.9 or 0.10 and prepare a sentencepiece vocab of size 52000.
Then, add `num_layers=3` to https://github.com/dmlc/gluon-nlp/blob/3fbe9619d9e68bc665f73c8cdf683213c6edd4d6/scripts/bert/pretraining_utils.py#L72, and run a distributed training job using Horovod on a single node.
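
Because the output of several ranks is interleaved in the log above, it may be easier to read once grouped per rank. The following is a minimal sketch, not part of the original report: it assumes every fragment is prefixed with a `[<job>,<rank>]<stderr>:` tag as in the log, and the helper name `split_by_rank` is hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed tag format, taken from the log above: "[<job>,<rank>]<stderr>:".
TAG = re.compile(r"\[(\d+),(\d+)\]<(?:stdout|stderr)>:")


def split_by_rank(log: str) -> Dict[int, List[str]]:
    """Group tagged log fragments by rank, preserving their original order."""
    by_rank = defaultdict(list)
    matches = list(TAG.finditer(log))
    # Each fragment runs from the end of its tag to the start of the next tag.
    for cur, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt is not None else len(log)
        fragment = log[cur.end():end].strip()
        if fragment:
            by_rank[int(cur.group(2))].append(fragment)
    return dict(by_rank)


# A short excerpt of the interleaved log, used only as sample input.
sample = (
    "[1,3]<stderr>:[ip-172-31-3-104:27783] Signal: Segmentation fault (11) "
    "[1,0]<stderr>:[ip-172-31-3-104:27780] Signal: Segmentation fault (11) "
    "[1,3]<stderr>:[ip-172-31-3-104:27783] Failing at address: 0xfffffffffffffffc "
    "[1,0]<stderr>:[ip-172-31-3-104:27780] Failing at address: 0x10e11781c"
)

ranks = split_by_rank(sample)
for rank in sorted(ranks):
    print("rank", rank)
    for fragment in ranks[rank]:
        print("   ", fragment)
```

Grouped this way, a frame number such as `[ 1]` and the `libmxnet.so` symbol that follows it end up next to each other for a given rank, even though other ranks wrote in between.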
