[ https://issues.apache.org/jira/browse/MXNET-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645504#comment-16645504 ]
Lin Yuan commented on MXNET-1027:
---------------------------------
Hi Carl,
Can you also post the command or script you used, so we can reproduce this error? Thanks,
Lin
> Horovod Random Segfault during Training
> ---------------------------------------
>
> Key: MXNET-1027
> URL: https://issues.apache.org/jira/browse/MXNET-1027
> Project: Apache MXNet
> Issue Type: Bug
> Components: Horovod
> Reporter: Carl Yang
> Priority: Minor
>
> setup: 8 GPUs on p3.16xlarge
> commit: most-likely Horovod branch: (0a0240113fe5a24ec2c772fd7309840ba179562a)
> nohup: ignoring input and appending output to 'nohup.out'
> INFO:root:start with arguments Namespace(batch_size=128, benchmark=0,
> brightness=0.4, contrast=0.4, data_nthreads=4,
> data_train='/media/ramdisk/train-passthrough.rec',
> data_train_idx='/media/ramdisk/train-passthrough.idx',
> data_val='/media/ramdisk/val-passthrough.rec',
> data_val_idx='/media/ramdisk/val-passthrough.idx', disp_batches=20,
> dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0',
> image_shape='3,224,224', initializer='default', kv_store='None',
> load_epoch=None, loss='', lr=0.8, lr_factor=0.1, lr_step_epochs='30,60,80',
> macrobatch_size=0, max_random_area=1,
> max_random_aspect_ratio=1.3333333333333333, max_random_h=0, max_random_l=0,
> max_random_rotate_angle=0, max_random_s=0, max_random_scale=1,
> max_random_shear_ratio=0, min_random_area=0.08, min_random_aspect_ratio=0.75,
> min_random_scale=1, model_prefix=None, mom=0.9, monitor=0,
> network='resnet-v1', num_classes=1000, num_epochs=90, num_examples=1281167,
> num_layers=50, optimizer='sgd', pad_size=0, pca_noise=0.1, random_crop=0,
> random_mirror=0, random_resized_crop=1, rgb_mean='123.68,116.779,103.939',
> saturation=0.4, save_period=1, test_io=0, top_k=0, warmup_epochs=10,
> warmup_strategy='linear', wd=0.0001)
> …
> INFO:root:Epoch[67] Batch [1140-1160] Speed: 334.12 samples/sec accuracy=0.710156
> INFO:root:Epoch[67] Batch [1140-1160] Speed: 335.77 samples/sec accuracy=0.719922
> INFO:root:Epoch[67] Batch [1140-1160] Speed: 334.73 samples/sec accuracy=0.714063
> INFO:root:Epoch[67] Batch [1140-1160] Speed: 334.85 samples/sec accuracy=0.721875
> INFO:root:Epoch[67] Batch [1140-1160] Speed: 334.34 samples/sec accuracy=0.711719
> INFO:root:Epoch[67] Batch [1140-1160] Speed: 333.82 samples/sec accuracy=0.714844
> INFO:root:Epoch[67] Batch [1160-1180] Speed: 329.31 samples/sec accuracy=0.722656
> INFO:root:Epoch[67] Batch [1160-1180] Speed: 329.31 samples/sec accuracy=0.705859
> INFO:root:Epoch[67] Batch [1160-1180] Speed: 329.25 samples/sec accuracy=0.712891
> INFO:root:Epoch[67] Batch [1160-1180] Speed: 329.31 samples/sec accuracy=0.723828
> INFO:root:Epoch[67] Batch [1160-1180] Speed: 329.26 samples/sec accuracy=0.717969
> INFO:root:Epoch[67] Batch [1160-1180] Speed: 329.71 samples/sec accuracy=0.716016
> INFO:root:Epoch[67] Batch [1160-1180] Speed: 329.03 samples/sec accuracy=0.722656
> INFO:root:Epoch[67] Batch [1160-1180] Speed: 329.27 samples/sec accuracy=0.716797
> Segmentation fault: 11
> Stack trace returned 8 entries:
> [bt] (0) /home/ubuntu/master/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f7233aacaeb]
> [bt] (1) /home/ubuntu/master/lib/libmxnet.so(+0x3e4d74f) [0x7f7236b9a74f]
> [bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f724a0be4b0]
> [bt] (3) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(horovod::MX::HandleManager::ExecuteCallback(int)+0x19) [0x7f7227ef7009]
> [bt] (4) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x24b2b) [0x7f7227edab2b]
> [bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f7180a6bc80]
> [bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f724a45a6ba]
> [bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f724a19041d]
> Segmentation fault: 11
> Stack trace returned 9 entries:
> [bt] (0) /home/ubuntu/master/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f7233aacaeb]
> [bt] (1) /home/ubuntu/master/lib/libmxnet.so(+0x3e4d74f) [0x7f7236b9a74f]
> [bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f724a0be4b0]
> [bt] (3) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(horovod::MX::HandleManager::ExecuteCallback(int)+0x19) [0x7f7227ef7009]
> [bt] (4) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x259fc) [0x7f7227edb9fc]
> [bt] (5) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x30e6a) [0x7f7227ee6e6a]
> [bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f7180a6bc80]
> [bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f724a45a6ba]
> [bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f724a19041d]
> terminate called after throwing an instance of 'std::system_error'
> what(): Resource deadlock avoided
> [ip-172-31-9-223:33837] *** Process received signal ***
> [ip-172-31-9-223:33837] Signal: Aborted (6)
> [ip-172-31-9-223:33837] Signal code: (-6)
> [ip-172-31-9-223:33837] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f724a464390]
> [ip-172-31-9-223:33837] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f724a0be428]
> [ip-172-31-9-223:33837] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f724a0c002a]
> [ip-172-31-9-223:33837] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x16d)[0x7f7180a4284d]
> [ip-172-31-9-223:33837] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d6b6)[0x7f7180a406b6]
> [ip-172-31-9-223:33837] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c6a9)[0x7f7180a3f6a9]
> [ip-172-31-9-223:33837] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2e5)[0x7f7180a40005]
> [ip-172-31-9-223:33837] [ 7] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0xff83)[0x7f718058af83]
> [ip-172-31-9-223:33837] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0xfb)[0x7f718058b2eb]
> [ip-172-31-9-223:33837] [ 9] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x5c)[0x7f7180a4090c]
> [ip-172-31-9-223:33837] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_system_errori+0x8e)[0x7f7180a697fe]
> [ip-172-31-9-223:33837] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt6thread4joinEv+0x18)[0x7f7180a6bb88]
> [ip-172-31-9-223:33837] [12] /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x243e3)[0x7f7227eda3e3]
> [ip-172-31-9-223:33837] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x39ff8)[0x7f724a0c2ff8]
> [ip-172-31-9-223:33837] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x3a045)[0x7f724a0c3045]
> [ip-172-31-9-223:33837] [15] /home/ubuntu/master/lib/libmxnet.so(+0x3e4d786)[0x7f7236b9a786]
> [ip-172-31-9-223:33837] [16] /lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7f724a0be4b0]
> [ip-172-31-9-223:33837] [17] /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(_ZN7horovod2MX13HandleManager15ExecuteCallbackEi+0x19)[0x7f7227ef7009]
> [ip-172-31-9-223:33837] [18] /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x259fc)[0x7f7227edb9fc]
> [ip-172-31-9-223:33837] [19] /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x30e6a)[0x7f7227ee6e6a]
> [ip-172-31-9-223:33837] [20] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f7180a6bc80]
> [ip-172-31-9-223:33837] [21] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f724a45a6ba]
> [ip-172-31-9-223:33837] [22] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f724a19041d]
> [ip-172-31-9-223:33837] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 5 with PID 0 on node ip-172-31-9-223 exited
> on signal 6 (Aborted).
> --------------------------------------------------------------------------