solin319 commented on issue #8097: speed problem in distribute training
URL: https://github.com/apache/incubator-mxnet/issues/8097#issuecomment-333274173

Training vgg16 on two distributed machines (8 GPUs in total):

    python ../../tools/launch.py -n 2 --launcher ssh -H hosts `which python` train_imagenet.py \
      --data-train=/data/ILSVRC2012_img_train.rec \
      --data-val=/data/ILSVRC2012_img_val.rec \
      --network=vgg \
      --num-layers=16 \
      --kv-store=dist_sync \
      --gpus=0,1,2,3 \
      --disp-batch=10 \
      --top-k=5 \
      --batch-size=128 \
      --dtype=float32

a. Result after removing WaitToWrite:

    INFO:root:Epoch[0] Batch [10] Speed: 93.98 samples/sec accuracy=0.001420 top_k_accuracy_5=0.003551
    INFO:root:Epoch[0] Batch [10] Speed: 89.04 samples/sec accuracy=0.000000 top_k_accuracy_5=0.004261
    INFO:root:Epoch[0] Batch [20] Speed: 98.28 samples/sec accuracy=0.001563 top_k_accuracy_5=0.005469
    INFO:root:Epoch[0] Batch [20] Speed: 96.96 samples/sec accuracy=0.000781 top_k_accuracy_5=0.003125
    INFO:root:Epoch[0] Batch [30] Speed: 96.77 samples/sec accuracy=0.001563 top_k_accuracy_5=0.006250
    INFO:root:Epoch[0] Batch [30] Speed: 96.12 samples/sec accuracy=0.001563 top_k_accuracy_5=0.002344
    INFO:root:Epoch[0] Batch [40] Speed: 95.42 samples/sec accuracy=0.001563 top_k_accuracy_5=0.006250
    INFO:root:Epoch[0] Batch [40] Speed: 96.40 samples/sec accuracy=0.001563 top_k_accuracy_5=0.005469
    INFO:root:Epoch[0] Batch [50] Speed: 98.69 samples/sec accuracy=0.000000 top_k_accuracy_5=0.006250
    INFO:root:Epoch[0] Batch [50] Speed: 98.15 samples/sec accuracy=0.000781 top_k_accuracy_5=0.005469
    INFO:root:Epoch[0] Batch [60] Speed: 94.75 samples/sec accuracy=0.001563 top_k_accuracy_5=0.004687
    INFO:root:Epoch[0] Batch [60] Speed: 94.82 samples/sec accuracy=0.000000 top_k_accuracy_5=0.003906

profile: [remove.zip](https://github.com/apache/incubator-mxnet/files/1346028/remove.zip)

b. Original result:

    INFO:root:Epoch[0] Batch [10] Speed: 76.15 samples/sec accuracy=0.000000 top_k_accuracy_5=0.004261
    INFO:root:Epoch[0] Batch [10] Speed: 80.05 samples/sec accuracy=0.001420 top_k_accuracy_5=0.003551
    INFO:root:Epoch[0] Batch [20] Speed: 81.68 samples/sec accuracy=0.001563 top_k_accuracy_5=0.005469
    INFO:root:Epoch[0] Batch [20] Speed: 81.62 samples/sec accuracy=0.000781 top_k_accuracy_5=0.003125
    INFO:root:Epoch[0] Batch [30] Speed: 82.43 samples/sec accuracy=0.001563 top_k_accuracy_5=0.006250
    INFO:root:Epoch[0] Batch [30] Speed: 82.28 samples/sec accuracy=0.001563 top_k_accuracy_5=0.002344
    INFO:root:Epoch[0] Batch [40] Speed: 81.18 samples/sec accuracy=0.001563 top_k_accuracy_5=0.005469
    INFO:root:Epoch[0] Batch [40] Speed: 80.52 samples/sec accuracy=0.001563 top_k_accuracy_5=0.006250
    INFO:root:Epoch[0] Batch [50] Speed: 80.65 samples/sec accuracy=0.000781 top_k_accuracy_5=0.005469
    INFO:root:Epoch[0] Batch [50] Speed: 80.63 samples/sec accuracy=0.000000 top_k_accuracy_5=0.006250
    INFO:root:Epoch[0] Batch [60] Speed: 81.04 samples/sec accuracy=0.000000 top_k_accuracy_5=0.003906
    INFO:root:Epoch[0] Batch [60] Speed: 80.81 samples/sec accuracy=0.001563 top_k_accuracy_5=0.004687

profile: [origin.zip](https://github.com/apache/incubator-mxnet/files/1346025/origin.zip)

The profiles can be opened in chrome://tracing.

@eric-haibin-lin
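To compare the two runs with a single number, the `Speed` values in each log can be averaged. The script below is only a rough sketch of that: the helper name `mean_speed` and the variable names are made up for illustration and are not part of MXNet, and the sample strings contain just the first two entries of each log. Each batch index appears twice in the logs, presumably once per worker, so the average is over per-worker speeds.

```python
import re
from statistics import mean

# Matches the throughput field in lines such as
# "INFO:root:Epoch[0] Batch [10] Speed: 93.98 samples/sec accuracy=..."
SPEED_RE = re.compile(r"Speed:\s*([0-9]+(?:\.[0-9]+)?)\s*samples/sec")


def mean_speed(log_text):
    """Return the average samples/sec over all Speed entries in log_text."""
    speeds = [float(s) for s in SPEED_RE.findall(log_text)]
    if not speeds:
        raise ValueError("no 'Speed: ... samples/sec' entries found")
    return mean(speeds)


# First two entries of each run, used here only as sample input.
removed_log = """
INFO:root:Epoch[0] Batch [10] Speed: 93.98 samples/sec accuracy=0.001420 top_k_accuracy_5=0.003551
INFO:root:Epoch[0] Batch [10] Speed: 89.04 samples/sec accuracy=0.000000 top_k_accuracy_5=0.004261
"""
origin_log = """
INFO:root:Epoch[0] Batch [10] Speed: 76.15 samples/sec accuracy=0.000000 top_k_accuracy_5=0.004261
INFO:root:Epoch[0] Batch [10] Speed: 80.05 samples/sec accuracy=0.001420 top_k_accuracy_5=0.003551
"""

removed = mean_speed(removed_log)
origin = mean_speed(origin_log)
print(f"removed: {removed:.2f}  origin: {origin:.2f}  ratio: {removed / origin:.3f}")
```

Pasting the full log of each run in place of the sample strings gives the per-run averages and their ratio.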
