solin319 commented on issue #8097: speed problem in distribute training
URL: 
https://github.com/apache/incubator-mxnet/issues/8097#issuecomment-333274173
 
 
   Training VGG-16 on two distributed machines (8 GPUs in total, 4 per machine).
   
   python ../../tools/launch.py -n 2 --launcher ssh -H hosts `which python` train_imagenet.py \
   --data-train=/data/ILSVRC2012_img_train.rec \
   --data-val=/data/ILSVRC2012_img_val.rec \
   --network=vgg \
   --num-layers=16 \
   --kv-store=dist_sync \
   --gpus=0,1,2,3 \
   --disp-batch=10 \
   --top-k=5 \
   --batch-size=128 \
   --dtype=float32 \
   
   a. The result after removing WaitToWrite:
   
   INFO:root:Epoch[0] Batch [10]        Speed: 93.98 samples/sec        accuracy=0.001420       top_k_accuracy_5=0.003551
   INFO:root:Epoch[0] Batch [10]        Speed: 89.04 samples/sec        accuracy=0.000000       top_k_accuracy_5=0.004261
   INFO:root:Epoch[0] Batch [20]        Speed: 98.28 samples/sec        accuracy=0.001563       top_k_accuracy_5=0.005469
   INFO:root:Epoch[0] Batch [20]        Speed: 96.96 samples/sec        accuracy=0.000781       top_k_accuracy_5=0.003125
   INFO:root:Epoch[0] Batch [30]        Speed: 96.77 samples/sec        accuracy=0.001563       top_k_accuracy_5=0.006250
   INFO:root:Epoch[0] Batch [30]        Speed: 96.12 samples/sec        accuracy=0.001563       top_k_accuracy_5=0.002344
   INFO:root:Epoch[0] Batch [40]        Speed: 95.42 samples/sec        accuracy=0.001563       top_k_accuracy_5=0.006250
   INFO:root:Epoch[0] Batch [40]        Speed: 96.40 samples/sec        accuracy=0.001563       top_k_accuracy_5=0.005469
   INFO:root:Epoch[0] Batch [50]        Speed: 98.69 samples/sec        accuracy=0.000000       top_k_accuracy_5=0.006250
   INFO:root:Epoch[0] Batch [50]        Speed: 98.15 samples/sec        accuracy=0.000781       top_k_accuracy_5=0.005469
   INFO:root:Epoch[0] Batch [60]        Speed: 94.75 samples/sec        accuracy=0.001563       top_k_accuracy_5=0.004687
   INFO:root:Epoch[0] Batch [60]        Speed: 94.82 samples/sec        accuracy=0.000000       top_k_accuracy_5=0.003906
   
   profile:
   
[remove.zip](https://github.com/apache/incubator-mxnet/files/1346028/remove.zip)
   
   b. The original result:

   INFO:root:Epoch[0] Batch [10]        Speed: 76.15 samples/sec        accuracy=0.000000       top_k_accuracy_5=0.004261
   INFO:root:Epoch[0] Batch [10]        Speed: 80.05 samples/sec        accuracy=0.001420       top_k_accuracy_5=0.003551
   INFO:root:Epoch[0] Batch [20]        Speed: 81.68 samples/sec        accuracy=0.001563       top_k_accuracy_5=0.005469
   INFO:root:Epoch[0] Batch [20]        Speed: 81.62 samples/sec        accuracy=0.000781       top_k_accuracy_5=0.003125
   INFO:root:Epoch[0] Batch [30]        Speed: 82.43 samples/sec        accuracy=0.001563       top_k_accuracy_5=0.006250
   INFO:root:Epoch[0] Batch [30]        Speed: 82.28 samples/sec        accuracy=0.001563       top_k_accuracy_5=0.002344
   INFO:root:Epoch[0] Batch [40]        Speed: 81.18 samples/sec        accuracy=0.001563       top_k_accuracy_5=0.005469
   INFO:root:Epoch[0] Batch [40]        Speed: 80.52 samples/sec        accuracy=0.001563       top_k_accuracy_5=0.006250
   INFO:root:Epoch[0] Batch [50]        Speed: 80.65 samples/sec        accuracy=0.000781       top_k_accuracy_5=0.005469
   INFO:root:Epoch[0] Batch [50]        Speed: 80.63 samples/sec        accuracy=0.000000       top_k_accuracy_5=0.006250
   INFO:root:Epoch[0] Batch [60]        Speed: 81.04 samples/sec        accuracy=0.000000       top_k_accuracy_5=0.003906
   INFO:root:Epoch[0] Batch [60]        Speed: 80.81 samples/sec        accuracy=0.001563       top_k_accuracy_5=0.004687
   
   profile:
   
[origin.zip](https://github.com/apache/incubator-mxnet/files/1346025/origin.zip)
   
   Both profiles can be opened in chrome://tracing.
   @eric-haibin-lin 
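
   To put a number on the difference between runs a and b, the logged speeds can be averaged and compared. The snippet below is only a rough sketch: the speed values are copied from the two logs in this comment, the helper `parse_speeds` and its regex are illustrative names and not part of MXNet, and it assumes each log line reports the throughput of one worker (two lines per batch index, one per machine).

```python
import re
from statistics import mean

# Matches the "Speed: <float> samples/sec" field of the training log lines.
SPEED_RE = re.compile(r"Speed:\s*([0-9.]+)\s*samples/sec")


def parse_speeds(log_text):
    """Return every 'Speed: X samples/sec' value found in log_text as floats."""
    return [float(value) for value in SPEED_RE.findall(log_text)]


# Speeds copied from the logs above (lines from both workers, in log order).
removed_waittowrite = [93.98, 89.04, 98.28, 96.96, 96.77, 96.12,
                       95.42, 96.40, 98.69, 98.15, 94.75, 94.82]
original = [76.15, 80.05, 81.68, 81.62, 82.43, 82.28,
            81.18, 80.52, 80.65, 80.63, 81.04, 80.81]

mean_removed = mean(removed_waittowrite)
mean_original = mean(original)
speedup = mean_removed / mean_original

print(f"mean speed without WaitToWrite: {mean_removed:.2f} samples/sec")
print(f"mean speed, original:           {mean_original:.2f} samples/sec")
print(f"relative speedup:               {speedup:.3f}x")
```

   Over these first 60 batches this gives a mean of roughly 96 samples/sec per log line without WaitToWrite versus roughly 81 samples/sec for the original, i.e. somewhere near a 19% improvement. `parse_speeds` can be applied to a full log file instead of the hand-copied lists.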
 