Any help from your side would be appreciated. Looking forward to your reply. Thanks a lot.
I used TF 2.5 to do distributed training. The hardware setup is 2 machines with 2 GPUs each, i.e. 2m4GPU. The 2 machines are connected by optical fiber, and the network bandwidth supports 10 Gbit/s = 1250 MB/s. The machine specs, GPU type, and memory are all identical between the 2 machines.

Let's compare the following 2 test cases:

1m2GPU:
I used the TF distribution strategy MirroredStrategy to train on 1 machine with 2 GPUs. The training task is ALBERT-base (12 layers) text classification; training time was 1522 seconds. GPU memory usage was 96.88% and GPU utilization was 95.78%.
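For reference, a minimal sketch of the single-machine setup with MirroredStrategy (the model here is a placeholder, not the actual ALBERT classifier):

```python
import tensorflow as tf

# MirroredStrategy picks up all GPUs visible on the local machine by default;
# on a CPU-only machine it falls back to a single replica.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the scope are mirrored across the replicas.
    # Placeholder model standing in for the ALBERT-base classifier.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(2),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```

With this setup, `model.fit` splits each global batch evenly across the local GPUs and all-reduces the gradients after every step.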

2m4GPU:
I used the TF distribution strategy MultiWorkerMirroredStrategy to train on 2 machines with 4 GPUs total. Same ALBERT-base (12 layers) text classification task; training time was 1013 seconds. GPU memory usage was 82.99% and GPU utilization was 71.89%.
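For reference, a minimal sketch of how each of the 2 workers is configured via TF_CONFIG (the host names and ports are placeholders, not the actual cluster addresses):

```python
import json
import os

# Each worker sets TF_CONFIG before creating the strategy. The cluster spec is
# identical on both machines; only task.index differs (0 on the first machine,
# 1 on the second). Host names and ports here are placeholders.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["machine-a:12345", "machine-b:12345"],
    },
    "task": {"type": "worker", "index": 0},
})

# On each worker, creating the strategy reads TF_CONFIG and opens the
# collective channels. This call blocks until all listed workers are
# reachable, so it is shown as a comment here:
# strategy = tf.distribute.MultiWorkerMirroredStrategy()
```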

Comparing 1m2GPU with 2m4GPU, I found that multi-machine multi-GPU training saves only (1522 - 1013) / 1522 = 33.4% of the time, less than the ~50% expected from doubling the GPU count. I also monitored the network bandwidth between the 2 machines and found 152 MB/s used on average.
CPU usage is 170% and memory usage is 2.25%, so neither is the bottleneck.
What methods can I use to improve and accelerate multi-machine multi-GPU distributed training?
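For context, one knob I am aware of but have not yet tried (assuming the TF 2.5 API) is explicitly selecting NCCL for the cross-worker all-reduce, instead of leaving the communication implementation on auto:

```python
import tensorflow as tf

# Untested assumption: forcing NCCL for the cross-worker all-reduce. The
# default AUTO may pick a RING implementation over gRPC, which can be slower
# on GPU clusters.
communication_options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.NCCL
)

# Creating the strategy blocks until all workers in TF_CONFIG are reachable,
# so it is left commented out here:
# strategy = tf.distribute.MultiWorkerMirroredStrategy(
#     communication_options=communication_options
# )
```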

Another question: during multi-machine multi-GPU distributed training, the GPU memory usage is lower than in single-machine multi-GPU training. What is the root cause of this? The GPU utilization also becomes lower.

I also tried some methods to accelerate, such as tuning the input data pipeline (e.g. prefetch / map_and_batch / num_parallel_batches / shuffle / repeat), but the pipeline tweaks did not bring any obvious speedup.
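The pipeline tweaks I tried look roughly like this (the dataset contents, shapes, and batch size are placeholders, not the actual ALBERT inputs):

```python
import tensorflow as tf

BATCH_SIZE = 32  # placeholder, not the actual training batch size

# Placeholder in-memory dataset standing in for the tokenized ALBERT inputs.
features = tf.random.uniform([1024, 128])
labels = tf.random.uniform([1024], maxval=2, dtype=tf.int32)

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=1024)
    .batch(BATCH_SIZE, drop_remainder=True)
    # Placeholder preprocessing step; num_parallel_calls parallelizes the map.
    .map(lambda x, y: (x, y), num_parallel_calls=tf.data.AUTOTUNE)
    # Overlap input preparation with training on the accelerator.
    .prefetch(tf.data.AUTOTUNE)
)
```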
Do you have any good suggestions to accelerate this?
Thanks a lot.





---
[Visit Topic](https://discuss.mxnet.apache.org/t/multi-system-multi-gpu-distributed-training-slower-than-single-system-multi-gpu/1270/6) or reply to this email to respond.
