[GitHub] eric-haibin-lin commented on a change in pull request #13852: Fix Tree Reduction on new instance type p3dn.24xlarge

GitBox Fri, 11 Jan 2019 13:06:45 -0800

eric-haibin-lin commented on a change in pull request #13852: Fix Tree 
Reduction on new instance type p3dn.24xlarge
URL: https://github.com/apache/incubator-mxnet/pull/13852#discussion_r247253405


 ##########
 File path: src/kvstore/gpu_topology.h
 ##########
 @@ -161,11 +166,59 @@ inline void GetP2PWeight(const std::vector<Context>& 
devs, std::vector<T>* matri
     }
   }
 
-  // Check that all GPUs have at least 1 NVLink connection
-  int max_value = 0;
-  for (unsigned int i = 0; i < max.size(); ++i) {
-    if (max[i] > max_value)
-      max_value = max[i];
+  // Check that all P2P connections are detected by GetP2PAttribute
+  // If yes, then continue as before
+  // If not, then treat fallback to using p2p_matrix (from EnableP2P)
+  //
+  // We have observed that with CUDA 9.0 p3.16xlarge:
+  //
+  //   0 2 2 3 3 1 1 1    . v v v v . . .
+  //   2 0 3 2 1 3 1 1    v . v v . v . .
+  //   2 3 0 3 1 1 2 1    v v . v . . v .
+  //   3 2 3 0 1 1 1 2    v v v . . . . v
+  //   3 1 1 1 0 2 2 3    v . . . . v v v
+  //   1 3 1 1 2 0 3 2    . v . . v . v v
+  //   1 1 2 1 2 3 0 3    . . v . v v . v
+  //   1 1 1 2 3 2 3 0    . . . v v v v .
+  //
+  //        matrix           p2p_matrix
+  //
+  // Here, they are correctly detected, because the 2s and 3s correspond to
+  // links that have P2P connections between them. However for CUDA 9.2 
p3.16xlarge:
+  //
+  //   0 2 2 1 1 1 1 1    . v v v v . . .
+  //   2 0 1 2 1 1 1 1    v . v v . v . .
+  //   2 1 0 1 1 1 2 1    v v . v . . v .
+  //   1 2 1 0 1 1 1 2    v v v . . . . v
+  //   1 1 1 1 0 2 2 1    v . . . . v v v
+  //   1 1 1 1 2 0 1 2    . v . . v . v v
+  //   1 1 2 1 2 1 0 1    . . v . v v . v
+  //   1 1 1 2 1 2 1 0    . . . v v v v .
+  //
+  //        matrix          p2p_matrix
+  //
+  // The fastest connections (3 - double NVLink) are not recognized as being 
any
+  if (kLogTree) {
+    PrintMatrix("matrix", *matrix, num_gpus, num_gpus);
+    PrintMatrix("p2p_matrix", p2p_matrix, num_gpus, num_gpus);
+  }
+
+  // different from (1 - non-P2P PCI-E). This is why we fallback to p2p_matrix.
+  bool matrix_correct = true;
+  for (unsigned i = 0; i < p2p_matrix.size(); ++i) {
+    if (p2p_matrix[i] > 0 && (*matrix)[i] == 1) {
+      matrix_correct = false;
+      break;
+    }
+  }
+
+  if (!matrix_correct) {
 
 Review comment:
   Could you add a warning message that fallback happened? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

[GitHub] eric-haibin-lin commented on a change in pull request #13852: Fix Tree Reduction on new instance type p3dn.24xlarge

Reply via email to