leleamol commented on issue #9367: Limit the test_nccl to run on 8 GPUs only 
until NCCL2.1 issue is fixed.
URL: https://github.com/apache/incubator-mxnet/pull/9367#issuecomment-357395336
 
 
   @ptrendx Thanks for the explanation. Before sending this code out for review, I had tried to invoke the test_pushpull() function in two batches, viz. range(1..8) using the 0th GPU and range(9..16) using the 15th GPU. However, the test core dumps when it tries to process the second batch.
   Maintaining two scripts for higher GPU counts seems cumbersome if we want to put them in automation.
   We can continue with limiting the test to 8 GPUs only. I will modify the print statement to indicate that it is a hardware limitation.
   Does that sound OK?
   
   
   Here is how I temporarily modified my script to invoke the "for" loop in two batches. It core dumps in the second loop.
   
   ```python
    import mxnet as mx
    import numpy as np

    # shapes, keys and num_gpus come from the surrounding test_nccl.py script
    # (not reproduced here)
    m_gpus = range(1, 1 + num_gpus)

    #@unittest.skip("Test requires NCCL library installed and enabled during build")
   def test_nccl_pushpull(gpus):
       for shape, key in zip(shapes, keys):
           for n_gpus in gpus:
               kv_nccl = mx.kv.create('nccl')
               a = mx.nd.ones(shape, mx.gpu(0))
               cur_key = str(key*max(gpus)+n_gpus)
               kv_nccl.init(cur_key, a)
               arr_list = [mx.nd.ones(shape, mx.gpu(x)) for x in range(n_gpus)]
               res = [mx.nd.zeros(shape, mx.gpu(x)) for x in range(n_gpus)]
               kv_nccl.push(cur_key, arr_list)
               kv_nccl.pull(cur_key, res)
               for x in range(n_gpus):
                   assert(np.sum(np.abs((res[x]-n_gpus).asnumpy()))==0)
       print("First Passed")
        gpus = range(9, 17)  # second batch covers 9..16 GPUs (range(9, 16) stopped at 15)
       for shape, key in zip(shapes, keys):
           for n_gpus in gpus:
               kv_nccl = mx.kv.create('nccl')
               a = mx.nd.ones(shape, mx.gpu(15))
               cur_key = str(key*max(gpus)+n_gpus)
               kv_nccl.init(cur_key, a)
               arr_list = [mx.nd.ones(shape, mx.gpu(x)) for x in range(n_gpus)]
               res = [mx.nd.zeros(shape, mx.gpu(x)) for x in range(n_gpus)]
               kv_nccl.push(cur_key, arr_list)
               kv_nccl.pull(cur_key, res)
               for x in range(n_gpus):
                   assert(np.sum(np.abs((res[x]-n_gpus).asnumpy()))==0)
        print("Passed")
   
   if __name__ == '__main__':
       test_nccl_pushpull(m_gpus)
   ```
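Rather than maintaining two near-identical loops (or two scripts), the batching itself could be factored out into one parameterized helper. Below is a minimal pure-Python sketch of that idea; the `gpu_batches` name and the batch size of 8 are assumptions, and this does not address the underlying NCCL crash.

```python
# Hypothetical helper (name is an assumption): split the GPU counts into
# fixed-size batches so a single parameterized script covers any GPU count,
# instead of duplicating the push/pull loop for each range.
def gpu_batches(num_gpus, batch_size=8):
    """Yield ranges of GPU counts, at most batch_size values per batch."""
    start = 1
    while start <= num_gpus:
        end = min(start + batch_size - 1, num_gpus)
        yield range(start, end + 1)
        start = end + 1

# For 16 GPUs this yields range(1, 9) and range(9, 17), matching the two
# batches described above.
batches = [list(b) for b in gpu_batches(16)]
```

The test body would then loop over `gpu_batches(num_gpus)` once instead of hard-coding the second range.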
