rahul003 commented on issue #8373: distribute training in fp16
URL: https://github.com/apache/incubator-mxnet/pull/8373#issuecomment-365386805
 
 
   @solin319 Which machines did you run the above numbers on? Let us try to 
come up with an easier interface for this so we can use it on the latest 
NVIDIA GPUs. 
   
   Regarding "I think merge the logic in '_init_kvstore_server_module' to the 
function 'kvstore.create' may be a better way to start server and worker.": 
   This would mean that the server is created only when the code calls 
kvstore.create(). As a result, the server process would first run everything 
in the training script that comes before the creation of the kvstore, which 
could allocate memory for data, the model, etc. Alternatively, we would have 
to instruct users to create the kvstore server first (at which point the 
server process goes into a loop, so no other code runs), but this seems hacky 
for an official way of doing things.
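
   To make the ordering concern concrete, here is a toy model in plain Python. 
It is not MXNet code: `run_script`, the step names, and the `start_server_at` 
switch are all hypothetical, and the role string is just a stand-in for however 
a process learns that it is a server. It only records which steps a process 
would execute under each startup policy.

```python
def run_script(role, start_server_at):
    """Return the steps one process executes in a toy training script.

    role:            "worker" or "server" (hypothetical stand-in for the
                     process role in a distributed job).
    start_server_at: "import" if the server loop starts when the library is
                     imported, "create" if it starts inside kvstore creation.
    """
    steps = []

    def maybe_serve(stage):
        # A non-worker process enters its serving loop at the chosen stage
        # and never returns to the rest of the script.
        if role != "worker" and start_server_at == stage:
            steps.append("serve_until_shutdown")
            return True
        return False

    steps.append("import_library")
    if maybe_serve("import"):
        return steps

    # User code that runs before the kvstore exists.
    steps.append("load_data")
    steps.append("build_model")

    steps.append("create_kvstore")
    if maybe_serve("create"):
        return steps

    steps.append("train")
    return steps


if __name__ == "__main__":
    for policy in ("import", "create"):
        print(policy, "->", run_script("server", policy))
```

   Under the "create" policy the server process runs `load_data` and 
`build_model` before it starts serving, which is the memory allocation 
mentioned above. Under the "import" policy it skips them.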
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
