mathewjacob1002 opened a new pull request, #41770:
URL: https://github.com/apache/spark/pull/41770

   ### What changes were proposed in this pull request?
   Adds a distributed learning class for DeepSpeed workloads that launches 
training through the torch.distributed.run command. Also adds tests for some of 
its functions. Tests for the distributed workloads in 
create_torchrun_command still need to be added. 
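
For readers unfamiliar with the launcher, here is a minimal sketch of how a torch.distributed.run command line might be assembled. It is not the code in this PR: `build_torchrun_command` and its parameters are hypothetical names chosen for illustration, and only the single-node case is handled (multi-node rendezvous options are left out).

```python
import sys
from typing import List, Sequence


def build_torchrun_command(
    train_script: str,
    script_args: Sequence[str] = (),
    nproc_per_node: int = 1,
    nnodes: int = 1,
) -> List[str]:
    """Hypothetical helper: assemble a torch.distributed.run invocation."""
    if nproc_per_node < 1 or nnodes < 1:
        raise ValueError("nproc_per_node and nnodes must be positive")
    # Run the launcher as a module of the current interpreter.
    cmd = [sys.executable, "-m", "torch.distributed.run"]
    if nnodes == 1:
        # --standalone sets up a local rendezvous, so no master address is needed.
        cmd.append("--standalone")
    cmd += [f"--nnodes={nnodes}", f"--nproc_per_node={nproc_per_node}"]
    # Everything after the script path is passed through to the training script.
    cmd.append(train_script)
    cmd.extend(str(arg) for arg in script_args)
    return cmd
```

Returning the command as a list of arguments rather than a single string keeps it safe to hand to `subprocess` without shell quoting.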
   
   
   ### Why are the changes needed?
   DeepSpeed workloads need special launch commands. This class makes it 
easier to run DeepSpeed applications without touching the terminal, and it 
gives users who need the torch.distributed.run launcher a way to use it. Its 
API and workflow closely mirror the TorchDistributor class: create an instance, 
then invoke distributor.run(...).
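
The create-then-run workflow described above can be sketched as follows. `SketchDistributor`, its constructor arguments, and the injectable `launcher` are all hypothetical stand-ins for illustration; the class added by this PR may have a different name and signature.

```python
import subprocess
import sys
from typing import Callable, List, Optional


class SketchDistributor:
    """Hypothetical stand-in for the distributor; not the class added by this PR."""

    def __init__(
        self,
        num_processes: int = 1,
        launcher: Optional[Callable[[List[str]], int]] = None,
    ) -> None:
        self.num_processes = num_processes
        # By default, execute the command and return its exit code.
        self._launcher = launcher or (
            lambda cmd: subprocess.run(cmd, check=True).returncode
        )

    def run(self, train_script: str, *args: str) -> int:
        # Single-node launch through the torch.distributed.run module.
        cmd = [
            sys.executable, "-m", "torch.distributed.run",
            "--standalone", "--nnodes=1",
            f"--nproc_per_node={self.num_processes}",
            train_script, *args,
        ]
        return self._launcher(cmd)
```

A caller would then write something like `SketchDistributor(num_processes=2).run("train.py", "--lr", "0.1")`, which requires PyTorch to be installed. Passing a custom `launcher` allows the command to be inspected without starting any processes.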
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

