mathewjacob1002 opened a new pull request, #41778: URL: https://github.com/apache/spark/pull/41778
### What changes were proposed in this pull request?

This PR adds the new class `DeepspeedDistributor`, which uses the DeepSpeed launcher to run distributed (and local) DeepSpeed applications. This PR focuses on basic boilerplate and infrastructure. Tasks accomplished:

1. Creating a valid SSH key.
2. Collecting the IPs of all worker nodes in the cluster.
3. Adding the SSH public keys for those IPs to the `.ssh/known_hosts` file, which allows DeepSpeed to use SSH to coordinate with the worker nodes.

### Testing

The notebook [here](https://e2-dogfood.staging.cloud.databricks.com/?o=6051921418418893#notebook/4159583338046450/command/4159583338046457) was used to test that the current setup works. It checks that SSH keys were created for the driver and that the worker SSH public keys were added to `known_hosts`. It also tests the cleanup function that will be used internally.

### Next steps

The next step in the project is to determine how many GPUs to use on each worker node, given the total number of GPUs the user wants to use.
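For reviewers who want a concrete picture of the three setup tasks, here is a minimal sketch of how they could fit together. It is not the code in this PR: the function names (`keygen_command`, `keyscan_command`, `append_known_hosts`, `register_workers`) are hypothetical, and the use of `ssh-keygen` and `ssh-keyscan` is an assumption about the tooling. Only the overall flow (create a key, collect worker IPs, add worker host keys to `known_hosts`) comes from the description above.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def keygen_command(key_path: Path) -> List[str]:
    """Build an ssh-keygen call for a passphrase-less RSA key pair (assumed tooling)."""
    return ["ssh-keygen", "-q", "-t", "rsa", "-N", "", "-f", str(key_path)]


def keyscan_command(worker_ips: Iterable[str]) -> List[str]:
    """Build an ssh-keyscan call that prints the host keys of the given workers."""
    return ["ssh-keyscan", *worker_ips]


def append_known_hosts(known_hosts: Path, scanned_lines: Iterable[str]) -> int:
    """Append host-key lines that are not already present; return how many were added."""
    known_hosts.parent.mkdir(parents=True, exist_ok=True)
    text = known_hosts.read_text() if known_hosts.exists() else ""
    existing = set(text.splitlines())
    new_lines = []
    for line in scanned_lines:
        line = line.strip()
        # Skip blanks, comment lines, and entries that are already recorded.
        if not line or line.startswith("#") or line in existing:
            continue
        existing.add(line)
        new_lines.append(line)
    if new_lines:
        # Avoid gluing the first new entry onto an unterminated last line.
        prefix = "" if not text or text.endswith("\n") else "\n"
        with known_hosts.open("a") as fh:
            fh.write(prefix + "\n".join(new_lines) + "\n")
    return len(new_lines)


def register_workers(worker_ips: List[str], ssh_dir: Path) -> int:
    """Hypothetical driver-side entry point tying the three tasks together."""
    key_path = ssh_dir / "id_rsa"
    if not key_path.exists():
        subprocess.run(keygen_command(key_path), check=True)
    scan = subprocess.run(
        keyscan_command(worker_ips), check=True, capture_output=True, text=True
    )
    return append_known_hosts(ssh_dir / "known_hosts", scan.stdout.splitlines())
```

Keeping the command construction and the `known_hosts` merge separate from the `subprocess` calls means both can be checked without network access or a running cluster.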

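As a starting point for the GPU-assignment question raised under next steps, one simple option is to fill workers greedily until the requested total is reached. The sketch below is only an illustration of that idea, not a decision made in this PR: `assign_gpus` and `hostfile_lines` are hypothetical helpers, and the per-worker GPU counts are assumed to be known in advance. The `host slots=N` output follows the DeepSpeed hostfile format.

```python
from typing import Dict, List


def assign_gpus(total_gpus: int, gpus_per_worker: Dict[str, int]) -> Dict[str, int]:
    """Greedily assign the requested GPUs to workers in iteration order."""
    if total_gpus < 0:
        raise ValueError("total_gpus must be non-negative")
    assignment: Dict[str, int] = {}
    remaining = total_gpus
    for host, available in gpus_per_worker.items():
        if remaining == 0:
            break
        take = min(available, remaining)
        if take > 0:
            assignment[host] = take
            remaining -= take
    if remaining > 0:
        raise ValueError(f"cluster is short by {remaining} GPU(s)")
    return assignment


def hostfile_lines(assignment: Dict[str, int]) -> List[str]:
    """Render the assignment as DeepSpeed hostfile entries."""
    return [f"{host} slots={slots}" for host, slots in assignment.items()]


if __name__ == "__main__":
    plan = assign_gpus(6, {"10.0.0.1": 4, "10.0.0.2": 4})
    print("\n".join(hostfile_lines(plan)))
```

A greedy fill uses as few nodes as possible, which tends to reduce cross-node communication; an even split across all workers is the obvious alternative if balanced load matters more.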