WeichenXu123 commented on PR #40695: URL: https://github.com/apache/spark/pull/40695#issuecomment-1501315354
> yes, I feel it is non-trivial to execute pytorch code on the server side, since we need to launch a new Python process on the server side and then communicate with it.

I think it does not require too much work: we can reuse the code of `TorchDistributor._run_distributed_training`. We just need to fix one issue: the current master code does not handle GPU allocation correctly. We can broadcast the selected driver GPU list to all tasks, and each task selects its GPU id via its task rank.
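A minimal sketch of the rank-based GPU selection described above. This is hypothetical illustration code, not the actual `TorchDistributor` implementation: the function name `select_gpu_for_task` and the simulated ranks are assumptions, and in a real Spark job the driver GPU list would be shipped via a broadcast variable rather than a plain Python list.

```python
import os

def select_gpu_for_task(driver_gpu_list, task_rank):
    """Pick the GPU a task should use, indexed by its task rank.

    Hypothetical helper: the driver's GPU list is broadcast to all
    tasks, and each task maps its rank to one entry of that list.
    """
    gpu_id = driver_gpu_list[task_rank % len(driver_gpu_list)]
    # Restrict this task's PyTorch process to the selected device.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return gpu_id

# Simulated run: a driver that selected GPUs [0, 1, 2, 3], four tasks.
driver_gpus = [0, 1, 2, 3]
assignments = {rank: select_gpu_for_task(driver_gpus, rank) for rank in range(4)}
print(assignments)  # each rank gets a distinct GPU id
```

Setting `CUDA_VISIBLE_DEVICES` before the PyTorch process initializes CUDA is what makes each task see only its assigned device, which avoids the allocation conflict mentioned above.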
