WeichenXu123 commented on PR #40695: URL: https://github.com/apache/spark/pull/40695#issuecomment-1501269024
On second thought, I propose that `TorchDistributor._run_local_training` support only Spark legacy mode, while `TorchDistributor._run_distributed_training` supports both legacy mode and Spark Connect mode. That is, when running on a Spark local-mode cluster with `TorchDistributor.local_mode=False`, `TorchDistributor._run_distributed_training` is executed. In this case, the current master code does not handle GPU allocation correctly, so you need to fix it: we can broadcast the driver's selected GPU list to all tasks, and each task selects its GPU id via its task rank.
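To make the proposed fix concrete, here is a minimal sketch of the rank-based selection step. It is not the actual PR code: the function name `select_gpu_for_task` is hypothetical, and in a real Spark job the GPU list would be shipped with `SparkContext.broadcast` and the rank obtained from `TaskContext.get().partitionId()`; the sketch only shows the selection logic itself.

```python
def select_gpu_for_task(driver_gpu_list, task_rank):
    """Pick one GPU id for a task based on its rank.

    driver_gpu_list: GPU ids discovered on the driver and broadcast
    to all tasks (e.g. via SparkContext.broadcast in Spark).
    task_rank: this task's rank, e.g. TaskContext.get().partitionId().
    """
    # Round-robin assignment so ranks beyond the GPU count still map
    # to a valid device.
    return driver_gpu_list[task_rank % len(driver_gpu_list)]

# Inside each task, the process would then pin itself to its GPU, e.g.:
#   os.environ["CUDA_VISIBLE_DEVICES"] = select_gpu_for_task(bc.value, rank)
```

Pinning via `CUDA_VISIBLE_DEVICES` before the training process starts keeps each local worker on a distinct device without changes to the user's training code.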
