WeichenXu123 commented on PR #40695:
URL: https://github.com/apache/spark/pull/40695#issuecomment-1501269024

   On second thought,
   
   I propose that `TorchDistributor._run_local_training` support only Spark legacy mode,
   
   but that `TorchDistributor._run_distributed_training` support both legacy mode and Spark Connect mode. That is, when running on a Spark local-mode cluster with `TorchDistributor.local_mode=False`, `TorchDistributor._run_distributed_training` is executed. In this case the current master code does not handle GPU allocation correctly, and you need to fix it (we can broadcast the selected driver GPU list to all tasks and have each task select its GPU id via its task rank).
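
   A minimal sketch of the rank-based selection step (the helper name and wrap-around policy are my assumptions, not the actual TorchDistributor API; in Spark the GPU list would come from the driver's resources and be shipped to tasks via `sc.broadcast(...)`):

   ```python
   # Hypothetical sketch: the driver discovers its visible GPUs, then each
   # task picks one GPU deterministically from the broadcast list via its rank.

   def select_gpu_for_task(driver_gpu_list, task_rank):
       """Pick a GPU id for this task from the driver's GPU list.

       Tasks wrap around when there are more tasks than GPUs (assumption:
       oversubscription is acceptable on a local-mode cluster).
       """
       if not driver_gpu_list:
           raise RuntimeError("no GPUs were allocated on the driver")
       return driver_gpu_list[task_rank % len(driver_gpu_list)]


   # On a real cluster this list would be e.g.
   # spark.sparkContext.resources["gpu"].addresses, broadcast to all tasks;
   # here a plain list stands in for illustration.
   driver_gpus = ["0", "1"]
   assignments = {rank: select_gpu_for_task(driver_gpus, rank) for rank in range(4)}
   # rank 0 -> "0", rank 1 -> "1", rank 2 -> "0", rank 3 -> "1"
   ```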


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
