WeichenXu123 commented on PR #40695: URL: https://github.com/apache/spark/pull/40695#issuecomment-1501315354
> yes, I feel it is non-trivial to execute pytorch code on the server side, since we need to launch a new Python process on the server side and then communicate with it.

I think it does not require too much work: we can reuse the code of `TorchDistributor._run_distributed_training`. We just need to fix one issue: the current master code does not handle GPU allocation correctly. We can broadcast the selected driver GPU list to all tasks, and each task selects its GPU id via its task rank.
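A minimal sketch of the rank-based GPU selection described above. This is hypothetical illustration code, not the actual `TorchDistributor` implementation: the function name `select_gpu_for_task` and the simulated ranks are assumptions, and in a real Spark job the driver GPU list would be shipped via a broadcast variable rather than a plain Python list.

```python
import os

def select_gpu_for_task(driver_gpu_list, task_rank):
    """Pick the GPU a task should use, indexed by its task rank.

    Hypothetical helper: the driver's GPU list is broadcast to all
    tasks, and each task maps its rank to one entry of that list.
    """
    gpu_id = driver_gpu_list[task_rank % len(driver_gpu_list)]
    # Restrict this task's PyTorch process to the selected device.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return gpu_id

# Simulated run: a driver that selected GPUs [0, 1, 2, 3], four tasks.
driver_gpus = [0, 1, 2, 3]
assignments = {rank: select_gpu_for_task(driver_gpus, rank) for rank in range(4)}
print(assignments)  # each rank gets a distinct GPU id
```

Setting `CUDA_VISIBLE_DEVICES` before the PyTorch process initializes CUDA is what makes each task see only its assigned device, which avoids the allocation conflict mentioned above.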
